Introduction

European cross-national comparative surveys have proliferated over the last few decades and have displayed substantial improvements in methodological quality. The most notable progress came with the establishment of the European Social Survey (ESS 2002), which explicitly emphasised the standardisation of the survey process to boost comparability (Fitzgerald and Jowell 2010). The ESS joined already well-established comparative surveys, such as the Eurobarometer (EB, since 1974), the European Values Study (EVS, since 1981) and the International Social Survey Programme (ISSP, since 1985), together with other new high-quality projects, e.g., the European Quality of Life Survey (EQLS, since 2003) and the Survey of Health, Ageing and Retirement in Europe (SHARE, since 2004). This assortment of independent cross-national survey projects opened promising new avenues for inquiry regarding comparisons of survey results and outcomes across different projects. Existing cross-project comparisons tend to focus on methodological issues, e.g., nonresponse (Groves and Peytcheva 2008; Peytcheva and Groves 2009), data quality (Slomczynski et al. 2017), or survey documentation standards (Kołczyńska and Schoene 2018). In empirical studies, on the other hand, the prevailing norm is for data from a particular project to be analysed separately from other projects, whose results feature at most as descriptive reference points in the introduction or the discussion section (e.g., Heath et al. 2009). Such an approach restricts the potential benefits of the abundance of cross-national comparative projects. Combining data from different survey projects might make it feasible to integrate information, mitigating time-series or country-coverage lapses in any one project (Kołczyńska 2020; Singh 2020).

The general potential of cross-project data integration has long been recognised—not only regarding the European surveys (Dubrow and Tomescu-Dubrow 2016). Recent advances in survey-data recycling engender great hopes for attempting an integration of viewpoints established by different projects; this regards, in particular, ex-post data harmonisation projects, which employ a wide array of procedures that integrate diverse data sets into one meta-base containing a uniform set of variables. Ex-post harmonisation (henceforth, harmonisation) aims to merge data from different survey projects into a unified analysis-ready dataset, even if the original data were not intended for recycling (Granda et al. 2010). While complete cross-project data harmonisation is typically only possible for socio-demographic variables (Hoffmeyer-Zlotnik and Wolf 2003), it remains partially possible for narrow and specific issues, e.g., corruption (Wysmułek 2019), democratic values and protest behaviours (Słomczyński et al. 2016), European integration (Jabkowski and Cichocki 2022), general social trust (Bekkers et al. 2018), health lifestyles (VanHeuvelen and VanHeuvelen 2021) or national identities and religion (Bechert et al. 2020). In contrast to socio-demographic characteristics, cross-project harmonisation of substantive survey questions faces acute measurement problems. Even when surveys include the same topics, their questionnaire items tend to differ significantly, impacting their implied measurement quality. Therefore, any such contrasts lead to reasonable suspicions regarding the equivalence of harmonised results.

Any harmonisation has to contend with many significant contrasts between survey projects in terms of measurement and representation errors (Kołczyńska 2022). Our analysis focuses on the challenges to harmonisation resulting from differences in the length of the underlying response scale. Based on a comparison of country-level means from two different survey projects (one using a 2-point and the other an 11-point scale), we take issue with the procedure of linear stretching, which remains common in harmonisations of response scales with different underlying lengths (e.g., Slomczynski et al. 2017; Bechert et al. 2020; Huijsmans et al. 2019; Bekkers et al. 2018; Klassen 2018). We argue that stretching different response scales into a common range of harmonised values may give a semblance of comparability; yet, the resulting mean aggregates retain the response-scale effect, which proves especially prominent when harmonising scales of markedly different lengths. Furthermore, our analysis considers the impact of a “don’t know” answer as an explicit response option. This impact, rarely taken into account in ex-post harmonisation projects, proves markedly weaker than that of the response-scale length, but it still constitutes a confounding influence on the cross-project comparability of survey results.

Our analysis is based on a case study contrasting the estimates of trust in the European Parliament (EP trust, hereafter) by two leading European survey projects: the EB and the ESS. The topic of EP trust has received extensive analytical coverage based on survey results of both the EB (e.g., Anderson 1998; Rohrschneider 2002; Harteveld et al. 2013; Torcal and Christmann 2019) and the ESS (e.g., Muñoz et al. 2011; Dotti Sani and B. 2016; Schoene 2019; Grönlund and Setälä 2007). The relevant question item is present in every Standard EB as well as in the core module of the ESS, with both projects maintaining the same structure of the item over the last two decades. Both projects employ similar question wording; yet, they espouse diametrically opposed approaches to response-scale structure. EB questionnaires elicit dichotomous responses (“Tend to trust” vs. “Tend not to trust”), with an additional “Don’t know” option (DK) explicitly offered to the respondent (note that the EB does not record refusals). In turn, the ESS records responses on an 11-point ordinal scale ranging from 0 “No trust at all” to 10 “Complete trust”; two nonresponse options (“Don’t know” and “Refusal”) are not explicitly shown to the respondent, but the interviewer can record them regardless. We examine the challenges involved in harmonising data regarding EP trust from the ESS and the EB by comparing their country-level averages at comparable time points.

Research problem

Cross-project comparisons of survey results are methodologically interesting in their own right; however, understanding them proves crucial for all secondary data users attempting to use harmonisation procedures to mitigate the impact of time-series or country-coverage gaps within individual projects. Notably, such coverage lapses prove vexing within the major European cross-national surveys. For instance, the EB only covers current member states and candidate countries of the European Union; hence, it does not provide data on countries such as Norway or Switzerland, and it stopped conducting surveys in the United Kingdom after Brexit. On the other hand, analyses based on the ESS, which aspires to more comprehensive Europe-wide coverage, face the constant problem of intermittent participation—thirty-eight countries participated in at least one ESS round, but only seventeen in all nine. However, even if no single European survey project provides exhaustive coverage of European countries over time, the overall abundance of different projects means that most countries have been covered most of the time. Considering five major European projects (EB, ESS, EVS, EQLS, and the ISSP), only five micro-states and principalities have not appeared in any of their surveys in the period 2000–2020: Andorra, Liechtenstein, Monaco, San Marino, and the Vatican; four other developing countries have been covered fewer than five times (Albania, Belarus, Bosnia and Herzegovina, Moldova). On average, the remaining thirty-eight European countries were covered by 22 individual country-level surveys belonging to one of the five projects (Jabkowski and Kołczyńska 2020b).

The EB and ESS have an excellent potential for complementation, especially given the coverage lapses of the ESS. However, the differences in response scale format between the two projects constitute a major challenge for data harmonisation, which boils down to two principal problems: (1) the length of the question scale and (2) the availability of an explicit “don’t know” answer. These contrasts constitute a major cause for concern for secondary users of harmonised data (Table 1).

Table 1 EP trust—question items EB versus ESS

The overall length of the response scale is known to have a meaningful effect on response styles (Baumgartner and Steenkamp 2001; Van Vaerenbergh and Thomas 2012). The general consensus seems to be that offering respondents few response options does not provide enough space for reflecting their true opinions; therefore, longer scales supposedly lead to better measurements (Lundmark et al. 2015). However, when faced with very long response scales, respondents tend to reduce the cognitive effort involved in expressing their opinions as scale points by choosing the middle response category (Kieruj and Moors 2010; Kieruj 2012); additionally, on multi-item scales, respondents tend to avoid the extremes (Weijters et al. 2013, 2010; De Langhe et al. 2011; Cabooter et al. 2016). On the other hand, when respondents are pushed to provide answers on a dichotomous scale, the tendency towards acquiescence will affect the result, especially when their knowledge about the subject of interest is low (Krosnick and Presser 2010), as “agreeing” answers are usually construed as socially acceptable (Krosnick 1991). The typical monotonic rescaling of values onto a common range employs a simple linear stretching algorithm (de Jonge et al. 2017), which does not mitigate the effects of underlying scale lengths, as it assumes a false equivalence between points on different response scales to which respondents react differently. Consequently, simple linear stretching tends to inflate harmonised mean averages when transforming shorter into longer scales (de Jonge et al. 2014; Batz et al. 2016; Dawes 2008).

Another meaningful effect on response styles, related to the cognitive effort induced by questions concerning issues respondents do not know much about, is the availability of an explicit “don’t know” option (DK). When faced with the sudden necessity to come up with an opinion, respondents are known to engage in various satisficing behaviours, with choosing the “no opinion” option chief among them (Krosnick et al. 2002). DK answers are known to be more likely for questions requiring high cognitive effort, while “Refusals” tend to occur more in sensitive questions (Shoemaker et al. 2002). Notably, political questions typically elicit more DKs (Laurison 2015). Furthermore, including DK as a valid response alongside other response categories increases its incidence (Bishop et al. 1986; Schuman and Presser 1996). Additionally, given that multinomial rating scales require more cognitive effort than dichotomous scales (Krosnick et al. 2005), when “no opinion” options are not explicitly offered, the respondent may engage in other satisficing tactics, including answering at random (Krosnick 1991), a blanket preference for “agree” (Krosnick et al. 1996), or a tendency to opt for the neutral answer on the response rating scale (Alwin and Krosnick 1991).

Data

Both the EB and the ESS are among the most prominent comparative survey projects in Europe. The European Commission conducts the EB biannually to monitor public opinion in EU member states and occasionally in candidate countries (Nissen 2014). The project relies on multistage samples with random-route procedures to select households and the closest-birthday rule to select respondents within households. National samples include approximately 1,000 respondents, with 500 respondents in smaller countries. Target populations in the Eurobarometer include the “population of the respective nationalities of the European Union Member States and other EU nationals, resident in each of the 28 Member States and aged 15 years and over” (EB 2018). The EB does not publish detailed survey documentation; e.g., it does not provide information on response rates or other fieldwork outcomes.

In turn, the ESS is an academically driven survey project of European societies conducted biennially since 2002. Despite aiming to cover the entire European continent, participation is skewed towards EU countries. Interviews are conducted face-to-face with “persons aged 15 and over resident within private households, regardless of their nationality, citizenship, language or legal status” (ESS 2018a). The ESS only allows probability samples and explicitly prohibits quota samples and substitutions of non-responding or non-contacted units. After accounting for design effects, all countries must achieve a minimum effective sample size of 1,500 respondents (or 800 in countries with populations of less than 2 million). The ESS makes exhaustive survey documentation readily available.

Our analysis is based on survey data from the eight most recent rounds of the ESS (2/2004; 3/2006; 4/2008; 5/2010; 6/2012; 7/2014; 8/2016; 9/2018) and eight timewise-corresponding spring waves of the EB (62.0/2004; 66.1/2006; 70.1/2008; 74.2/2010; 78.1/2012; 82.3/2014; 86.2/2016; 90.3/2018). Country coverage has been restricted to the 14 EU member states participating in both the EB and all ESS rounds. ESS 1 (2002) was left out along with its analogous EB wave, as it was only in 2004 that the major eastern enlargement of the EU occurred, changing the status of Estonia, Hungary, Poland, and Slovenia from candidate countries to member states. In total, the study encompasses 112 country surveys within each survey project, with the EB dataset comprising 119,783 respondents and the ESS 215,746. The EB waves are usually conducted in a short span of no more than two weeks, while the ESS tends to have its fieldwork spread out over multiple months; hence, the fieldwork dates are not identical. Nevertheless, the timing differences are small enough to allow for the assumption that both surveys were measuring the same underlying state of public opinion in each of the included countries.

Data harmonisation procedures

Our analysis employs the widely used procedure of linearly stretching source data onto a common range (de Jonge et al. 2017), also known as the “per cent of the maximum possible score achievable” (Cohen et al. 1999). Simple linear stretching only takes into account the number of response options, disregarding the potential effects of question wording, item difficulty, differences in the labelling of response-scale points, distances between response points, differences in distribution shapes, or the explicit availability of item-nonresponse options (Kolen et al. 2004). Given the multiplicity of potential problems, the resulting biases in the harmonised data cannot be easily enumerated and identified (Singh 2021). Notwithstanding its drawbacks, simple linear stretching remains the sole feasible harmonisation option when at least one of the underlying response scales is dichotomous. For instance, both observed score equating (Singh 2020; Kolen and Brennan 2004) and the reference distribution method (de Jonge 2017) aim to increase comparability by comparing response distributions; however, they do not apply to dichotomous variables, as the number of comparable response distributions for a 2-point scale is infinite (de Jonge et al. 2014).

Simple linear stretching treats the extremes of underlying response scales as equivalent endpoints of a unidirectional common range, with all remaining response options mapped onto this range. Thus, regardless of the original labelling of response scales, responses are assigned consecutive natural numbers, from i = 1 to the maximum equal to the number of response options, and the harmonised scores are calculated as \(\left( {i - 1} \right)/\left( {{\text{maximum}} - 1} \right)\). For the dichotomous EB scale, “Tend not to trust” is mapped onto 0 and “Tend to trust” onto 1, with the “Don’t know” answer treated as item-nonresponse. The 11-point ESS scale is treated as quasi-continuous, with 11 equidistant response options ranging from “No trust at all” (0) to “Complete trust” (1) and the item-nonresponse options of “Don’t know” and “Refusal”.

An anonymous reviewer of the first draft of this paper aptly pointed out that simple linear stretching could be improved by employing an alternative approach to the transformation of the underlying variable scores. In this modified procedure, a variant of which has been implemented in the Survey Data Recycling project (Slomczynski et al. 2016: 56–57), the response options are first mapped onto a sequence from i = 1 to the maximum equal to the number of response options by increments of 1, but their harmonised scores are then calculated as \(\left( {i - 0.5} \right)/{\text{maximum}}\). In other words, modified linear stretching cuts the [0;1] interval into as many equidistant segments as there are response options and then assigns each original score to the middle of its segment, under the assumption that the true scores within each segment are distributed symmetrically around its middle value. The 2-point scale thus yields values of 0.25 and 0.75 instead of the original 0 and 1; the 11-point scale is mapped onto a sequence from approximately 0.05 to 0.95 instead of the original 0 to 1.
The two approaches to linear stretching are visualised in Fig. 1.

Fig. 1
figure 1

Linear stretching: comparing the simple and modified approaches
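The two mappings can be sketched in a few lines of code. This is an illustrative Python sketch (the paper's own analysis was conducted in R, and the function names here are ours):

```python
def stretch_simple(i: int, n_options: int) -> float:
    """Simple linear stretching: map option i (numbered 1..n_options) onto [0, 1]."""
    return (i - 1) / (n_options - 1)

def stretch_modified(i: int, n_options: int) -> float:
    """Modified stretching: map option i to the middle of its segment of [0, 1]."""
    return (i - 0.5) / n_options

# 2-point EB scale: "Tend not to trust" (i=1), "Tend to trust" (i=2)
print([stretch_simple(i, 2) for i in (1, 2)])    # [0.0, 1.0]
print([stretch_modified(i, 2) for i in (1, 2)])  # [0.25, 0.75]

# 11-point ESS scale: 0 "No trust at all" .. 10 "Complete trust" -> i = 1..11
print([round(stretch_simple(i, 11), 3) for i in range(1, 12)])
print([round(stretch_modified(i, 11), 3) for i in range(1, 12)])
```

Note that for the 11-point scale the modified mapping runs from 0.5/11 ≈ 0.045 to 10.5/11 ≈ 0.955, i.e., the approximately 0.05–0.95 range described above.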

Linear stretching of a 2-point scale towards an 11-point scale represents a somewhat extreme case, exposing problems inherent in the approach that may seem less apparent when the differences in the number of response options between harmonised variables are smaller. Dichotomous responses represent not specific points on the true construct continuum but rather amalgamations of various construct intensities. Thus, the simple transformation of the EB’s “Tend to trust” into the outermost response on the 11-point scale of the ESS seems counterintuitive, as it pushes respondents with different levels of trust to the harmonised endpoint, while actual survey respondents tend to avoid extreme options on multi-item scales (e.g., Weijters et al. 2010). Many of those choosing the EB’s “Tend to trust” option are likely only trusting enough not to go for the polar opposite of “Tend not to trust”. The modified approach moves the harmonised values away from the extremes of the [0, 1] range. This displacement is likely largest for dichotomous variables, as it alleviates the effect of pushing respondents to extremes they would likely not choose on multinomial rating scales.

Our analyses implement both approaches; however, the main thrust of our argument falls on the commonly used simple linear stretch. Hence, in the interest of clarity, the results presented in the main body of the paper relate to the simple approach, but the same procedures were conducted using the modified approach, with their results included in the online supplementary materials. These materials will be referenced in our discussion at the point when we compare the modified alternative to the simple linear stretching. Since the results yielded by the two approaches are only subtly different, we found that attempting to report them both in the paper would be confusing.

Analytical approach

In order to investigate the impact of response-scale effects on the between-project incomparability of harmonised values, we built multilevel regression models with the harmonised mean of EP trust (\(\overline{EP trust}\)) in each country-level survey as the dependent variable. Introducing the second level of analysis, the values of \(\overline{EP trust}\) are nested within countries and years. To ensure the results of our analyses are robust to model design, we fit three models with different specifications. The models include the project name (\(PROJECT\)) as a factor indicating the type of response scale employed; they also include the fraction of item-nonresponse (\(DK\)) obtained by the two projects as an indicator of satisficing behaviours and a proxy of measurement quality. Furthermore, as a proxy for sample quality, we incorporate the internal criterion of representativeness (Kohler 2007)—the absolute sample bias (\(Abs\_bias\))—a useful measure for secondary data users as it does not require design weights (Jabkowski et al. 2021), which are not available in the EB (Jabkowski and Kołczyńska 2020a) and generally not available in survey datasets (Zieliński et al. 2018).

First, we ran a multilevel cross-classified random-intercept and fixed-slopes Model 1, with surveys nested in countries and years. For surveys in country i, year t, and project p:

$$\overline{EP trust}_{itp} = \beta_{0} + \gamma_{i} + \gamma_{t} + \beta_{1} \times PROJECT + \beta_{2} \times DK + \beta_{3} \times Abs\_bias + \varepsilon_{itp} ;$$
$$\gamma_{i} \sim N\left( {0;\theta_{i} } \right);$$
$$\gamma_{t} \sim N\left( {0;\theta_{t} } \right);$$

where \(\beta_{0}\) is the grand intercept, \(\gamma_{i}\) represents the between-countries random intercepts, \(\gamma_{t}\) the between-years random intercepts, the \(\beta\) terms are coefficients on the independent variables, and \(\varepsilon_{itp}\) is the residual.

Model 2 (with random intercepts and fixed slopes) adds interaction terms to the regression in order to verify whether \(PROJECT\) moderates the impact of \(DK\) and \(Abs\_bias\) on the survey-level mean values of EP trust—the standard procedure for testing for a moderating effect of any variable in regression analysis (Aguinis 2004; Jose 2013). Thus, for surveys in country i, year t, and project p:

$$\begin{aligned} \overline{EP trust}_{itp} &= \beta_{0} + \gamma_{i} + \gamma_{t} + \beta_{1} \times PROJECT + \beta_{2} \times DK + \beta_{3} \times Abs\_bias \\ &\quad + \beta_{4} \times PROJECT*DK + \beta_{5} \times PROJECT*Abs\_bias + \varepsilon_{itp} , \end{aligned}$$

where \(\beta_{0}\), \(\gamma_{i}\), \(\gamma_{t}\), the \(\beta\) terms, and \(\varepsilon_{itp}\) stand as above.

Finally, Model 3 extends Model 2 by releasing the slopes between countries (i.e., we assume the effects of \(DK\) and \(Abs\_bias\) on \(\overline{EP trust}\) may vary across countries). The specification of Model 3 is as follows:

$$\overline{EP trust}_{itp} = \beta_{0} + \gamma_{i} + \gamma_{t} + \beta_{1} \times PROJECT + (\beta_{2} + \gamma_{2i} ) \times DK + \left( {\beta_{3} + \gamma_{3i} } \right) \times Abs\_bias + \beta_{4} \times PROJECT*DK + \beta_{5} \times PROJECT*Abs\_bias + \varepsilon_{itp} ;$$

where \(\beta_{0}\), \(\gamma_{i}\), \(\gamma_{t}\), the \(\beta\) terms, and \(\varepsilon_{itp}\) stand as previously, \(\gamma_{2i}\) represents the between-countries random slopes for \(DK\), and \(\gamma_{3i}\) those for \(Abs\_bias\). Additionally, we assume \(\gamma_{2i} \sim N\left( {0;\theta_{2i} } \right)\) and \(\gamma_{3i} \sim N\left( {0;\theta_{3i} } \right)\).

The analysis was conducted in R (R Core Team 2022), using the following packages: tidyverse (Wickham et al. 2019), lme4 (Bates et al. 2015), and sjPlot (Lüdecke 2021).
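Since the models were fit in R with lme4, readers working in Python can approximate the cross-classified specification of Model 1 with statsmodels, encoding the crossed country and year intercepts as variance components over a single dummy group. The sketch below runs on simulated survey-level data; all variable names and numeric values are illustrative placeholders, not the paper's data or estimates:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Simulated stand-in for the 224 survey-level observations
# (14 countries x 8 years x 2 projects); effect sizes are invented.
rows = []
for c in range(14):
    c_eff = rng.normal(0, 0.05)                      # country random intercept
    for t in range(2004, 2019, 2):
        t_eff = rng.normal(0, 0.02)                  # year random intercept
        for proj in (0, 1):                          # 0 = EB, 1 = ESS
            dk = rng.uniform(0.02, 0.20)             # share of DK answers
            bias = rng.uniform(0.0, 0.10)            # absolute sample bias
            y = 0.55 - 0.15 * proj + 0.3 * dk + c_eff + t_eff + rng.normal(0, 0.03)
            rows.append(dict(country=f"c{c}", year=t, project=proj,
                             dk=dk, abs_bias=bias, ep_trust=y))
df = pd.DataFrame(rows)
df["one"] = 1  # single dummy group so both factors enter as crossed components

# Cross-classified random intercepts for country and year as variance components
model = smf.mixedlm("ep_trust ~ project + dk + abs_bias", df,
                    groups="one", re_formula="0",
                    vc_formula={"country": "0 + C(country)",
                                "year": "0 + C(year)"})
fit = model.fit()
print(fit.fe_params)  # fixed effects: Intercept, project, dk, abs_bias
```

The dummy-group trick is the standard statsmodels workaround for crossed (rather than nested) random effects, which `MixedLM` does not support directly; lme4's `(1 | country) + (1 | year)` syntax expresses the same structure natively.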

Results

The harmonised assessments of EP trust differ substantially between the EB and the ESS—a visual comparison is presented in Fig. 2. Every panel presents three distinct pieces of information: (1) circles in the lower deck represent the share of DKs in each respective survey of the EB and the ESS; (2) squares represent the percentage shares of the discrete options offered by the 11-point ESS scale—percentages are calculated for valid indications only, i.e., excluding DKs as item-missing; (3) EB and ESS trend lines based on the harmonised country-level means for the 2004–2018 time series (also calculated for valid scale indications). The results in the figure rely on the simple linear stretching procedure; for a comparison with the modified approach, please consult the online supplementary materials.

Fig. 2
figure 2

Comparison of EB and ESS estimates of trust in the European Parliament (2004–2018)

Juxtaposing the rescaled country-year mean values of EP trust between the EB and the ESS, the EB (underlying 2-point scale) yields systematically different country-level mean aggregates than the ESS (underlying 11-point scale). These patterns are apparent in the two trend lines presented on each country panel in Fig. 2. The absolute differences between rescaled means obtained in the EB and the ESS are significantly larger than zero, as confirmed by a one-tailed t-test (the absolute difference is equal to .167; t = 20.39; p < .001). For the modified linear stretch, the resulting differences between rescaled means prove smaller but remain significant (the absolute difference is equal to .10; t = 23.20; p < .001). Hence, the simple linear stretching of different scales onto a common range does not make survey-specific averages mutually interchangeable, and neither does the modified approach, despite the decrease in the absolute mean differences. The modified approach does, however, increase the similarity of time-series trends between the two survey projects. Crucially, in the simple approach, as presented in Fig. 2, the overall tendency for EB averages to be higher than those of the ESS is reversed in two apparent outlier cases: Spain and the United Kingdom. The modified approach, as visualised in the online supplementary materials, seems to make the trend lines for these two countries conform better to the overall pattern observed in all other countries included in the comparison.

On the other hand, both simple and modified linear stretching ignore the explicit availability of the no-opinion option, so their results are identical in this respect. The EB records DKs more often, as is apparent from the relative size and fill transparency of the circles within each country panel in Fig. 2. The differences are statistically significant, as confirmed by a one-tailed paired t-test (the mean difference between EB and ESS country-year surveys equals .063; t = 12.95; p < .001). Thus, in line with theoretical expectations, explicitly offering DK among the valid response options results in a higher share of respondents choosing it.
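A one-tailed paired comparison of this kind can be reproduced generically with SciPy; the snippet below uses simulated DK shares (not the paper's data) purely to illustrate the test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative stand-in for the 112 paired country-year DK shares;
# the EB offers DK explicitly, so its share is simulated to run higher.
dk_ess = rng.uniform(0.02, 0.12, size=112)
dk_eb = dk_ess + rng.normal(0.063, 0.02, size=112)

# One-tailed paired t-test: H1 is that the EB's DK share exceeds the ESS's
t, p = stats.ttest_rel(dk_eb, dk_ess, alternative="greater")
print(f"mean difference = {np.mean(dk_eb - dk_ess):.3f}, t = {t:.2f}, p = {p:.3g}")
```

The `alternative` argument of `ttest_rel` requires SciPy ≥ 1.6; older versions only report two-sided p-values, which must then be halved for a directional hypothesis.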

To estimate the response-scale effect on harmonised trust assessments, we built three multilevel regression models, which also incorporate item nonresponse (i.e., the fraction of DK answers) and absolute sample bias as covariates of the mean value of EP trust. Table 2 provides regression estimates (Est.) alongside standard errors (SE). The results presented rely on simple linear stretching; for those of the modified approach, please consult the online supplementary materials.

Table 2 Results of multilevel regression modelling–the simple linear stretching procedure

Model 1 tests whether the dichotomous response option in the EB results in higher harmonised means of EP trust than those obtained from the 11-point ESS response scale. It controls for the share of DK and the size of \(Abs\_bias\), as well as for sample-level averages being nested within countries and waves. The regression parameter for the project is significantly below zero, i.e., harmonised EB country-level means are notably higher than the ESS country-level means (on average, the ESS means are 15 pp. lower). Note that the results of Model 1 echo our descriptive findings presented in Fig. 2.

As the EB and the ESS take different approaches to presenting DK options, and both projects utilise different sampling and fieldwork procedures, the share of DK answers and the size of \(Abs\_bias\) may have different effects on the harmonised country-level means within each project. Hence, even if Model 1 does not provide evidence for a significant impact of DK and \(Abs\_bias\) on the harmonised country-level mean values, this may stem from a moderating effect of the projects’ characteristics on both relationships. Model 2 partly supports this presumption: the ESS moderates the effect of DK on country-level harmonised means observed in the EB; however, the effect of \(Abs\_bias\) is similar for both projects (and not statistically significant regardless of whether we consider EB or ESS data). Considering the impact of the fraction of DK on the harmonised means of EP trust in the EB, the statistically significant beta coefficient of .335 means that the higher the share of DK answers, the higher the mean EP trust observed in each country survey; for the ESS, the relationship is reversed. Figure 3 presents the moderating effect of the project on the relationship between the fraction of DK and the mean value of EP trust.

Fig. 3
figure 3

Moderating effect of the project on the association between the share of DK and EP trust

Note that when we included the moderating effect of the project on the impact of DK on average EP trust, the beta coefficient for the project decreased in magnitude (from − .15 in Model 1 to − .07 in Model 2). Still, the harmonised country-level mean values of EP trust are significantly higher in the EB (the project’s effect remains significant).

Finally, Model 3 allows the slopes for DK and \(Abs\_bias\) to differ between countries, controlling for whether the effects of both variables are the same in every country. The regression coefficients remain practically the same; thus, the findings based on Model 2 remain valid. However, the random-effects analysis demonstrates that countries differ in the size of the impact of DK and \(Abs\_bias\) on country-level averages of EP trust; nevertheless, they remain similar in the direction of the impact. In the EB, except for the UK, the effect of DK and \(Abs\_bias\) is uniformly positive (for the UK, negative). Once the moderating effect of the ESS on the impact of DK is included, the effect of DK is negative (except in the UK, where it is positive). As the ESS does not moderate the effect of \(Abs\_bias\), we can conclude that the cross-country variation in the direction of the impact of \(Abs\_bias\) on country-level averages remains the same as in the EB.

The same models have also been estimated for the modified approach to linear stretching (for details, consult the online supplementary materials). Despite the descriptive differences between the two approaches in country-level means, using the modified approach does not change the significance of estimated effects.

Discussion

Our case study of EP trust focuses on two extremely different scale formats (2-point and 11-point). The extreme difference magnifies the negative consequences of linear stretching, demonstrating that the harmonised results cannot be treated as equivalent without accounting for the response-scale format. While such challenges to equivalence may seem smaller when harmonising variables of similar underlying scale lengths, linear stretching would still be problematic due to its unrealistic assumptions regarding the equidistance and equivalence of response options and the identical shapes of their empirical distributions. Dealing with 2-point scales only amplifies these inherent problems of stretching, which are not, however, specific to this particular scale format. Linear stretching onto a common range might provide a semblance of comparability; yet the EB and the ESS yield systematically different results after harmonisation. Hence, even in the relatively straightforward case of trust in the European Parliament, there seems to be no obvious way of using harmonisation to mitigate the projects’ respective coverage lapses by substituting missing information with data from other timewise-corresponding country-level measurements.

In principle, the response-scale effects could be incorporated into analyses alongside other measurement characteristics whenever retained as metadata by the producers of harmonised datasets. Regarding response-scale characteristics, harmonisation projects usually provide ready access to information on the length of response scales alongside other characteristics of the underlying questionnaire items (Kołczyńska and Slomczynski 2018; Mauk 2020). These typically include methodological controls for harmonised variables, such as the presentation of response options in ascending or descending order, the number of source items used to create the target variable, the between-project differences in question-item conceptualisations, and the quality of survey documentation (Slomczynski and Tomescu-Dubrow 2018; Granda et al. 2010). On the other hand, any general harmonisation, i.e., data recycling without specific research questions in mind, must remain focused on encoding the characteristics most commonly in demand, due to obvious resource constraints; the catalogue of all potentially useful characteristics is too large for any general harmonisation. The type of the “no-opinion” response option appears to be one of the less common characteristics; the degree of cognitive burden induced by a question item constitutes another example of this category. Including such more idiosyncratic variables and meta-characteristics typically means, however, that researchers have to become secondary data producers before they can be users.

Even if all relevant meta-characteristics of the underlying response scales are made available, there is no straightforward way for secondary data users to incorporate them into analyses of harmonised data (Slomczynski et al. 2021). As our analysis demonstrates, response-scale effects are significant but not uniform across countries and time points; they appear sensitive to the social and political context. In the case of trust in the European Parliament, far more is known about the topic than could feasibly be incorporated into a harmonised database. Some domain-specific information could be retained by a specific harmonisation. For instance, a quasi-arbitrary weighting factor in a harmonisation scheme could represent the fact that survey questions regarding the abstract notion of parliamentary trust at the European level place a high cognitive burden on most respondents, who may lack the relevant skills and knowledge to formulate informed judgments (Anderson 1998). Furthermore, many more variables could be targeted for harmonisation, given that trust in the European Parliament is known to be associated with a variety of other factors, e.g., support for EU membership (Gabel 2003), the overall quality of national institutions (Rohrschneider 2002), general social trust (Dellmuth and Tallberg 2020) or life satisfaction (Listhaug and Ringdal 2008). However, there seem to be clear limits of size and complexity restricting the amount of information that could be meaningfully incorporated even into a highly specific harmonisation.

No matter how exhaustive the inventory of harmonised meta-data, there seems to be no viable way to escape the reality that surveys are conducted in specific socio-political contexts. While such contextual embedding may have minimal impact on the measurement of socio-demographic characteristics, questions regarding substantive issues such as opinions or attitudes are likely to be significantly influenced by such contextual idiosyncrasies. For instance, the cases of Spain and the United Kingdom show that no analysis of trust in the European Parliament can abstract from the context of their political realities. In the UK, the debate over Brexit, i.e., the ultimate abandonment of EU membership, proved to be among the most polarising political topics of the decade (Hobolt et al. 2021). Unlike in most other EU member states, this translated into substantial and continuous public-sphere attention devoted to European institutions (North et al. 2021). As for Spain, at the time of the major shift in the EP-trust averages reported by the EB, the country was enduring a prolonged and devastating economic downturn, widely perceived in the context of the broader crisis of the Euro (the European common currency). While not as extreme as in the case of Greece, the Spanish position vis-à-vis the European institutions grew antagonistic, shattering the previously prevailing pro-European consensus (Real-Dato and Sojka 2020). The polarisation of opinion can also be observed in the percentage share of the most negative answer on the ESS scale; yet, due to the overall dominance of the neutral option on the 11-point scale, this polarising shift does not register strongly in the ESS mean. On the other hand, the effects of underlying public-opinion polarisation translate into substantial shifts registered on a 2-point scale.
While Spain and the UK present the most extreme cases of this pattern within the 14 countries under consideration, similar effects appear in France, Ireland, and Slovenia.

Concerning the effects of polarisation, the modified approach to linear stretching shows some promise for the harmonisation of dichotomous and multinomial response scales, as lessening scale polarisation makes the country-level averages more similar. However, despite some apparent advantages, the modified approach does not seem to be a necessarily superior alternative to simple linear stretching. Although it would constitute a targeted improvement for dichotomous variables, it has a potentially critical flaw from the point of view of harmonisation: it does away with a common variable range. As the variability of 2-point scales is constrained to [0.25, 0.75] and that of 11-point scales to [0.05, 0.95], the harmonised trust average for any category of EB respondents could not exceed 79% of the maximum potential average of the ESS. Hence, as the minimums and maximums of variables become dependent upon their original response scales, harmonisation can no longer be construed as mapping onto a common range.
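The loss of a common range can be illustrated with a short sketch. The exact formula of the modified approach is not restated in this section; the sketch assumes it places option i of a k-point scale at the midpoint of its 1/k-wide bin, (i − 0.5)/k, which reproduces the [0.25, 0.75] and [0.05, 0.95] ranges and the 79% ceiling quoted above.

```python
def midpoint_stretch(option_index: int, k: int) -> float:
    """Place option i (1..k) of a k-point scale at the midpoint of its
    1/k-wide bin within (0, 1), rather than stretching endpoints to 0 and 1."""
    if not 1 <= option_index <= k:
        raise ValueError("option index outside the scale")
    return (option_index - 0.5) / k

# Attainable ranges now depend on the scale length:
print(midpoint_stretch(1, 2), midpoint_stretch(2, 2))  # 0.25 0.75
print(round(midpoint_stretch(1, 11), 2),
      round(midpoint_stretch(11, 11), 2))               # 0.05 0.95

# Ceiling of a 2-point mean relative to the 11-point maximum:
print(round(midpoint_stretch(2, 2) / midpoint_stretch(11, 11), 2))  # 0.79
```

As the output shows, a respondent group answering an EB-style item unanimously positively can never reach the maximum attainable on an ESS-style item, which is precisely why the transformation no longer maps onto a common range.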

Conclusion

The popularity of linear stretching likely stems primarily from its ready applicability as well as its conceptual and computational simplicity. Other ex-post data harmonisation methods, such as those based on reference distributions, scale intervals or multiple imputation with calibration samples, all face meaningful applicability restrictions. For instance, they are not well suited to all types of response scales and prove especially thorny when combining dichotomous and polychotomous variables; furthermore, they require auxiliary information not usually made available in the datasets or published documentation (de Jonge et al. 2014; Kolen et al. 2004). Thus, these alternative methods usually prove impractical for secondary data users, especially those engaged in large-scale analyses (Jabkowski et al. 2021). Notwithstanding its practical applicability, simple linear stretching makes the questionable assumptions that (1) the distances between the response options are equal, (2) the question wording and the labelling of the response options are irrelevant to the analysis, and (3) the response distributions have an identical shape. Consequently, linear stretching relies only on the number of response options of the primary scale, and the transformation ignores all additional scale-specific characteristics. Thus, such harmonisation usually fails to yield comparable mean scores (Batz et al. 2016).

Comparability of mean scores constitutes a crucial challenge for using harmonised data in substantive analyses of survey results. Without systematically accounting for differences in measurement quality, data harmonisations based on simple linear stretching show limited utility for substantive cross-project comparisons or for dealing with gaps in the geographical coverage of some survey projects. Stretching creates a semblance of cross-project equivalence by mapping the underlying response values onto a common range while ignoring the differences in their measurement quality. Ignoring those differences does not neutralise them: they are retained in the distributions of the variables, which manifests in significant differences in mean scores. Although the differences in measurement quality could be included in harmonisation projects and incorporated into modelling by secondary data users, our analysis suggests no straightforward way of achieving that goal.

Our analysis indicates that harmonisation based on simple linear stretching involves inherent and irredeemable flaws. Admittedly, this general conclusion is based on a narrow case study of EB and ESS measurements of trust in the European Parliament, which identified systematic cross-project differences in country-level estimates over time. We attribute these differences to contrastive satisficing behaviours elicited by the projects' divergent response-scale formats, i.e., the EB's dichotomous scale (“Tend to trust” vs. “Tend not to trust”) with an explicit DK option and the ESS's 11-point ordinal scale (from “No trust at all” to “Complete trust”) without an explicit DK option. Our study attested to two major cross-project contrasts. Firstly, we found that the EB registers a significantly higher incidence of DK answers, commonly treated as item nonresponse in both survey projects. Secondly, when it comes to valid indications, after simple linear stretching of the original responses onto a common range, the EB tends to register higher levels of trust than the ESS. Even though the contrastive patterns in the results of the two survey projects can be attributed to their different response-scale formats, these effects can be moderated by the socio-political contexts in which the surveys are conducted in particular countries.