1 Introduction

Improving the measurement of political solidarities, defined as one’s willingness to share the costs of public redistribution that favours people other than oneself, is an important endeavour in political science and adjacent research fields. Political solidarities are a multi-dimensional concept whose measurement touches upon the literature on welfare state attitudes and deservingness and forms part of a broader measure of social cohesion: one’s willingness can be influenced by perceptions of the target group of redistribution (e.g. children, immigrants, or the unemployed), perceptions of the system of redistribution (e.g. local, regional, national, or supranational), and perceptions of the political system’s effectiveness (e.g. political trust) (Goerres 2021).

It is of particular importance to measure political solidarities in a methodologically sound way. First, there is a large and growing literature on welfare state attitudes in political science, sociology and economics, some studies of which may use survey measures of questionable quality (see, for example, Lundmark et al. 2016). In the three decades since 1991, mentions of welfare state attitudes as a main topic of a research article have more than tripled in social science abstracts (1991–2000: 86; 2001–2010: 161; 2011–2020: 321). Second, some survey measures of political solidarities, such as whether the state should redistribute from the rich to the poor, have become standard questions (or items) in general social surveys, even though their validity remains somewhat unclear. Third, many advanced industrial democracies face various large-scale changes, some of them exogenous shocks, to their welfare state systems, such as mass immigration, the rise of the populist right, growing inequality, economic crises, or the Covid-19 pandemic. These changes have the potential to undermine the underlying social contract between citizens and the state that allows extensive redistribution within the welfare state. Governments are becoming aware of these changes and are reacting to them by turning to academia for answers. For example, the German Parliament (Deutscher Bundestag) decided to install a federal institute for social cohesion whose objective is to advise the government on how to deal with the changed situation (original decision on 10th November 2016; funding started in 2018). Finally, and probably most importantly, changes seem to be taking place in the latent cognitive maps of European citizens in their relationship to the state, with increasing emphasis on how much is redistributed and on how and to whom that redistribution should take place (Cavaillé and Trump 2015).

The quantitative measurement of political solidarities has often been approached under the heading of measuring welfare state attitudes (Goerres and Prinzen 2012), i.e. individual assessments of welfare state institutions, policies and spending. Political solidarities are a subset of welfare state attitudes, focusing on the willingness to finance welfare state activities favouring others (Goerres and Tepe 2010). In that literature, certain scales of welfare state attitudes from large international comparative surveys (see, for example, the European Social Survey [ESS] and the International Social Survey Programme [ISSP]) are commonly used, and we draw upon those scales in our study. The ESS development team, in particular, broadened the measurement of welfare state attitudes immensely in survey rounds 4 (2008) and 8 (2016). Recently, efforts have been made to expand the types of welfare state services towards which respondents’ attitudes are measured, for example by including survey questions on Early Childhood Education policy (Neimanns and Busemeyer 2021), confronting respondents with trade-offs between policy areas (Busemeyer and Garritzmann 2017), presenting respondents with complex, multidimensional vignettes (Gallego and Marx 2017), or having respondents reveal their concrete willingness to pay for certain policy reforms (Boeri et al. 2001). There is thus an abundance of uncoordinated attempts to measure political solidarities. What is missing, however, is a concrete quality assessment of established measures in a rigorous measurement set-up.

Considering the increasing importance of political solidarity measures for political science and adjacent research fields, in this paper, we experimentally investigate the response effort (response times) and data quality (criterion validity) of existing and newly designed rating scales of survey measures on political solidarities and related concepts. Survey measures of high data quality are a pre-requisite for drawing correct and robust conclusions. We pre-registered our study, including the research questions and analysis plan, at the Open Science Framework.

In what follows, we outline the methodological background on rating scale design and present our research questions. We then describe the experimental design, the survey questions we use, the data collection and study procedure, and the sample characteristics. After that, we present the results of our study and, finally, provide a discussion and conclusion, including perspectives for future research.

2 Methodological background and research questions

Numerous national and international social surveys, such as the CROss National Online Survey (CRONOS), which is part of the ESS, regularly measure respondents’ attitudes towards and opinions on political solidarities and related concepts, such as redistribution and social trust. In order to measure these constructs, researchers commonly make use of rating scales (i.e., closed answer formats with an ordered list of options). When it comes to rating scales, certain design aspects must be taken into consideration by researchers because these aspects can have a profound impact on respondents’ answer behaviour and thus on response effort and data quality (DeCastellarnau 2018; Krosnick and Presser 2010; Menold and Bogner 2014; Schaeffer and Dykema 2020; Schaeffer and Presser 2003).

For example, decisions must be taken with respect to

  1. the scale length (i.e. the number of scale points),
  2. the scale verbalization (i.e. completely or end verbalized),
  3. the inclusion of non-substantive answer options (e.g. “don’t know” or “no opinion”),
  4. the scale polarity (i.e. unipolar or bipolar),
  5. the inclusion of numeric values (i.e. whether the scale points are provided with or without numbers),
  6. the scale direction (i.e. decremental or incremental), and
  7. the scale alignment (i.e. horizontal or vertical).

In this study, the first three design aspects—scale length, scale verbalization, and non-substantive answer options—are of primary interest, because research indicates that they have the potential to affect the answer behaviour of respondents. Thus, in this section, we outline the current state of research on these three design aspects.

Based on the range-frequency model by Parducci (1983), scale length is a key aspect of rating scales, because it influences respondents’ understanding of the underlying rating dimension and determines the degree of differentiation (see Menold and Bogner 2014). Literature reviews by Krosnick and Fabrigar (1997) as well as Krosnick and Presser (2010) indicate that five- and seven-point scales work best in terms of reliability and validity (see also DeCastellarnau 2018 for a comprehensive overview). In addition, some evidence suggests that respondents prefer five- and seven-point rating scales over other scale lengths (Krosnick and Fabrigar 1997). One reason for this finding might be that shorter rating scales (fewer than five points) do not allow sufficient differentiation between answer options, whereas longer rating scales (more than seven points) offer finer gradations than respondents can reliably distinguish. However, studies by Tourangeau et al. (2017) as well as Höhne, Krebs, and Kühnel (Under Review) reveal that seven-point rating scales, compared to five-point rating scales, are more prone to primacy effects. Specifically, with seven-point rating scales, respondents’ answers shifted towards the beginning of the rating scale, producing systematic measurement error. Thus, it seems advisable to give preference to five-point over seven-point rating scales.

Like scale length, scale verbalization is a key aspect to consider when designing rating scales (see DeCastellarnau 2018; Krosnick and Presser 2010; Menold and Bogner 2014; Schaeffer and Dykema 2020; Schaeffer and Presser 2003). The main reason is that verbal labels for all options (i.e. completely verbalized) or only for the end options (i.e. end verbalized) convey crucial information that respondents, being “cooperative communicators” (Schwarz 1996), use in order to understand and answer survey questions meaningfully (Höhne et al. 2021b; Höhne and Yan 2020; Parducci 1983; Sudman et al. 1996; Toepoel and Dillman 2011; Tourangeau 2004; Tourangeau et al. 2007). For example, Höhne et al. (2020, 2021a) compared completely and end verbalized unipolar and bipolar rating scales with five points. The authors found that end verbalized rating scales perform best in terms of measurement properties, irrespective of scale polarity. Specifically, end verbalized unipolar and bipolar scales result in similar answer distributions, are measurement invariant, and have equidistantly distributed scale points. The authors attribute this effect to the unlabelled centre of the rating scales, which conveys the impression of equally spaced intervals. Since equidistance is a pre-requisite for the use of rating scales (see Mohler et al. 1998; Rohrmann 1978; Stevens 1946), the use of end verbalized rating scales appears preferable.

Finally, in line with satisficing theory, offering non-substantive answer options may be problematic, because it fosters (strong) satisficing answer behaviour (Krosnick 1991, pp. 219–220). To put it differently, non-substantive answer options provide respondents with an easy way to avoid answering survey questions meaningfully. For this reason, some authors recommend not including non-substantive answer options in rating scales (see, for instance, Gilljam and Granberg 1993; Krosnick 1991; Krosnick and Presser 2010; Krosnick et al. 2001; Saris and Gallhofer 2014). Krosnick et al. (2001), for example, analysed data from nine survey experiments investigating the impact of non-substantive answer options on respondents’ answer behaviour. Interestingly, the authors show that the selection of non-substantive answer options was highest among low-educated respondents and for questions placed towards the end of the survey. They conclude that non-substantive answer options do not improve data quality, but rather preclude the collection of meaningful answers from respondents.

Considering the design recommendations inferred above with respect to scale length, scale verbalization, and non-substantive answer options, it seems best to employ five-point, end verbalized rating scales without non-substantive answer options. First, this scale length produces good data quality and appears to be preferred by respondents. Second, this type of scale verbalization shows good measurement properties in terms of equidistance. Finally, excluding non-substantive answer options may prevent the occurrence of (strong) satisficing answer behaviour.
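
To make this recommendation concrete, the following sketch shows how such a scale could be set up in Stata, the software we later use for our analyses; the variable and label names are purely hypothetical and serve only as an illustration.

    * Hypothetical example: a five-point, end verbalized agreement scale
    * without a non-substantive answer option. Only the two end points carry
    * verbal labels; the middle categories remain purely numeric.
    label define end5 1 "Do not agree at all" 5 "Completely agree"
    label values solidarity_item end5
    * No "don't know" category is offered; item nonresponse would simply be
    * stored as system missing (.) rather than as a substantive scale value.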

In this study, we comprehensively searched numerous scientific articles and established social surveys, such as the ESS, for questions addressing political solidarities and related concepts. Based on our search, we compiled a total of 16 survey questions on redistribution, governmental scope, social trust, and welfare chauvinism. The rating scales of these questions varied considerably and, from a methodological perspective, their design might be open to improvement following the previously outlined recommendations. For example, some questions were accompanied by four-point, completely verbalized rating scales with a non-substantive answer option, while others were accompanied by eleven-point, end verbalized rating scales. In line with the previously outlined design recommendations, we developed five-point, end verbalized rating scales for all survey questions under investigation while maintaining the original question stems and statement formulations. We then conducted a survey experiment in an online access panel in Germany (N = 1513) to systematically test the original and improved rating scales in terms of response effort and data quality. We address the following two research questions:

  1. Do the methodologically improved survey questions, compared to the original ones, decrease response effort in terms of response times?
  2. Do the methodologically improved survey questions, compared to the original ones, increase data quality in terms of criterion validity?

By addressing these two research questions, our study stands out from previous research for several reasons: (1) much of the existing research was conducted before the emergence of contemporary online surveys (see, for example, DeCastellarnau 2018; Krosnick and Presser 2010), (2) research in this area emphasizes the lack of studies (experimentally) investigating questions on political solidarities and related concepts (Lundmark et al. 2016), (3) existing research frequently considers only single design aspects, such as polarity (see, for example, Höhne et al. 2020), rather than testing multiple design aspects simultaneously, and (4) most studies do not investigate response effort.

3 Method

3.1 Experimental design

We used a between-subject design. Respondents were randomly assigned to one out of two experimental groups. The first group (n = 726) received survey questions with rating scales that were taken from established social surveys (original condition). The second group (n = 787) received the same survey questions but with the methodologically improved rating scales (improved condition).

3.2 Questions

Target questions: We employed 16 target questions that we adopted from scientific articles and established social surveys, such as the ESS. The 16 questions are thematically grouped: redistribution (3 questions), governmental scope (5 questions), social trust (3 questions), and welfare chauvinism (5 questions). For each target question, we developed methodologically improved rating scales (improved condition), while maintaining the phrasing of the original question stems and statement formulations. The 16 target questions were presented at the beginning of the online survey in order to prevent carry-over effects from previous questions. We presented one target question per online survey page (single question presentation). The original German question wordings can be found in the pre-registration on the Open Science Framework (see https://osf.io/vzwr3?view_only=fb32a31bf37549daa1192d4501441d12). Appendix 1 shows the English translations of the target questions, including the rating scales, and Appendix 2 displays screenshots of the survey questions.

Criterion questions: We used 5 survey questions on political attitudes as criterion measures in order to evaluate criterion validity. For redistribution, governmental scope, and welfare chauvinism, we used one question on the willingness to expend taxpayer money on social benefits and one question on the willingness to facilitate immigration of foreigners. For social trust, we used three questions on political trust (trust in parliament, trust in politicians, and trust in parties). These questions were presented in the third quarter of the survey.

Determining criterion validity is an established method that has been used in previous research (see, for instance, Höhne and Yan 2020; Yeager and Krosnick 2012). The 5 questions were chosen as criterion questions because they are conceptually relevant to the topics of the target questions. In addition, they correlated significantly with all the experimentally manipulated target questions. In order to determine criterion validity, we investigate which of the two conditions (original or improved) results in higher correlations between the target questions and the criterion questions. Higher (lower) correlations indicate higher (lower) criterion validity. The original German question wordings can be found in the pre-registration on the Open Science Framework (see https://osf.io/vzwr3?view_only=fb32a31bf37549daa1192d4501441d12). Appendix 3 shows the English translations of the criterion questions, including rating scales.
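
As a rough illustration of this check, the following Stata sketch computes such a correlation for one hypothetical pair of variables (red1_01 as a 0–1 recoded target question and crit_benefits as the criterion question on social benefits; both names are placeholders).

    * Pairwise correlation between a target question and a criterion question,
    * with significance level (hypothetical variable names).
    pwcorr red1_01 crit_benefits, sig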

3.3 Data collection and study procedures

Data were collected in the Forsa Omninet Panel (omninet.forsa.de) in Germany from 28th July to 16th August 2021. Forsa drew a cross-quota sample from their online panel based on age (young, middle, and old), gender (female and male), and education (low, middle, and high). In addition, they drew quotas based on region (East and West Germany). The quotas were calculated based on the German Microcensus (2019), which served as a population benchmark. The data, including analysis code, are available for replication purposes via the platform Harvard Dataverse (see https://doi.org/10.7910/DVN/XKERRU). This study was pre-registered via the platform Open Science Framework on 27th July 2021.

Forsa invited respondents via email (including two rounds of reminders). The email informed respondents that they would participate in an online survey conducted by the University of Duisburg-Essen (Germany). In addition, it included a link directing respondents to the online survey. On the first page of the online survey, respondents were introduced to the topic (i.e. social and political attitudes) and the procedure of the online survey. Respondents also received a statement of confidentiality assuring them that the study adheres to existing data protection laws and regulations. In addition, the study was approved by the ethics committee of the department of Computer Science and Applied Cognitive Science of the University of Duisburg-Essen (Germany).

We also collected several types of paradata, such as response times, using the open-source tool “Embedded Client Side Paradata (ECSP)” developed by Schlosser and Höhne (2018). Prior informed consent for the collection of paradata was obtained by Forsa as part of the respondents’ registration process. In addition, respondents received modest financial compensation for their participation from Forsa.

Forsa invited a total of 2848 respondents to participate in the online survey, of whom 1115 (39%) did not react to the survey invitation, 168 (6%) were screened out because the quotas were already filled, and 52 (2%) did not finish the online survey. This leaves 1513 respondents available for statistical analyses (a participation rate of about 53% among the invited panel volunteers).

3.4 Sample characteristics

Respondents were aged between 18 and 88 years, with a mean age of 52 years (SD = 17 years), and 49% of them were female. In terms of education, 33% completed lower secondary school or less (low education level), 27% intermediate secondary school (medium education level), and 41% college preparatory secondary school or university (high education level).

In order to evaluate the effectiveness of random assignment and the sample composition between the two experimental groups, we conducted several statistical tests. The results revealed no statistically significant differences between the experimental groups with respect to age, gender, and education.
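
A minimal sketch of such balance checks in Stata, assuming hypothetical variable names (age in years, a binary indicator female, a three-category variable education, and a 0/1 indicator condition for the experimental group), could look as follows.

    * Balance checks between the two experimental groups
    * (condition: 0 = original, 1 = improved); variable names are hypothetical.
    ttest age, by(condition)            // compare mean age across conditions
    tabulate female condition, chi2     // gender composition by condition
    tabulate education condition, chi2  // education composition by condition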

4 Results

For comparability, we initially recoded the answer options of the survey questions to a scale running from 0 to 1. This was done for the 16 target questions as well as for the 5 criterion questions used in this study. In all analyses, we use a p-level of 0.05 to determine statistical significance. We used Stata (version 17) for data preparation and analyses.
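
As an illustration, the following Stata sketch shows this kind of linear 0–1 recoding for one hypothetical target question that was answered on an eleven-point scale (stored as 0–10) in the original condition and on a five-point scale (stored as 1–5) in the improved condition; all variable names and codings are assumptions.

    * Linear recoding to the 0-1 range: (value - minimum) / (maximum - minimum).
    gen red1_01 = red1_orig / 10         if condition == 0  // 11 points, coded 0-10
    replace red1_01 = (red1_imp - 1) / 4 if condition == 1  // 5 points, coded 1-5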

4.1 Answer distributions

In the first step of our analysis, we investigated the answer distributions of the 16 target questions. Since the rating scales partially differ in length between the conditions (e.g. eleven points in the original condition and five points in the improved condition), we compared the means of the scales recoded to range from 0 to 1 instead of comparing proportions. Accordingly, we conducted two-sample Student t-tests. Table 1 reports the statistical results.
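
For one hypothetical recoded target question (red1_01) and the 0/1 condition indicator assumed above, the corresponding two-sample t-test in Stata would simply be:

    * Two-sample t test of the 0-1 recoded answers by experimental condition.
    ttest red1_01, by(condition)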

Table 1 Mean values of the 16 target questions for each condition (original and improved)

The results in Table 1 show that the mean differences between the original and improved conditions are negligibly small (differences < 0.05). This applies to nearly all target questions, irrespective of the concept (i.e. redistribution, governmental scope, social trust, and welfare chauvinism). The only exception is the first question on redistribution (red 1), which has a significantly higher mean value in the original condition. Overall, these results provide strong empirical evidence that respondents’ answer behaviour is largely unaffected by the rating scale design when respondents are asked survey questions on political solidarities and related concepts.

4.2 Response times

Response times enjoy a long tradition in social psychology and survey research (Couper and Kreuter 2013; Yan and Tourangeau 2008) and have proven to be useful measures of response effort (Bassili and Scott 1996; Fazio 1990; Höhne et al. 2017; Lenzner et al. 2010; Yan and Olson 2013). In general, researchers assume that the time taken to process questions corresponds (directly) to the response effort required to answer a survey question. This, in turn, suggests that the longer (shorter) a respondent needs to answer a question, the higher (lower) the response effort.

We compared the response effort associated with the survey questions between the original and improved conditions. Response times were measured in milliseconds and defined as the time elapsing between the presentation of the question on the screen and the submission of the online survey page. To compare response times, we computed median values and accordingly used Mann–Whitney U tests. Table 2 reports the statistical results.
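
Assuming a response time variable in milliseconds (rt_red1, a placeholder name) and the same 0/1 condition indicator, one way to carry out this comparison in Stata is sketched below.

    * Median response times per condition and a Mann-Whitney (Wilcoxon
    * rank-sum) test of the response time distributions.
    tabstat rt_red1, by(condition) statistics(median n)
    ranksum rt_red1, by(condition)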

Table 2 Median response times (ms) of the 16 target questions for each condition (original and improved)

Comparing the medians displayed in Table 2, one can observe that respondents took consistently longer to answer the survey questions with the original rating scales than the survey questions with the methodologically improved rating scales. Specifically, we find significantly longer median response times in the original condition for 13 out of 16 comparisons. The only exceptions are the first question on redistribution (red 1) as well as the third and fourth questions on welfare chauvinism (wel 3 and wel 4), for which we do not find significant differences. Overall, these findings provide strong empirical evidence that the methodologically improved questions, compared to the original questions, require less response effort in terms of response times.

4.3 Criterion validity

Finally, we investigated data quality in terms of criterion validity between the original and improved conditions. Specifically, we examined the strength of the correlations between the 16 experimentally manipulated target questions and the five criterion questions on social benefits, immigration, and political trust (i.e. trust in parliament, trust in politicians, and trust in parties). Remember that higher (lower) correlations indicate higher (lower) criterion validity. The criterion validity analyses were conducted by estimating unstandardized OLS regression coefficients with the target questions as independent variables and the criterion questions as dependent variables. Table 3 reports the statistical results.
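
A minimal Stata sketch of this step for one hypothetical target/criterion pair is shown below: separate regressions per condition recover the two unstandardized coefficients, and an interaction model offers one possible way to test whether they differ; the variable names are placeholders.

    * Criterion validity: unstandardized OLS coefficient of a criterion question
    * regressed on a target question, estimated separately per condition.
    regress crit_benefits red1_01 if condition == 0   // original rating scales
    regress crit_benefits red1_01 if condition == 1   // improved rating scales

    * One way to test whether the two coefficients differ significantly:
    * interact the target question with the condition indicator.
    regress crit_benefits c.red1_01##i.condition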

Table 3 OLS regressions to determine criterion validity (unstandardized coefficients)

As shown in Table 3, the original questions on redistribution, governmental scope, social trust, and welfare chauvinism show slightly higher correlations with the criterion questions than their improved counterparts. There are only a few exceptions, such as the second governmental scope (gov 2) question on immigration. Even though the majority of the original questions show higher correlations, only two comparisons show significant differences: (1) the first redistribution question (red 1) on social benefits and (2) the first social trust question (soc 1) on trust in parliament. For the remaining comparisons, no significant differences exist. Altogether, these findings indicate that the original and methodologically improved questions have similar levels of data quality in terms of criterion validity.

5 Discussion and conclusion

The aim of this experimental study was to evaluate the response effort and data quality of established political solidarity measures, a subset of welfare state attitudes. Response effort was measured using response times (Bassili and Scott 1996; Fazio 1990; Höhne et al. 2017; Lenzner et al. 2010; Yan and Olson 2013), whereas data quality was determined by estimating criterion validity (see, for instance, Höhne and Yan 2020; Yeager and Krosnick 2012). The results indicate differences in response times, but almost no differences in criterion validity. In the following, we discuss the empirical findings in detail.

In the first step of our analysis, we investigated the answer distributions of the questions with the original and improved rating scales. The mean comparisons revealed almost no differences between the two conditions, even though the design of the rating scales differed substantially in some cases (e.g. four-point, completely verbalized rating scales with a non-substantive answer option vs. five-point, end verbalized rating scales without a non-substantive answer option). We see these findings as good news, because they show that established measures of political solidarities and related concepts are robust against rating scale effects. To put it differently, respondents’ answer behaviour was largely unaffected by the rating scale designs we examined.

In order to evaluate the response effort of the questions with the original and improved rating scales, we collected response times in milliseconds (i.e. the time elapsing between the presentation of the question on the screen and the submission of the online survey page). In doing so, we followed a long line of research in social psychology and survey research (Couper and Kreuter 2013; Yan and Tourangeau 2008). Our findings indicated substantial differences between the two rating scale conditions: response times were consistently higher in the original condition than in the improved condition. This indicates that the questions with the improved rating scales require less response effort than the questions with the original rating scales. Following Bradburn (1978), we argue that it is the responsibility of researchers not to increase response effort for respondents gratuitously, that is, when the increase is not expected to improve data quality. We thus recommend the use of the improved rating scale design in place of the original rating scale designs.

To evaluate data quality, we examined the criterion validity of the questions with the original rating scales and their improved counterparts. Specifically, we determined the strength of the associations between the experimentally manipulated target questions and the criterion questions that all respondents were asked. The results demonstrated almost no criterion validity differences between the two rating scale conditions, which indicates that the original and improved rating scales do not differ in data quality. Even though the original rating scales do not follow contemporary best practices, they can be considered equal in data quality to the improved rating scales. In our opinion, this is also good news, as it suggests that existing measures of political solidarities and related concepts are of good data quality in terms of criterion validity.

This study has some limitations that provide avenues for future research. First and foremost, we conducted our study in one country (Germany), even though some of the questions under investigation were selected from cross-cultural and cross-national surveys, such as the ESS. We therefore cannot draw any conclusions beyond Germany and thus call for further cross-cultural and cross-national research. Second, and relatedly, we used a quota sample from a non-probability access panel. This does not decrease the internal validity of our experimental study, but it might limit the generalizability of our empirical findings. Hence, it would be worthwhile to investigate the rating scale design of questions on political solidarities and related concepts using a probability-based sample to increase generalizability. Third, since the respondents in this study were members of an access panel who participate in web surveys on a regular basis, they may have a higher level of survey experience than the general population. Some research indicates that respondents with high survey experience differ from respondents with low survey experience in terms of response behaviour (Toepoel et al. 2008). For this reason, we recommend that future studies take respondents’ level of survey experience into account.

Despite its limitations, this study provides important insights into the impact of rating scale design on answer behaviour. Keeping in mind both our findings and contemporary best practices on rating scale design, we believe that a methodologically sound rating scale has the following characteristics: five points, end verbalization, and no non-substantive answer options. This applies, at least, to measuring political solidarities and related concepts. The improved rating scale design results in a level of data quality that is comparable to the original rating scale designs, but requires less response effort.