1 Introduction

Cross-cultural comparative research requires equivalent measurement in different cultures to draw valid conclusions. A crucial element in establishing equivalence is comparable translation. Questionnaire translation has been on the research agenda in cross-national social sciences since the late 1990s (Harkness and Schoua-Glusberg 1998). Considerable research since then has focused on translation and translation assessment methodology, bringing about, among other things, the team translation model TRAPD (Harkness 2003) as a countermodel to (simple) back translation (Brislin 1970). The methodology debate continues, including a revived discussion around the (non-)potential of the back translation method (Colina et al. 2017; Epstein et al. 2015) and technology-driven advances in terms of machine translation (Zavala-Rojas et al. forthcoming). However, the actual impact of different translation choices (incl. deviations and errors) on the resulting data often remains unclear. Thus, the AAPOR/WAPOR Task Force Report on Quality in Comparative Surveys suggests, as a future direction for the field of questionnaire translation, learning more about the impact of different translation options on the resulting data in order to be in a better position to guide translation activities: “[…] The future will need to see more qualitative and quantitative (experimental and evaluation) studies focusing on translation quality assessment.” (Lyberg et al. 2021, p. 64) (see also Smith 2020, who calls for more quantitative evidence). Already in 2008, Harkness et al. had uncovered many differences and errors in translations of the World Mental Health Survey Initiative, but they refrained from clearly attributing survey error or diminished interview experience to these; further research beyond expert assessments would be needed to indicate which differences or errors do indeed matter and which questions are robust despite differences between translations and the source instrument.
In medical research, a discipline that operates quite independently from cross-cultural survey methodology, translation versions of several instruments produced through different translation methods (e.g., with or without back translation) were compared to each other. These comparisons revealed no or only minor psychometric differences, even though, at least in part, accuracy and preference among the target population differed (Epstein et al. 2015; Hagell et al. 2010; Perneger et al. 1999). The potential robustness of an instrument regarding structure and item content was put forward as a potential reason, as was the cancelling out of imperfections in both translations (Perneger et al. 1999). Roberts et al. (2020) assessed a subjective well-being measure across scale formats, modes, as well as linguistic and cultural contexts. They found that translation versions and cultural differences contributed more to non-equivalence than scale format or mode. Repke and Dorer (2021) fielded a web survey with different Estonian and Slovene translation versions that were situated on a continuum between close and adaptive translation. Not considering conceptual challenges of the experiment here, they found some items to be sensitive to translation wording (i.e., small linguistic changes in translations had an impact on the data) and others to be more robust (i.e., the meaning of a concept remained the same despite differences in translation wordings). Behr and Braun (2023) implemented split-ballot experiments comparing German translations stemming either from a team translation approach or from a simple back translation approach. Despite obvious faults, at least at face value, in the version resulting from the back translation approach, there was in most cases no statistical difference between the two tested translation versions. Overall, these few published studies are comforting in the sense that not all differences or even errors matter.
This study builds on the quantitative results by Behr and Braun (2023) but adds qualitative evidence for the same respondents, thus making the study an innovative mixed-method study. The qualitative evidence comes from probing questions implemented in the same web survey. Using quantitative and qualitative evidence, we aim to explore the following questions: What do slight differences in translation mean for respondents’ answers? And what can we say about comparability to the English source?

While answering these questions, we will also showcase the usefulness of web probing for exploring the meaning of different translation versions. Web probing, understood as the implementation in web surveys of cognitive probes that are typically asked in cognitive interviews, has been developed and refined over the past 10 years (Behr et al. 2020). It has been used in post-hoc studies to understand suspicious data in main surveys (e.g., Meitinger 2017) but also in pretesting studies (e.g., Hadler et al. 2022).

2 Methods and data

The source items and the two German translation versions for each item, as used in this study, come from a research project in which the team approach towards translation (TRAPD) was compared against a simple back translation approach (described and analysed in detail in Behr & Braun 2023). In the present paper, we focus on different translation versions of the same item, regardless of which translation approach they are based on. Both translation versions represent more or less “plausible” solutions of the translation task, but we see differences on the purely semantic level, which trigger this research. The original British English items come from the ISSP (International Social Survey Programme) modules on “Social Inequality” (2019) and the “Environment” (2020).

2.1 Item selection

For the present case study, we selected the following three items, which were followed by a probe allowing us to exploit the usefulness of open-ended probing questions.

The first example is the item “[COUNTRY] should limit immigration in order to protect our national way of life.” The item was rendered by two different German language versions: In one of them, “way of life” had been translated by “Kultur” [culture] and in the other version by “Lebensweise” [way of living]. Even though both versions seem plausible, we see a (slight) meaning difference between these terms and are thus curious to learn about potential effects on the data.

The second example is the question “Thinking about your neighbourhood, to what extent, if at all, was it affected by the following things over the last twelve months? Air pollution/Water pollution/Extreme weather events.” The item text was supplemented with a translation note saying: “By ‘neighbourhood’ we mean the part of the town/city the respondent lives in. If he/she lives in a village, this can be taken as his/her ‘neighbourhood.’ ‘Affected’ refers to the impact on the neighbourhood.” In one version, “neighbourhood” was translated into German in a way that typically indicates the neighbours close to one’s home and/or proximity (“Nachbarschaft”) rather than the area one lives in more in general; the latter meaning, more in line with the translation note, was covered in the second version, which used “Wohngegend” [area where you live]. We were particularly interested in whether the narrower notion of “Nachbarschaft” would manifest itself in the data.

The third example is the question “Some people feel angry about differences in wealth between the rich and the poor, while others do not. How do you feel when you think about differences in wealth between the rich and the poor in [COUNTRY]? […]”. “Differences in wealth” was translated by “Wohlstandsunterschiede” [translation alluding to differences in economic welfare] in the first and by “Vermögensunterschiede” [translation alluding to differences in assets/property] in the second translation. Again, on the semantic level, there is a difference between these two German terms, but would this matter? “Wohlstand” alludes more to a high standard of living, a situation that grants economic security (Duden 2022), while “Vermögen” refers to property having a material value (Duden 2022) and as such is more concrete.

2.2 Split-ballot web study

The web survey was implemented both in German (Germany) and in English (Britain), but the English study is only ancillary to this article, as here only one version was implemented and probed. For the data collection, we commissioned the online access panel provider respondi. Respondents were recruited using a quota sample balanced according to age, gender, and education. Data were collected in November 2020. In the German questionnaire, our three items were among a total of 15 questions for which we randomized the question text (with an independent randomization at each of these questions). For the German survey, 1422 panellists clicked on the link to the survey. Of those, 361 were rejected due to a full quota, 37 were screened out, and 49 broke off. The break-off rate was 5.7%. For the British survey, 1438 panellists clicked on the link to the survey, 730 were rejected due to a full quota, 152 were screened out, and 69 dropped out. The break-off rate was 12%. Table 1 provides a brief overview of both the German survey and the British survey.

Table 1 Respondent characteristics in the web survey
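
As an illustration of the independent randomizations described above, the following minimal Python sketch draws one of two translation versions separately for each randomized question. It is not the panel provider’s implementation, which is not documented here; the function name `assign_versions`, the question identifiers, the seed, and the equal allocation probability are assumptions made for this example.

```python
import random


def assign_versions(question_ids, rng):
    """Draw one translation version (1 or 2) per question, independently."""
    return {qid: rng.choice((1, 2)) for qid in question_ids}


# Hypothetical identifiers for the 15 questions with randomized wording.
QUESTION_IDS = [f"q{i:02d}" for i in range(1, 16)]

# A fixed seed is used only to make the sketch reproducible.
rng = random.Random(2020)
respondent_assignment = assign_versions(QUESTION_IDS, rng)
```

Because each question is drawn separately, a respondent may receive Version 1 of one item and Version 2 of another.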

2.3 Probing

We followed each of the questions with open-ended probes to learn about the cognitive processes of respondents with the two different translation versions (Behr et al. 2020). Tables 2, 3 and 4 summarize the English source wording for items and probes alongside the two German translations. The Version 1 translation stems from the back translation approach; the Version 2 translation stems from the team translation approach.

Table 2 Item “way of life” and probe
Table 3 Item “neighbourhood” and probe
Table 4 Item “wealth” and probe

2.4 Coding schemes for probe answers

The coding scheme was developed inductively and drew on answers from both German and British respondents in order to capture potentially country-specific answer patterns (Behr 2015). The answers were coded in their original languages, i.e., German and English. The open-ended answers offered by the respondents were coded by either the second or the third author. Of the answers, 15% were double coded by either the second or third author. Intercoder reliability (Holsti’s coefficient) for “way of life” reached 85% (German answers) and 76% (English answers), respectively; for “neighbourhood” it reached 92% (German answers) and 85% (English answers), respectively; and for “wealth” it reached 80% (German answers) and 69% (English answers), respectively. Differences were discussed and reconciled for the final dataset. Tables 5, 6 and 7 show the categories, a short description, and an example for the categories for each of the items.

Table 5 Coding schema for “way of life”
Table 6 Coding schema for “neighbourhood”
Table 7 Coding schema for “wealth”
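
Holsti’s coefficient is commonly defined as 2M / (N1 + N2), where M is the number of coding decisions on which both coders agree and N1 and N2 are the numbers of decisions made by each coder. The following Python sketch illustrates the calculation. How agreement was counted for answers with several codes is not reported above, so treating an answer as an agreement only if both coders assigned exactly the same set of codes is an assumption, as are the function name `holsti` and the invented toy data.

```python
def holsti(coder_a, coder_b):
    """Holsti's coefficient 2M / (N1 + N2) for two coders.

    coder_a, coder_b: dicts mapping an answer id to a frozenset of codes.
    M counts answers coded by both coders with identical code sets
    (assumed agreement rule, see lead-in).
    """
    total = len(coder_a) + len(coder_b)
    if total == 0:
        raise ValueError("no coding decisions to compare")
    shared = coder_a.keys() & coder_b.keys()
    agreements = sum(1 for key in shared if coder_a[key] == coder_b[key])
    return 2 * agreements / total


# Invented toy data: four double-coded answers, one disagreement (answer 2).
coder_a = {
    1: frozenset({"religion"}),
    2: frozenset({"language", "history"}),
    3: frozenset({"mismatch"}),
    4: frozenset({"other"}),
}
coder_b = {
    1: frozenset({"religion"}),
    2: frozenset({"language"}),
    3: frozenset({"mismatch"}),
    4: frozenset({"other"}),
}
reliability = holsti(coder_a, coder_b)  # 2 * 3 / (4 + 4) = 0.75
```

When both coders code the same answers, as in the toy data, the coefficient reduces to the share of answers with identical codes.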

2.4.1 Categories present in all coding schemes

There are a few categories present in all coding schemes. “Non-response” means that respondents leave the text box blank; “don’t know” that they explicitly state that they do not want or are unable to answer; “mismatch” that they give an answer but not to the actual question at stake (e.g., they give reasons for their choice of an answer category with the closed item and do not communicate their understanding of the keywords in question); “unproductive answer” that they give responses without a substantive meaning; and “other” that the answer is substantive but cannot be accommodated in the category scheme. These codes mostly do little to illuminate which concepts respondents had in mind. However, they can be useful in pointing to comprehension problems or difficulties in understanding specific concepts. The non-substantive codes and the “other” code are exclusive, that is, only one of these codes is assigned to an answer. This does not apply to the substantive codes, which can be combined with one another.
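
The exclusivity rule can be expressed as a simple consistency check on coded answers. The Python sketch below is one possible reading of the rule and not part of the original coding procedure: it assumes that an exclusive code must stand alone (i.e., it is not combined with substantive codes either) and that every answer receives at least one code. The name `is_valid_coding` is hypothetical.

```python
# Codes that, per the rule above, cannot be combined with other codes.
EXCLUSIVE_CODES = {
    "non-response",
    "don't know",
    "mismatch",
    "unproductive answer",
    "other",
}


def is_valid_coding(codes):
    """Check the code set of one answer against the exclusivity rule."""
    codes = set(codes)
    if not codes:
        return False  # assumption: every answer receives at least one code
    if codes & EXCLUSIVE_CODES:
        return len(codes) == 1  # assumption: an exclusive code stands alone
    return True  # substantive codes may be combined with one another
```

For example, `{"language", "religion"}` passes the check, whereas `{"mismatch", "religion"}` does not.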

The substantive categories are question-specific and we are going to address them now in turn.

2.4.2 Substantive codes for “way of life”

Most of the substantive codes, as presented in Table 5, are self-explanatory: “traditions & cultural practices”, “language”, “(fundamental) values”, “religion”, “history”, as well as “law and system of justice”. “Social system” concerns (elements of) the social system of a country, such as education and health care. “Environment” refers to everything related to environment and nature, including the corresponding behaviour of individuals. “Diversity of population” is coded if respondents mention cultural diversity as a positive trait of a society, “stereotypes” if negative stereotypes on the side of the majority population are criticized, and “concept does not exist” if the concept of a national “way of life” is regarded as meaningless.

2.4.3 Substantive codes for “neighbourhood”

The substantive codes for “neighbourhood”, presented in Table 6, mostly refer to the closeness versus distance of the area with regard to the respondent. There is a clear gradation from “house/estate/immediate area”, via “village/rural area” and “city/community/county”, to “region/country.” In addition, there is an extra code for “unclear area codes.” An exception to this coding principle is “neighbours”, which is used when respondents mention neighbours as people and not as an area category.

2.4.4 Substantive codes for “wealth”

The substantive codes for “wealth” are presented in Table 7. “Salary/wages from work” and “assets/property” are self-explanatory. “Monetary wealth” is coded in particular if the source of wealth is left unspecified. “Lifestyle positive” and “lifestyle negative” refer to the consequences of wealth and poverty, respectively, and to those living in these two conditions. “Access to resources” is coded if people mention access (or the lack thereof) to mostly public resources, such as healthcare and education.

3 Results

Firstly, we will present results quantitatively by looking at test statistics (t test) for the split-ballot experiments for our three items. Since these items were not part of multi-item batteries, equivalence tests could not be implemented. Secondly, based on the open-ended probes, we will investigate the associations that the different translation wordings trigger among respondents. Lastly, we will regress the dependent variables on respondents’ associations (that is, the categories of a coding scheme) in order to identify to what extent different understandings may lead to different survey results.
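
To make the first step concrete, the following Python sketch compares the means of two split-ballot groups. The text above does not specify the variant of the t test, so the use of Welch’s statistic (unequal variances) is an assumption, as is the large-sample normal approximation of the p-value; the function name `welch_t` and the answers are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(sample_1, sample_2):
    """Return Welch's t statistic and an approximate two-sided p-value."""
    n1, n2 = len(sample_1), len(sample_2)
    standard_error = sqrt(variance(sample_1) / n1 + variance(sample_2) / n2)
    t = (mean(sample_1) - mean(sample_2)) / standard_error
    # Normal approximation to the t distribution; reasonable only for
    # large groups, not for toy data of this size.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Invented answers on a 5-point scale for two translation versions.
version_1 = [1, 2, 2, 3, 3, 3, 4, 4, 5, 3]
version_2 = [2, 3, 3, 4, 4, 4, 5, 5, 4, 3]
t_stat, p_value = welch_t(version_1, version_2)
```

A negative `t_stat` here indicates a lower mean in Version 1 than in Version 2. For real analyses, a statistics package with an exact t distribution would be preferable to this approximation.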

3.1 Quantitative testing: results from the web survey

The three sets of items included in the web study are presented in Table 8 along with the results. The complete results from the web survey are presented in Behr and Braun (2023).

Table 8 Quantitative results of web survey (n = 972)

Two of these items do not show any quantitative differences (regarding the means) between translation versions: the different translations of “way of life” and “neighbourhood.” Quantitative differences could be found for “wealth” only. The following paragraphs address these items in turn by adding the qualitative evidence.

3.2 Qualitative evidence from probing

3.2.1 “Way of life”

While the quantitative assessment did not show any differences, the probing question reveals significant differences between the two translation versions with regard to which aspects come to respondents’ minds (Table 9). On the one hand, “culture” produces fewer mismatching responses and fewer responses evoking elements of the “social system” than “way of living.” On the other hand, it triggers more responses referring to “language” and “(fundamental) values” than “way of living.” Roughly comparing these results to the British figures, the German translation “Lebensweise” seems closer to the English “way of life” in this particular context, but the means comparison shows that slight meaning shifts are not detrimental to this item.

Table 9 Frequencies of the categories for “way of life” in different subsamples

A regression allows us to delve deeper into the impact of the significantly different associations that respondents have in mind when being presented with different renderings of the English term “way of life” (Table 10). Both the “mismatch” category (strong in the “way of living” version) and the “(fundamental) values” category (strong in the “culture” version) have an impact on the response; both are linked to more hostile reactions to immigration.

Table 10 Impact of respondents’ associations on the closed item “way of life”
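
The logic of regressing a closed item on the split variable and on dummy-coded associations can be sketched as follows. This is a hedged illustration, not the model reported in Table 10: the data are invented, the scale direction (higher values meaning stronger agreement with limiting immigration) is assumed, the helper `ols` is hypothetical, and only point estimates are computed, without standard errors or significance tests.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Columns: intercept, split (0 = Version 1, 1 = Version 2),
# "mismatch" mentioned, "(fundamental) values" mentioned.
X = [
    (1, 0, 0, 0), (1, 1, 0, 0),
    (1, 0, 1, 0), (1, 1, 1, 0),
    (1, 0, 0, 1), (1, 1, 0, 1),
]
# Invented answers to the closed item.
y = [2, 2, 3, 3, 3, 3]
coefficients = ols(X, y)
```

In these toy data, the coefficient of the split variable is zero while both association dummies raise the predicted answer, which mirrors the pattern of equal means across versions despite different associations.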

3.2.2 “Neighbourhood”

If “neighbourhood” is translated in a close way as “Nachbarschaft”, it is neighbours who come to mind in 8% of the cases in Germany, while with a rendering as “Wohngegend” [area of living] no one in our German sample mentions the neighbours. In Britain, 1% of the respondents think of neighbours—which comes close to the German figure in the case where “Wohngegend” is used. Other than that, there is also some difference with the smallest distance area “house/estate/immediate area” which is more frequently mentioned by the Germans who received the closely translated “Nachbarschaft” version. The British mention this category still a bit more frequently. There are no significant differences between the two German-language versions for the other categories (Table 11). In sum, both German translations share similarities and differences with the English source version. As the means comparison shows, these shifts are not detrimental.

Table 11 Frequencies of the categories for “neighbourhood” in different subsamples

With a regression, we examine the impact of the significantly different associations of respondents (Table 12). Only code 4, “house/estate/immediate area”, shows a small significant negative effect on air and water pollution as well as on extreme weather events, which is likely due to these events or types of pollution not necessarily affecting the smallest area around one’s home. Only the results for one of the items (“air pollution”) are shown below.

Table 12 Impact of respondents’ associations on the closed item “air pollution”

3.2.3 “Wealth”

The third example was based on the question “Some people feel angry about differences in wealth between the rich and the poor, while others do not. How do you feel when you think about differences in wealth between the rich and the poor in [COUNTRY]?” “Differences in wealth” was translated as “Wohlstandsunterschiede” [alluding to differences in economic welfare] in one version and “Vermögensunterschiede” [alluding to differences in assets/property] in the other. Here we uncovered significant differences in the means. However, in addition to the above-mentioned translation difference, there was another one regarding the rendering of “feeling angry” and its integration into the response scale: Version 1 was more stilted (“Ärger verspüren”) and Version 2 more colloquial (“sich ärgern”); Version 1 expressed the end point of extreme anger in a potentially more extreme way (“äußerst großer Ärger”) than Version 2 (“sehr stark darüber geärgert”).

As for the qualitative results from probing, there are several significant differences between the two language versions, as Table 13 shows: Mismatching answers are less frequent in the “differences in assets/property” version. This might simply mean that it is easier for respondents to figure out what “differences in assets/property” means than what “differences in economic welfare” means. On the substantive side, “salary/wages from work” as well as “assets/property” come to mind more easily when “differences in assets/property” is asked for. By contrast, (negative) aspects of lifestyle come to mind more easily with the “differences in economic welfare” version. In sum, both translations share differences and similarities with the British understanding: While “Vermögensunterschiede” triggers “salary/wages from work” as well as “assets/property” in a similar way to the British term “differences in wealth,” the German “Wohlstandsunterschiede” comes closest to the British term when it comes to triggering lifestyle aspects.

Table 13 Frequencies of the categories for “wealth” in different subsamples

A regression for this item reveals that associations of respondents with “assets/property” significantly reduce anger, while the opposite is true for associations of respondents with a “negative lifestyle” (Table 14). The lack of significant results in the base model when looking at the comparison between translation versions (split variable) contrasts with the results in Table 8. This is due to the regression in Table 14 being based on a smaller n, as only a random sample of respondents received a probe question.

Table 14 Impact of respondents’ associations on the closed item “wealth”
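
Why a smaller n can turn a significant difference into a non-significant one can be illustrated with a short calculation. In the Python sketch below, the mean difference and the standard deviation are invented; the group sizes of 486 (half of the n in Table 8) and 220 (roughly the number of probe respondents per version) are used only as plausible orders of magnitude, and equal group sizes and equal variances are assumed. The function name `t_for_equal_groups` is hypothetical.

```python
from math import sqrt


def t_for_equal_groups(mean_diff, sd, n_per_group):
    """t statistic for two equally sized groups with a common SD."""
    return mean_diff / (sd * sqrt(2 / n_per_group))


MEAN_DIFF = 0.2  # invented difference between version means
SD = 1.2         # invented common standard deviation

t_full = t_for_equal_groups(MEAN_DIFF, SD, 486)   # full split-ballot sample
t_probe = t_for_equal_groups(MEAN_DIFF, SD, 220)  # probed subsample
```

With these assumed values, the same mean difference exceeds the conventional threshold of 1.96 in the larger sample but falls below it in the smaller one.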

4 Discussion

For the first item, the means comparison was not significant, which is, first of all, good news. While “way of living” is a bit closer in meaning to the English source based on respondents’ open-ended associations, “culture” is not too far off. Overall, both German translation versions share core meanings, in particular related to “traditions and cultural practices”, “(fundamental) values,” “religion,” and “language.” Exceptions are related to the categories of “environment” and “history”: responses related to these latter categories are only mentioned in relation to one translation but are not central to respondents and may be regarded as “fuzzy edges” of the concepts (cf. “prototype semantics”, Kussmaul 1994, 2007). The regression provides a potential argument as to why certain translation versions may not lead to different results. Two answer patterns (“(fundamental) values” and “mismatch”), the first stronger in the version with “culture” and the second stronger in the version with “way of living”, led to more xenophobic reactions among respondents. Since these patterns are present in both translated versions, this may be the reason why we do not see differences in the data between translation versions. Why may “(fundamental) values” and “mismatch” responses lead to more xenophobic reactions? “(Fundamental) values” seem to be particularly sensitive and worthy of protection. For understanding the role of “mismatch” responses, a closer look into the German item wording seems useful. Literally back translated, the translations read: “Germany should limit immigration to maintain our own way of living.” “Our own” may signal to some respondents that their personal way of living may be at risk, which is why we might see the increased number of mismatches. For the present item, typical “mismatch” responses are xenophobic answers, such as “we already have too many, partly dangerous cultures, with us” (ID 326).
In the end, which translation version is better and more comparable to the source? While “Lebensweise” is a bit closer to “way of life” than “Kultur,” at least when based on the open-ended probe responses, ultimately both versions function comparably; no translation can be ruled out as inadequate.

For the second example, the means comparison shows that the wording “Nachbarschaft” (alluding to neighbours/proximity) was not as problematic as it had seemed at the beginning. While a certain percentage of respondents thought of their neighbours, the other associations testify that, overall, the term “Nachbarschaft” covers the meaning of “neighbourhood” in an equivalent way. The same holds true for “Wohngegend.” Slight differences compared to the English associations for both translations show that there is not necessarily full equivalence between a source word and its translation(s); translation strives to be an approximation at best (Gile 1995). A certain loss or gain needs to be accepted (Munday 2016). Besides, with this item, we are likely to cover country-specific patterns of dwelling, which is why full comparability to associations in the English source is not likely anyway. We would also like to emphasise that the item context may have helped respondents to understand the notion of “Nachbarschaft” in the meaning as intended by the item. Based on the listed types of pollution or extreme weather events, it may be that most respondents deduced that larger areas, beyond the closest proximity of neighbours, are meant anyway (Behr & Braun 2023). In this context, we want to refer to Harkness et al. (2010), who stressed the importance of context both in questionnaire design and translation, calling for a theoretical framework that fully accommodates context. Even though the regression indicated that associations of “house/estate/immediate area” can affect the closed survey responses, this finding was, overall, not decisive. Which version is better and more comparable? In the end, both versions function in a comparable way; no translation can be ruled out as inadequate.

For the third item, the means comparison signalled a difference between the translation versions. Even though we see that core meanings are covered in both translations, two meaning dimensions were distributed quite differently across the translation versions: “Assets/property” was particularly strong with “Vermögen”; this is in line with the dictionary definition of “Vermögen”, which stresses material property (“gesamter Besitz, der einen materiellen Wert darstellt”, Duden 2022). Typical answers refer to “real estate” [“Immobilien”], “property” [“Eigentum”], “shares” [“Aktien”], and “inheritance” [“Erbe”]. The association of “negative lifestyle” was particularly strong with “Wohlstand”; this equally fits the dictionary definition of “Wohlstand”, which stresses the notion of living standard or economic security (“Maß an Wohlhabenheit, die jemandem wirtschaftliche Sicherheit gibt; hoher Lebensstandard”, Duden 2022). The following is an example answer: “I thought about the fact that there are people who can throw money out of the window, while others would starve to death without the food banks and struggle every month not to lose their homes.” For these two associations, the effects in the regression go in different directions; these results, however, do not allow us to explain the results provided in Table 8, where the full sample is used for the t test. The version with “Vermögensunterschiede” had triggered significantly more anger in the larger sample; this does not fit the finding that respondent associations of “assets/property” reduce anger. As noted earlier, the translation of “feel angry” and its integration into the scale also differed between the two translation versions. In the end, this may have been the main driver of the differences, superseding all other differences in wording.
With the above results in mind, we venture to suggest that the more colloquial translation of “feeling angry”, coupled with a potentially less extreme endpoint label, paved the way for selecting more extreme response options more easily. Different translations of “feeling” and of scales are certainly worth pursuing in future research (Perneger et al. 1999 list examples of different “feeling” translations and their impact; Villar 2009 shows the effects of different translation-related changes in response labels). Returning to the translation of “wealth,” which translation is better and more comparable to the source? Looking at the distribution of the open-ended responses and their comparison with the English responses, no final decision can be taken. In the end, we see that languages cut up reality differently and that we need to take decisions, even if these are trade-off decisions. On the semantic level, the term “Vermögen” is certainly closer in meaning to “wealth,” but the wider associations regarding positive and negative lifestyle, likely triggered in the context of the entire item, bring “Wohlstand” closer to “wealth.” Either translation can be chosen in this context, but neither translation seems to be a fully equivalent match here.

5 Conclusions

The study has shown that different translation versions do not necessarily have an impact on the resulting data, even though they can, as testified by the third item. If there are different associations with different translation versions, their effects may cancel out the difference between versions, that is, the effects go in the same direction in both versions. If this happens, either translation version works well. However, translation versions can also lead to associations resulting in opposite response behaviour, which may then have an impact on survey data. The latter we could not prove with our data, though, since additional translation differences blurred the results. What does this mean for translation? We are still at the beginning when it comes to understanding what different translation versions mean to respondents, and more research is certainly needed. For the time being, we can only emphasise best practice and recommend bringing together a team of skilled persons. Together, these should ideally have high proficiency in the respective languages; translation know-how, including research skills (using dictionaries, the web, etc.); substantive knowledge of the concepts to be measured; and questionnaire design/field experience. They should be encouraged to think of the core meanings of the English source item and how these could be covered by a translation. This may put different translation options on the table, all of which may even fit the context of a given item. And where subjective knowledge ends, empirical testing, such as the web probing used in this study, can help to shed light on respondents’ interpretation and to take final decisions that bring the translation as close as possible to the meaning of the source text. This study provides a blueprint for such empirical testing.

Further research should attempt to systematically link translation decisions and reasoning to data outcomes. This may be achieved by drawing on written documentation on particular decisions (cf. Behr & Zabal 2020) or by accessing recordings of think aloud translations or team discussions (e.g., Behr 2009, Dorer 2020). Thus, we could learn more about successful translation strategies that lead to the desired outcome in the data.

6 Limitations

For our web survey, we recruited respondents from a non-representative online access panel. Hence, the population we reached is not representative of the general population in Germany. What the illiterate population, for instance, or those less active in the digital world will make of these items can therefore not be answered.

Web probing suffers from non-response; to some extent, we compensated for this by having a rather high number of web probing respondents (more than 220 per probe and version). Moreover, most types of non-response did not differ between translated versions; but if they differed, we tried to suggest reasons related to specific translation wordings.

Testing three items and their translations, and this only in one language, does not allow for generalization of findings, but the study provides an insightful glimpse into what different translations may mean to respondents. The results invite further research into decision-making in translation and into the impact of translations.

The regressions are only suggestive; we start in each case with non-significant results, partly based on small n. These analyses should therefore encourage researchers to take up the resulting suggestions and integrate them into more nuanced and targeted research in different languages.