1 Introduction

Research on human wellbeing has been criticized for its Western centricity and disproportionate focus on samples from Western, educated, industrialized, rich, and democratic (WEIRD) contexts, with calls being made for scholarship that is more inclusive and representative of the global population (Lambert et al. 2020; Wong & Cowden 2022). This shift is necessary because the growing public health concerns that are facing the global human population (e.g., burden of mental illness) can only be fully understood and appropriately addressed by considering the sociocultural diversity of people living in different geographic contexts around the world (Counted et al. 2024; Herrman et al. 2022; Moitra et al. 2023). Recent efforts are seeking to address some of these criticisms more systematically, ushering in a new wave of “global wellbeing science” (Lomas 2022, p. 264). For example, the Global Wellbeing Initiative (a partnership between Gallup and the Japan-based Wellbeing for Planet Earth Foundation) was established to develop a more inclusive and well-rounded understanding of wellbeing around the world (Lomas et al. 2022). Although the emerging wave of global wellbeing scholarship represents an important opportunity to enrich the existing landscape of wellbeing science, there are also many challenges.

Aside from the practical hurdles of carrying out research on a global scale (e.g., funding to execute large-scale research, identifying and bringing together a unified team of collaborators from different countries), there are conceptual and interpretive challenges when attempting to quantitatively assess, describe, and compare the wellbeing of people living in different parts of the world. Wellbeing might be broadly understood as the relative attainment of a personal state of quality across the various dimensions of human existence (VanderWeele & Lomas 2023). Over the years, many theoretical frameworks of wellbeing have been proposed, and the conclusions that are drawn from cross-cultural research can vary based on the underlying theory guiding measurement. For example, whereas VanderWeele’s (2017) model of human flourishing identifies physical health as a key constituent of wellbeing, it is not part of the five pillars of wellbeing outlined in Seligman’s (2011) PERMA model. Conceptual distinctions between wellbeing models can make it difficult to compare cross-cultural evidence from studies that differ in the type of wellbeing model employed, which is further complicated by the possibility that certain models of wellbeing might align more closely with how wellbeing is understood and expressed in some cultures than in others. Moreover, theoretical frameworks employed in cross-cultural research on wellbeing often prioritize constituents of wellbeing that are thought to apply across cultures. Although there are advantages to such an approach (e.g., parsimony), an important trade-off is that culturally specific constituents of wellbeing may be underemphasized or overlooked (Höltge et al. 2022). There are also localized linguistic and cultural influences that shape how people experience, make sense of, and engage with the world around them, including survey items (Lomas 2019). To illustrate, researchers have found that the meaning of salient terms in the wellbeing literature (e.g., happiness) often varies across cultures and languages, and some cultures have a more liberal threshold for using certain words than others (Oishi et al. 2013; Wierzbicka 2004). These complexities suggest that any attempt to describe and compare human wellbeing across countries ought to be sensitive to potential variation in how individuals in different contexts understand the survey items that are presented to them (Smith 2004). In this study, we use multinational cognitive interview data from the survey development phase of the recently launched Global Flourishing Study (GFS) to evaluate a set of survey items linked to five central domains of personal wellbeing.

1.1 Overview of cognitive interviewing

Measurement challenges in survey research can affect the validity of conclusions that are drawn. Although difficulties in measurement cannot be completely eliminated, cognitive interviewing (also sometimes referred to as cognitive testing) is one systematic approach that can help with identifying and mitigating potential measurement issues (Lenzner et al. 2023). Because cognitive interviewing can provide insights into “the underlying manner in which survey respondents interpret and mentally process survey questions” (Willis 2015, pp. 359–360), it is routinely included as a key step in survey design, piloting, and refinement (Ryan et al. 2012).

Cognitive interviewing has traditionally relied on a four-stage model that captures the primary cognitive processes people engage in when they interpret and respond to survey items (Lenzner et al. 2023; Willis & Artino 2013): (1) comprehension (how do participants interpret the survey item?), (2) information retrieval (how do participants retrieve relevant information from memory to provide a response to the survey item?), (3) judgment (how do participants make a judgment about what their response to the survey item should be?), and (4) response selection (how do participants produce a response that maps onto the response options provided?). Issues with comprehension tend to dominate (Willis & Artino 2013), but response error may arise at any step of the four-stage cognitive process. For example, some evidence (e.g., Brenner 2017) suggests individuals may overreport frequency of religious service attendance (e.g., “Aside from weddings and funerals, how often do you attend religious services?”) because they interpret the question as being more about their religious identity (comprehension) or they provide a response that reflects their desired level of attendance rather than their actual attendance (judgment).

Although there is no single cognitive interviewing approach, commonly used techniques can generally be grouped into two categories: think-aloud and probing (Scott et al. 2021). In the think-aloud technique, individuals are asked to verbally describe their thinking as they respond to the question or item. Probing involves asking one or more follow-up questions (either immediately after the individual responds to the survey item or after they have completed the entire survey) to obtain insight into the participant’s thinking (Lenzner et al. 2023). Both techniques have advantages and disadvantages, and the relative utility of each often depends on the objective of cognitive interviewing. For example, think-aloud may contaminate ongoing cognitive processing of the survey item by requiring individuals to externalize their thinking, whereas probing has the potential to bias self-reported mental processes involved in responding to the item. Probing can be particularly advantageous for studying item comprehension because think-aloud may not lead participants to spontaneously report information that is useful (Willis 2004).

1.2 Cross-cultural cognitive interviewing

In the measurement literature, considerable work has been undertaken to empirically explore the utility of wellbeing measures across different cultures using quantitative analytic approaches. For example, research that seeks to establish the cross-cultural validity of wellbeing measures often makes use of statistical analyses that provide insight into whether a measure is invariant (i.e., functions similarly) across different cultures (Lack et al. 2022). Against the backdrop of rising concerns about questionable measurement practices (Flake & Fried 2020), there is growing recognition of the need for cross-cultural research (broadly referring to research involving cultural, linguistic, or geographic variation) on wellbeing to strengthen the robustness of measurement by integrating qualitative approaches alongside more commonly used quantitative approaches (Benítez et al. 2018). One such approach is cognitive interviewing.

The goals of cross-cultural cognitive interviewing align closely with those of standard cognitive testing (e.g., detecting problems with survey items), but a key difference is that cross-cultural cognitive interviewing is also used to elucidate whether the “range of interpretations associated with the evaluated items varies acceptably between cultural or language groups” (Willis 2015, p. 363). Whereas psychometric techniques are useful for determining whether noncomparability issues are present, cross-cultural cognitive interviewing can assist with identifying and addressing potential sources of noncomparability among groups (Willis et al. 2011). When differences in interpretation are observed among groups, researchers must decide if such disparities present a threat to the cross-cultural equivalence of the survey items and whether revisions to one or more versions of the survey (e.g., source version, target-language translated version) are necessary.

Existing research supports the usefulness of cross-cultural cognitive interviewing in cross-cultural research. Based on a review of 32 studies that applied cross-cultural cognitive interviewing, Willis (2015) found that the method can play an important role in identifying (1) terms, phrases, and response scales that are problematic in certain cultures, (2) cross-cultural variation in the interpretation of items and their intent, and (3) differences in the applicability of concepts across cultures. Procedural features of the studies that were included varied, with some indication that probing tended to be used more frequently than think-aloud. Although some probe varieties were less effective in certain populations and contexts compared to other varieties, Willis (2015) concluded that “probing appears to be effective for all cultural and language groups studied to date” (p. 390). Moreover, given some of the complexities associated with cross-cultural cognitive interviewing (e.g., logistical challenges related to recruiting and training interviewers in many countries) and the general difficulties that participants can have with think-aloud protocols, some scholars have suggested that it may be preferable for cross-cultural cognitive interviewing to focus on probing (Willis 2015).

1.3 Present study

The GFS is a large multinational panel study that aims to explore the determinants of, and relations among, different aspects of human wellbeing, involving more than 200,000 people across a geographically and culturally diverse set of 22 countries that were selected to provide substantial coverage of the global population (Crabtree et al. 2021; Johnson & VanderWeele 2022). It provides a rich and publicly available source of nationally representative survey data on various topics related to the wellbeing of individuals and communities in different parts of the world, with promising implications for researchers, policymakers, and practitioners involved in the promotion of human wellbeing. Given the significant potential of the GFS to inform and shape the conversation around human wellbeing, various stakeholders may benefit from insights about the comparability of the GFS survey items across groups (e.g., country, language) and possible sources of measurement disparities. Toward this end, in the present study we use multinational cognitive interviewing data from 22 countries to explore similarities and differences in the difficulty and comprehension of five GFS survey items related to personal wellbeing.

2 Method

Cognitive interview data for this study were taken from the translation phase of the survey development process that was undertaken in preparation for the GFS. The GFS survey assesses attitudes and perceptions related to five core domains of personal wellbeing—(1) happiness & life satisfaction, (2) mental & physical health, (3) meaning & purpose, (4) character & virtue, and (5) close social relationships—along with a number of other domains or aspects of human life (Crabtree et al. 2021). Cognitive interviews were performed with the preliminary version of the baseline GFS survey to explore how individuals in different countries interpreted the survey items and identify potential issues with comprehension. To limit participant burden during the cognitive interviews, two forms were created. Roughly half of the GFS survey items were included in each form. A total of N = 230 participants completed cognitive interviews (a minimum of 10 participants from each of the 22 countries). About half of the participants from each country completed Form A (n = 116), and the remainder completed Form B (n = 114). The present study used data from participants who completed Form A. Cognitive interviews were conducted during March–April 2021. All data collection was performed in accordance with the ethical standards of Gallup and with the 1964 Helsinki Declaration and its later amendments.

2.1 Interview format and process

Cognitive interviewing was facilitated by Gallup, which has an extensive history of testing, adapting, and translating survey items into numerous languages for use in many countries around the world (e.g., Gallup World Poll). The cognitive interviews followed a concurrent probing strategy in which participants first responded to the survey item being tested, and then one or more structured probes were used to explore their thought process as they reflected on the item and formulated a response. Given the large number of countries included in the survey development process, structured probes were used during cognitive interviewing to ensure that probing was aligned across countries for ease of analysis (Willis 2015). In developing the cognitive interview protocol, the preliminary version of the baseline GFS survey was first translated from English into the target languages of each country. During the translation process, the survey items were translated from English into the target language by an experienced translator. The translated version was independently reviewed by a second translator. Feedback was provided to the first translator, who adjudicated the feedback and either adopted the recommendations of the second translator or provided appropriate reasons for retaining the original translation. A team of experienced researchers at Gallup reviewed these decisions, with any outstanding issues resolved through consensus. Once translation was complete, survey items were selected for Forms A and B, and additional questions and statements addressing the purposes of cognitive interviewing (e.g., probes) were integrated into each form. Form A can be found in Supplemental Text 1.

Three categories of anticipated probes were used in Form A: (1) comprehension probes (e.g., “In your own words, what is this question asking?”), (2) feasibility probes (e.g., “Was this question easy or difficult to answer? If difficult, what made it difficult?”), and (3) response process probes (e.g., “Why do you say that?”). In this study, we focus exclusively on comprehension probes, and on one probe in particular (i.e., “In your own words, what is this question asking?”). There are several reasons for this. First, only a few survey items in Form A were accompanied by more than one probe, and comprehension probes were used more frequently than feasibility or response process probes. Because question or item comprehension has been identified as the first stage in the cognitive process of responding to a survey item (Willis 2004), focusing on a single comprehension probe enables us to perform an in-depth cross-country exploration of this foundational stage. Second, the probe of interest had comparatively broad coverage, and it was the only probe used with a set of five survey items that at least loosely mapped onto five core domains of personal wellbeing (VanderWeele 2017). Concentrating on a single probe thus provides an opportunity to explore similarities and differences in comprehension between countries on survey items that are linked to personal wellbeing (see Table 1).

Table 1 Global Flourishing Study Survey Items Analyzed

Cognitive interviews in each country were performed by interviewers affiliated with Gallup’s local research partners. All interviewers were trained in, and had extensive experience with, the methodology that Gallup uses to conduct cognitive interviews (Johnson et al. 2023). Gallup survey panels in each country were used to recruit a sample of participants that was diverse in age, gender, education, income, and urbanicity, in an attempt to align the characteristics of the cognitive interview sample with the intended participants of the GFS (Scott et al. 2021). Interviews were conducted via telephone using a standardized format. Participants began by responding to a set of sociodemographic items before completing the GFS survey items and any probes. If a survey item was followed by a probe, participants first provided a response to the survey item and then responded to any subsequent probes. For each survey item, interviewers used a three-point scale to rate the extent to which they observed the participant having difficulty responding to the item (1 = No difficulty at all, 2 = Some difficulty, 3 = A lot of difficulty). Responses to the survey items and probes were recorded verbatim. In countries where the interviews were conducted in a language other than English, interview responses were translated and transcribed into English.

2.2 Data analysis

2.2.1 Quantitative analysis

Interviewer ratings were used to explore the difficulty that participants had responding to the five survey items. We report the mode for item difficulty ratings in the total sample and within each country, along with the percentage of participants in the total sample and each country who were identified as having “a lot of difficulty” responding to each item. We replicated this analysis for each language.
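To make the form of this summary concrete, the following minimal sketch (in Python with pandas) illustrates how such mode and percentage summaries could be computed from a long-format table of interviewer ratings. The data frame, column names, and values are hypothetical placeholders, not the analysis code or data used in this study.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per participant-item pair, holding
# the interviewer's rating (1 = No difficulty at all, 2 = Some difficulty,
# 3 = A lot of difficulty). All names and values here are illustrative.
ratings = pd.DataFrame({
    "country":    ["Brazil", "Brazil", "Egypt", "Egypt"],
    "item":       ["discriminated against"] * 2 + ["free to pursue"] * 2,
    "difficulty": [3, 1, 3, 2],
})

# Modal difficulty rating for each item, in the total sample and by country.
mode_total = ratings.groupby("item")["difficulty"].agg(lambda s: s.mode().iat[0])
mode_by_country = ratings.groupby(["country", "item"])["difficulty"].agg(
    lambda s: s.mode().iat[0]
)

# Percentage of participants rated as having "a lot of difficulty" (rating = 3),
# in the total sample and within each country.
pct_total = ratings["difficulty"].eq(3).groupby(ratings["item"]).mean() * 100
pct_by_country = (
    ratings.assign(a_lot=ratings["difficulty"].eq(3))
           .groupby(["country", "item"])["a_lot"]
           .mean() * 100
)
print(pct_by_country)
```

Replicating the analysis by language would amount to swapping the country column for a language column in the grouping.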

For each of the five items, we assigned a code (no vs. yes) to participants based on whether they answered the probe correctly (i.e., did their response to the probe correspond with the question they were asked?). We report the percentage of participants in the total sample and each country who answered the probe correctly for each item.
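The probe-accuracy summary follows the same pattern; again, a hypothetical sketch in which the column names and values are illustrative assumptions rather than the study’s actual code:

```python
import pandas as pd

# Hypothetical no/yes codes: True when a participant's response to the probe
# corresponded with the question they were asked, False otherwise.
probes = pd.DataFrame({
    "country":  ["Argentina", "Argentina", "Ukraine", "Ukraine"],
    "item":     ["threat to life", "discriminated against",
                 "free to pursue", "discriminated against"],
    "probe_ok": [False, True, True, False],
})

# Percentage answering the probe correctly, in the total sample and by country.
pct_correct_total = probes.groupby("item")["probe_ok"].mean() * 100
pct_correct_by_country = probes.groupby(["country", "item"])["probe_ok"].mean() * 100
print(pct_correct_by_country)
```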

2.2.2 Qualitative analysis

Survey development researchers have increasingly called for the use of systematic approaches to analyze cognitive interview data (see Ridolfo et al. 2011). One suitable approach for analyzing cross-national cognitive interview data is the constant comparative method (Charmaz 2014), which is useful for capturing, filtering, and deciphering the relevance of cross-national interview responses in a systematic and rigorous manner (Ridolfo et al. 2011). The method involves asking broad questions of the data to investigate the interview responses participants provided. In this study, we applied the constant comparative method by formulating a guiding question (i.e., how did the participant interpret the survey question?) to explore how participants understood each of the five survey items. This approach provides an opportunity to strike a balance between gaining an understanding of the larger story (the essence of qualitative analysis) and identifying potential similarities and differences within and across countries. Because the constant comparative method is flexible enough to accommodate variation in responses to the comprehension probe for each item, we used all of the data and did not dismiss any variation in responses.

As in prior research with cognitive interview data (e.g., Beck et al. 2017), the qualitative analysis was performed by a lead analyst (the second author) with ongoing support from the first author, who was closely involved in the analytic process (e.g., engaging in discussions with the lead analyst about coding decisions, reviewing results at each step of the process). The analysis began by carefully reviewing all interview responses to the comprehension probe for each of the five items. Applying our guiding question to participants’ comprehension probe responses, we then captured participants’ interpretations of each item in a working inventory. Once all participants’ interpretations were individually captured in the inventory, initial within-country codes for each item were iteratively generated. Codes function as descriptors of textual data, documenting the concepts, interpretations, and relationships among responses within a country that can then be expanded to make comparisons between countries (Babchuk 2019). The within-country codes for each item were analytically vetted, collapsed, and elevated to form within-country categories, which were then expanded into an overarching cross-country theme for each item. This flexible and iterative process allows for systematic and simultaneous comparison within countries and then across countries. Throughout the analytic process, memos were written to guide and inform the generation of categories and themes. Memos serve as a key analytical feature within the constant comparative method, as they record and provide insight into the relationships between different stages of the analytic process (Charmaz 2014). All memos generated through the analytic process in this study are publicly available via the Open Science Framework (https://osf.io/x25jt/).
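The coding hierarchy this procedure produces (within-country codes, collapsed into within-country categories, expanded into one overarching cross-country theme per item) can be pictured as a simple nested data structure. The sketch below is purely illustrative: the class and field names are ours, the category label is hypothetical, and only the theme and the Kenyan codes are taken from the Results reported later; it does not represent tooling actually used in the analysis.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    """A within-country category formed by vetting and collapsing codes."""
    label: str
    codes: list[str] = field(default_factory=list)

@dataclass
class ItemAnalysis:
    """Record for one survey item: within-country categories roll up
    into a single overarching cross-country theme."""
    item: str
    theme: str
    categories_by_country: dict[str, list[Category]] = field(default_factory=dict)

# Illustrative entry built from codes reported in the Results for Kenya;
# the category label here is hypothetical.
life_worthwhile = ItemAnalysis(
    item="life worthwhile",
    theme="finding meaning in the process",
    categories_by_country={
        "Kenya": [Category(
            label="worth as value to self and others",
            codes=["satisfaction", "benefitting myself and others", "being content"],
        )],
    },
)
```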

3 Results

3.1 Sample characteristics

Sociodemographic characteristics of the sample are reported in Table 2. The mean age of participants in the total sample was 41.4 years (SD = 14.9); approximately half were female (50.9%), and slightly more than half had completed secondary education (55.2%) and lived in a rural area (53.4%). There was some heterogeneity in the sociodemographic characteristics of the participants across the countries. Cognitive interviews were performed in 19 different languages across the 22 countries. A single language was used in each country, except for India, where two languages were used (Bengali and Hindi). Only two languages were used in multiple countries (English and Spanish).

Table 2 Sociodemographic Characteristics of Participants

3.2 Quantitative analysis

Table 3 reports the results for item difficulty ratings in the total sample and within each country (see Supplemental Table S1 for results by language). In the total sample, fewer than 10% of participants were identified as having “a lot of difficulty” responding to any of the items. A similar pattern was found within each country, although participants in some countries appeared to have more difficulty responding to the items than participants in other countries. Specifically, there were six countries in which more than one participant had “a lot of difficulty” responding to an item, namely Brazil (Portuguese), Nigeria (Yoruba), and the Philippines (Tagalog) for discriminated against; Israel (Hebrew) and Russia (Russian) for threat to life; and Egypt (Arabic) for free to pursue.

Table 3 Descriptive Statistics for Interviewer Ratings of Item Difficulty by Country

More than 95% of participants in the total sample answered each probe correctly (see Supplemental Table S2). Participants in 15 of the countries answered all five probes correctly. In the other seven countries where incorrect responses were identified, the percentage of participants with such responses ranged from 10% to 40%. Countries in which two probes were answered incorrectly by at least one participant included Argentina (threat to life [20%] and discriminated against [40%]), the Philippines (life worthwhile [40%] and give up happiness [20%]), and Ukraine (free to pursue [40%] and discriminated against [20%]). One or more incorrect responses to a single probe were also found in China and South Africa (give up happiness [20% each]), India (life worthwhile [10%]), and Nigeria (discriminated against [40%]). The probes that had incorrect responses across the greatest number of countries were for the give up happiness (China, Philippines, South Africa) and discriminated against items (Argentina, Nigeria, Ukraine).

3.3 Qualitative analysis

Results from the qualitative analysis are reported in Table 4. The overarching cross-country themes for the five survey items were (1) “finding meaning in the process” (for life worthwhile); (2) “sacrificing” (for give up happiness); (3) “unhindered pursuit” (for free to pursue); (4) “threatening experiences” (for threat to life); and (5) “belonging” (for discriminated against). These themes capture the essence of how participants in the different countries interpreted each item, with many commonalities in interpretation shared across the countries. For example, in most countries there were codes for responses to the comprehension probe for the discriminated against item that referred to religion or a particular religious community. Along similar lines, codes for the give up happiness item in almost all countries referred to the notion of making a sacrifice by delaying, setting aside, deferring, or letting go of some short-term happiness or gratification. In addition, codes for the life worthwhile item in many countries reflected the idea that doing worthwhile things in life involves engaging in or pursuing something that is valuable, important, or beneficial to oneself and/or others.

Table 4 Overarching Cross-Country Themes, Within-Country Categories, and Within-Country Codes for Comprehension Probe Responses to Each Item

However, interpretations of the items varied to some extent across countries. For example, the Philippines was the only country in which codes for the comprehension probe to the discriminated against item referenced fraternities or gangs, and codes in only a small number of countries referred to political (i.e., Egypt, Israel, Russia, South Africa) or racial/ethnic groups (i.e., Germany, Poland, United Kingdom, United States). Similarly, few countries had codes for the give up happiness item that reflected ideas about leaving one’s comfort zone, being patient, or losing something (i.e., Brazil, Russia, South Africa). As a further illustration, only a small number of countries had codes for the threat to life item that referenced topics such as the COVID-19 pandemic (i.e., Indonesia, Mexico, Philippines, United States) or safety/security concerns (i.e., Kenya, Indonesia, Japan, Mexico, Russia).

There was also evidence of some within-country variability in responses to the probe for the items. The most striking illustration was the give up happiness item in Egypt, where consensus could not be reached on any codes because of the diverse range of responses that participants in the country provided to the probe for this item. As a more subtle example, codes for the life worthwhile item in Kenya were ‘satisfaction,’ ‘benefitting myself and others,’ and ‘being content,’ all of which have distinctive connotations. Along similar lines, codes for the free to pursue item in South Africa included ‘wishes in life,’ ‘persevering,’ and ‘freedom to pursue what you want,’ highlighting a plurality of interpretations for this item among South African participants.

4 Discussion

Using multinational cognitive interviewing data related to five core domains of personal wellbeing, we performed a cross-national analysis to explore item difficulty and comprehension of five GFS survey items (one for each domain) among 116 individuals from 22 countries. Our main findings are threefold. First, interviewer observations indicated that most participants in the total sample (90% or more) did not experience a lot of difficulty responding to each of the items, although difficulty ratings were somewhat heterogeneous across items and countries. Second, more than 95% of participants in the total sample correctly answered the comprehension probe for each of the five items, but there were several countries in which answers to the comprehension probe for one or more items did not align with the question that was asked. Third, qualitative analysis of responses to the comprehension probe revealed an overarching cross-country theme for each item, with some variability (both between and within countries) in the way that responses to the probe connected to each overarching theme. In addition to serving as a useful resource for researchers who work with the GFS data, this study’s findings offer further evidence supporting the value of integrating cognitive interviewing into the process of conducting cross-cultural research (Willis 2015).

Although most participants did not appear to experience a lot of difficulty responding to the five survey items, interviewer observations suggested that some items (i.e., threat to life, discriminated against) were slightly more difficult for participants than others (i.e., life worthwhile, give up happiness, free to pursue). Differences in interviewer difficulty ratings across the items may be a function of various factors, such as differences in item length, response format, or level of abstraction (Holbrook et al. 2006). Participants who were identified as having a lot of difficulty responding to an item were often concentrated in a small number of countries (e.g., 4/22 for life worthwhile). Few countries (at most three) had more than one participant who had a lot of difficulty responding to any one item (e.g., Brazil, Nigeria, and the Philippines for discriminated against), and none of the countries met this criterion for multiple items. These findings provide a general indication that the five items examined in this study may be suitable for use in different contexts, with the qualification that certain items may be more challenging for participants in some contexts than in others. In light of research that supports a link between item difficulty and how participants respond (Olson et al. 2019), researchers may want to consider the possibility that any between-country differences in responses to these (and potentially other) items in the GFS survey might also be a function of between-country differences in the difficulty participants experience in understanding and responding to the items.

Our analysis indicated that most participants correctly responded to the comprehension probe for each item, suggesting that the comprehension probe used for these five items (which was also used for several other items in the GFS survey) generally solicited responses that could provide insight into the way the participants understood the items. However, there was at least one country for each of the items in which one or more participants did not respond correctly to the comprehension probe, with incorrect responses distributed across 7/22 countries. The give up happiness and discriminated against items had incorrect responses across more countries than any other items (three countries for each), and more than one participant in the Philippines (for life worthwhile), Ukraine (for free to pursue), Argentina, and Nigeria (for discriminated against) provided an incorrect response to the probe. There may be several explanations for these findings, including the possibility that some participants misheard or misunderstood what the probe was asking (Patel-Syed et al. 2024). It is also possible that the comprehension probe we focused on in the present study might have been more problematic for participants in some countries compared to others (Willis 2015). Research is needed to carefully evaluate these possibilities further, as there are many factors (e.g., interviewer experience or style of interviewing) that could affect whether participants provide suitable responses to probes.

Qualitative analysis of participants’ responses to the comprehension probe for the five items unveiled an overarching theme for each item, providing evidence of a central thread connecting participants’ interpretations of each item across the countries. This pattern of findings suggests that participants share some common ground in their comprehension of these items. However, there was also some cross-country variability in how responses from participants in the different countries related to the overarching theme for each item. Our findings resonate with prior work that has reported evidence of between-country differences in the way that individuals understand and process survey items (e.g., Benítez et al. 2018; Thrasher et al. 2011), although this study is one of the first to document such evidence for measures related to personal wellbeing using data from numerous countries around the world. Based on these findings, researchers who use the GFS data might consider carrying out analyses separately by country, treating the same item across countries as closely related, but not identical, assessments. We also found that responses to the probe for each item varied to some extent across participants within the same country, suggesting that there may be more localized variation in interpretations of these items as well. It is not unreasonable for item comprehension to vary within a country, and such variation may be especially prevalent in more heterogeneous and culturally diverse populations (Johnson et al. 2006). As researchers analyze, describe, and evaluate responses to these (and perhaps other) GFS survey items, they ought to carefully consider the extent to which their findings might be affected by differences in how people interpret the items. More broadly, our findings align with assertions that the rigor of cross-cultural measurement work could be strengthened by employing qualitative approaches (e.g., cognitive interviewing) alongside widely used quantitative approaches (e.g., measurement invariance testing) to uncover a deeper and more nuanced layer of insights (e.g., identifying and understanding sources of bias) about the cross-cultural validity of items and measures (Benítez et al. 2018; Broesch et al. 2020).

4.1 Limitations and future research directions

The findings of this study should be considered alongside several limitations. First, the sample in each country was small (n = 5 in almost all countries), which may not have been sufficient to reach saturation within each country (i.e., the point at which no further insights would be obtained by conducting additional interviews). Recruiting a larger sample in each country could expand the range of potentially suitable analytic options, such as the possibility of exploring cross-country variability in qualitative data using quantitative techniques. Second, sociodemographic characteristics varied to some extent across the countries, which could account for differences that were identified between countries. Third, local interviewers were used in each country. Although cognitive interviews were standardized across countries and interviewers were required to have fundamental knowledge and experience with translation, cognitive interviewing, and survey research methodology, it is unclear whether variation in results between the countries might be attributable to differences in the characteristics of the interviewers. For example, it is possible that differences in interviewer ratings of item difficulty between countries were influenced by differences in interviewer characteristics (e.g., personality traits) rather than actual differences in the difficulty that the participants had responding to the item. Research is needed to better understand the extent to which cross-cultural cognitive interviews might be affected by interviewer characteristics (Willis 2015). Fourth, we chose to center our analysis on a single comprehension probe that was used in the cognitive interviews, which principally addresses the first of the four primary cognitive processes that are commonly used to explain how people interpret and respond to survey items (Lenzner et al. 2023). Our decision to focus on comprehension was largely based on practical considerations (e.g., it was rare that a survey item was accompanied by more than one probe, and the comprehension probe we selected was used more frequently in the cognitive interview protocol than other probes). However, future work should consider using a combination of probes that clearly tap into all four cognitive processes to ensure the suitability of a survey item can be evaluated more comprehensively. Fifth, we selected the constant comparative method to guide our qualitative analysis because it is particularly well suited to exploring similarities and differences across groups (Charmaz & Thornberg 2021). However, we cannot rule out the possibility that other insights might have emerged if an alternative analytic approach (e.g., thematic analysis) had been applied to the qualitative data. Sixth, we found that some participants in selected countries did not answer the comprehension probe correctly, which is one of the drawbacks of using a standardized approach to conduct cross-national cognitive interviews with structured probes. Given the small sample size within each country, our understanding of how participants in some countries might have interpreted certain items may be more limited. To enhance the utility of cognitive interview data, it may be important to ensure that cross-national cognitive interview protocols are implemented in a way that balances standardization with the flexibility to make modifications in certain circumstances. For example, even when a cognitive interview protocol is entirely scripted, interviewers could be encouraged to deviate from the protocol by employing techniques (e.g., spontaneous probes) that may help to maximize opportunities to learn from participants (Scott et al. 2021; Toni et al. 2024).

5 Conclusion

In summary, we used multinational cognitive interviewing data from the survey development phase of the GFS to explore similarities and differences in item difficulty and comprehension for five items that are linked to five core domains of personal wellbeing. Interviewer observations indicated that most participants did not find it very challenging to respond to these five items, and almost all participants responded correctly to the comprehension probe for each item. Our qualitative analysis of comprehension probe responses suggested that there is a common core or essence to participants’ interpretation of each item that is generally shared across the 22 countries, although there was also evidence of some between- and within-country variability in how participants interpreted each item. We hope that these findings will prove useful to researchers who plan to use data from the GFS to study human wellbeing and related topics, as well as to scholars who are interested in cross-cultural research more generally.