Comparing Likert Scale Functionality Across Culturally and Linguistically Diverse Groups in Science Education Research: an Illustration Using Qatari Students’ Responses to an Attitude Toward Science Survey

  • Ryan Summers
  • Shuai Wang
  • Fouad Abd-El-Khalick
  • Ziad Said


Surveying is a common methodology in science education research, including cross-national and cross-cultural comparisons. The literature surrounding students’ attitudes toward science, in particular, illustrates the prevalence of efforts to translate instruments with the eventual goal of comparing groups. This paper utilizes survey data from a nationally representative cross-sectional study of Qatari students in grades 3 through 12 to frame a discussion around the adequacy of common adaptations and the extent to which they allow comparisons to be made among linguistically or culturally different respondents. The analytic sample contained 2615 students who responded to a previously validated 32-item instrument, 1704 of whom completed the survey in Modern Standard Arabic and 911 in English. The purpose of using these data is to scrutinize variation in the performance of the instrument between groups of respondents as determined by language of survey completion and cultural heritage. Multi-group confirmatory factor analysis was employed to investigate issues of validity associated with the performance of the survey with each group and to evaluate the appropriateness of using this instrument to make simultaneous comparisons across the distinct groups. Findings underscore the limitations of group comparability that may persist even when issues of translation and adaptation are carefully attended to during instrument development.


Keywords: Attitudes toward science · Cross-sectional · Multi-group CFA · Translation · Validity

With their versatility and capacity for exploration, description, or explanation (Vaske, 2008), coupled with potential for broad coverage (Campbell & Katona, 1953), surveys are an attractive methodological option. Science education researchers have drawn attention to the large number of instruments that have been developed, validated, and administered in pursuit of a broad array of goals (e.g. Boone, Townsend & Staver, 2011). In designs involving surveys, researchers frequently plan to address questions that involve more than one group (e.g. ethnic, cultural, and linguistic), for direct or indirect comparisons, and they may wonder if any causal relationships uncovered hold across the different groups. To address questions of this nature, it is important that the researcher first knows whether an instrument is valid for studying different groups of interest (Wang & Wang, 2012). A concern has been raised in the science education research literature with respect to both language translation and cultural adaptation of survey instruments (Amer, Ingels & Mohammed, 2009).

Methodological considerations—namely, instrument translation and administration to ethnically and culturally diverse groups—and the efficacy of such efforts are a serious issue with implications for global scholarship. Using a recent and rigorously developed instrument, made available in two languages, this paper explores the fundamental issue of best practices for producing valid and reliable translations for use in multiple contexts. Specifically, the question is whether even these practices were enough to ensure comparability of responses. Cross-sectional data from Qatari precollege students about their attitudes toward science, collected on a forced-choice survey utilizing a 5-point Likert response format, are used as an illustrative case in point. This manuscript provides stepwise comparisons on the basis of survey language and cultural affiliation in Qatar. Qatar, a nation with a total population of over two million, naturally supplies the variety of responses desired for this comparison. Approximately 40% of the population in Qatar is Arab, either Qatari citizens or Non-Qatari Arabs, and the remaining 60% is comprised of Non-Arabs who live and work in Qatar (Central Intelligence Agency, 2013). The official spoken language in Qatar is Arabic, and English is a commonly used second language, but a variety of other languages are often used by expats. Beyond ethnic and linguistic diversity, Qatar exemplifies considerable cultural diversity among the groups residing within its borders. In this context, the present study aims to investigate the possible impacts on the trustworthiness of Likert-type assessments that are used across linguistic and cultural boundaries—be it within the same national context or across nations.

The “Qatari students’ Interest in, and Attitudes toward, Science” (QIAS) project was organized to identify and examine factors that impact student attitudes toward science across the precollege learning experience. Over the past 25 years, Qatar has made a concerted effort to move toward a knowledge-based economy anchored in scientific research production (Qatar Foundation, 2009). However, the relatively low number of students pursuing science and engineering at the college level threatened this goal. The QIAS project aimed to better understand students’ perceptions of science and their intentions to continue studying science at the post-secondary level. To achieve this goal, QIAS adopted a cross-sectional design (Gall, Borg & Gall, 1996) drawing from a random nationally representative sample. Previous publications from this project have centered on instrument development and validation (Abd-El-Khalick, Summers, Said, Wang & Culbertson, 2015), and the analysis of student data collected in Arabic. This latter study into Qatari students’ attitudes toward science, and related factors, reiterated the importance of nuanced examinations of context and culture given discrepancies observed between Qatari Arabs and Non-Qatari Arabs (Said, Summers, Abd-El-Khalick & Wang, 2016). The present manuscript utilizes responses collected from Qatari students about their attitudes toward science, and related constructs, to highlight methodological and assessment issues, namely, instrument translation and administration to diverse groups and the efficacy of such efforts as they relate to the assessment of students’ attitudes toward science by self-report.

Concerns About Instrument Validity and Translation Practices

General concerns about the psychometric properties, including validity, of instruments purporting to measure students’ attitudes toward science are well documented (Blalock, Lichtenstein, Owen, Pruski, Marshall & Topperwein, 2008) and persistent (Potvin & Hasni, 2014). The use of survey data collected on precollege students’ attitudes toward science, and related constructs, is similarly well suited to—arguably even demanding of—this type of investigation. Concerns about instrumentation and validity are further compounded by the limited consideration given to the large body of literature surrounding cross-cultural translation (e.g. Brislin, Lonner & Thorndike, 1973). Common practice, it seems, is to take instruments, including those intended to assess students’ attitudes toward science, that have been developed in a Western context and administer them elsewhere with little regard for content or psychometric validity. Guillemin, Bombardier and Beaton (1993) illustrate multiple scenarios in which the original, or source, language must be carefully adapted for use with a differing target population. The authors describe the most extreme example as involving the administration of an instrument in a different culture, language, and country. A synthesis of research by Beaton, Bombardier, Guillemin and Ferraz (2000) from sociological, psychological, and medical literature offers the following summary of steps for cross-cultural adaptation: forward translation into the target language, review to address arising issues, back translation from target to source language, review by content and language experts, and testing, ideally coupled with participant feedback.

In terms of procedure, Harkness and Schoua-Glusberg (1998) explicate that forward, or direct, translation from source to target language (McKay et al., 1996) is the simplest and least costly, but carries a host of disadvantages. What might appear to be an attractive option with only one translator or bilingual researcher involved comes with an overreliance on the individual’s perceptions, skills, and awareness of any relevant regional differences (Sechrest, Fay & Zaidi, 1972). Nonetheless, there are many cases wherein translation involved one (e.g. Rashed, 2003) or a small group (e.g. Turkmen & Bonnstetter, 1999) of bilingual researchers and/or assistants in the adaptation of a previously validated English-language instrument to the target language of interest. Perhaps most disconcerting are cases where detailed methodological discussion is absent—an occurrence noted in multiple disciplines (Liaghatdar, Soltani & Abedi, 2011; Sperber, Devellis & Boehlecke, 1994).

Concerns About Translation Attitudes Toward Science Instruments

Extant studies aiming to assess students’ attitudes toward science have often aimed to provide cross-cultural comparisons, and consequently issues related to cross-cultural validity are abundant. Many studies using the Test of Science-Related Attitudes (TOSRA), a well-regarded and widely used instrument first developed for use in Australia (Fraser, 1981) and later used in the USA (Khalili, 1987), exemplify the concerns previously outlined. Since its inception, the TOSRA, which includes seven sub-scales, has been administered in a variety of contexts, in its entirety or a portion thereof, including translations in Bahasa Indonesia (Adolphe, 2002), Mandarin (Webb, 2014), Spanish for administration in Chile (Navarro, Förster, González & González-Pose, 2016), Thai (Santiboon, 2013), Turkish (Curebal, 2004), and Urdu for use in Pakistan (Ali, Mohsin & Iqbal, 2013). Some of these studies omit information about the translation of the instrument (e.g. Santiboon, 2013) or fail to provide enough information to judge the quality of the translation (e.g. Curebal, 2004). Other efforts relied on sub-standard translation practices (i.e. a single translator) without verification of the adaptation (e.g. Navarro et al., 2016; Webb, 2014). There have been some ambitious efforts, notably the ROSE Project (Schreiner & Sjøberg, 2004), that transparently detail the informed survey translation practices employed, including back-translation and piloting (Jenkins & Pell, 2006), but these are exceptions. Much of the extant literature discussing measures of students’ attitudes toward science has not adhered to the growing list of strategies and recommendations for increasing cross-cultural or sub-group comparability (e.g. Harkness, Van de Vijver & Mohler, 2003; Jowell, Roberts, Fitzgerald & Eva, 2007; Presser et al., 2004).
Even in the few cases where the translation practices employed appear to be aligned with recommendations from the literature, there are ancillary concerns about the small sample used to establish validity of the translated and/or modified instrument (e.g. Lowe, 2004; Webb, 2014) and, arguably, the application of appropriate, modern analyses to validate the measure with the target group of interest (e.g. Ali et al., 2013; Navarro et al., 2016).

Related Concerns About Scale Reliability

Concerns related to demonstrating cross-cultural validity include the superficial focus by many authors on scale reliability alone as an adequate indicator of instrument comparability (Amer et al., 2009). “An instrument cannot be valid unless it is reliable” (Tavakol & Dennick, 2011), but reliability is distinct from validity, which, as previously described, represents the extent to which an instrument measures what it is intended to measure. Lovelace and Brickman (2013) concisely define reliability as the “consistency or stability of a measure,” and note that it reflects “the extent to which an item, scale, test, etc., would provide consistent results if it were administered again under similar circumstances” (p. 611). Alpha was developed by Lee Cronbach in 1951 to provide a measure of the internal consistency of a test or scale (Cronbach, 1951), and it is still a widely used measure of reliability (Tavakol & Dennick, 2011). It is important to explicate that while Cronbach’s alpha is widely used, it can easily be misinterpreted or used in a way that yields an inaccurate value (Streiner & Norman, 1989). The formula used to compute alpha, and its behavior, depend not only on the magnitude of the correlations among items, but also on the number of items in the scale. Simply put, even poorly constructed instruments can provide acceptable alpha values if there are numerous items in each of the sub-scales, or if the sample size is very large (Cronbach, 1951). Instruments, or sub-scales, containing redundant, highly correlated items can also inflate alpha (Tavakol & Dennick, 2011). Discussions about the proper use and interpretation of alpha are provided by Cortina (1993), Nunnally and Bernstein (1994), and Schmitt (1996).
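To make alpha's dependence on item variances and item count concrete, the following minimal sketch computes Cronbach's alpha from a respondents-by-items score matrix. It is an illustration only and is not tied to the ASSASS data.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items in the scale
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Perfectly redundant items push alpha to its ceiling of 1.0,
# illustrating how redundancy can inflate the statistic.
redundant = [[1, 1], [2, 2], [3, 3], [4, 4]]
print(round(cronbach_alpha(redundant), 6))  # 1.0
```

Because the statistic is driven by the ratio of summed item variances to total-score variance, adding more (or more correlated) items raises it even when the scale's construct coverage is poor, which is the misuse warned about above.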

Lovelace and Brickman (2013) are careful to qualify that reliability is a measure of internal consistency when administered under similar circumstances. Tavakol and Dennick (2011, p. 53) elaborate by explaining “alpha is a property of the scores on a test from a specific sample of testees.” Complications in estimating alpha, along with improper use and/or poor interpretation, illustrate the potential for instrument comparisons solely on the basis of scale reliability to be misleading. This concern is exemplified by the work of Navarro et al. (2016), who surveyed 664 secondary school science students using the TOSRA translated into Spanish, defended the performance of the instrument, and compared their findings in Chile to studies conducted in Australia (Fraser, 1981) and New Zealand (Lowe, 2004). This is problematic, as a number of recent studies involving the TOSRA continue to follow this form, relying on alpha and scale structure as a defense of cross-cultural validity (e.g. Ali et al., 2013). The examples of ambiguity related to translation presented above, and the attendant concerns about methods used for determining validity in new contexts, underscore the need to investigate what practices are sufficient to ensure comparability across groups who respond to survey instruments by self-report. The comparability of measures across distinct groups of respondents is a key methodological issue, which has rarely been interrogated using empirical data, and will be investigated in the present study.

Challenges of Assessing Students’ Attitudes Toward Science in a Diverse Context

Concordant findings from multiple studies have called attention to an observable decrease in the interest of young people in pursuing science-related careers across the globe (Gokhale, Rabe-Hemp, Woeste & Machina, 2015; Osborne, Simon & Collins, 2003; Tomas & Ritchie, 2015; Tytler & Osborne, 2012). Researchers around the world concerned about the low numbers of students electing to pursue a college major in the sciences—such as Lyons (2006) and Osborne et al. (2003)—have worked diligently to systematically examine students’ attitudes and interests related to science. From these efforts, evidence has been presented suggesting students’ attitudes toward science are significantly differentiated according to individual factors, such as age (Pell & Jarvis, 2001) and gender (Brotman & Moore, 2008; Osborne et al., 2003), but also more broadly across factors like socio-economic status and cultural background (Brickhouse & Potter, 2001). The literature on survey performance in cross-cultural contexts stresses the necessity of considering context by emphasizing the possibility that observed differences in attitudinal data, between individuals or groups, may be the result of the measures and scales being used to collect such information (King, Murray, Salomon & Tandon, 2003; Watkins & Cheung, 1995).

Research Questions

In the present study, measured outcomes, such as the perceived value of science education and the importance of pursuing science-related careers, carry the underlying assumption that these distinct constructs are consistent for all of the cultural groups in Qatar, even though others have made arguments to the contrary (e.g. Amer et al., 2009). With a potential for differences between Qataris, and some Non-Qatari Arabs, compared to Non-Arabs residing in Qatar, coupled with methodological issues detailed in the preceding section, a systematic approach is necessary to verify the appropriateness of making simultaneous group comparisons. The present study details the application of one possible approach using data collected about students’ attitudes toward science, and related constructs, specifically addressing the following questions:
  1. Are the Arabic and English versions of the ASSASS instrument functionally similar given the practices used to develop the two versions of the instrument?

  2. Does the ASSASS instrument perform differently on the basis of cultural heritage as judged by comparing responses collected in the same language (Qatari vs. Non-Qatari Arab, Non-Qatari Arab vs. Non-Arab)?

  3. Does survey language alone impact the performance of the ASSASS instrument as determined by comparing responses collected from students with similar cultural heritage in their preferred language (English or Modern Standard Arabic)?




Participants from grades 3 through 12 completed the “Arabic Speaking Students’ Attitudes toward Science Survey” (ASSASS, which transliterates into “foundation” in Arabic) as part of the larger research project, QIAS. The project efforts started with a thorough review of the literature related to measuring precollege students’ attitudes toward science. This review did not produce any instruments that were adequate for the purposes of the QIAS project. The instruments that were uncovered had not been specifically developed and rigorously validated for the purpose of assessing attitudes toward science, and related factors, among Arabic-speaking students. Moreover, this review converged on a set of problems among nearly all of the existing (English language) instruments that limited their applicability for cross-sectional study designs. For example, many extant instruments were designed to assess student attitudes within specific grades or grade bands rather than across the elementary, middle, and high school grades.

ASSASS Development and Validation

A major undertaking of the QIAS project was the development of an instrument that would be appropriate for the aims of the project, could collect responses from students across a range of grades, and was anchored in a robust theoretical framework. The development and validation of the ASSASS proceeded in three phases. First, a ten-member international expert review panel helped establish the face validity of an initial pool of 60 ASSASS items, which comprised items derived from several extant attitude-toward-science instruments, as well as items developed by the authors. Second, the initial pool of items was piloted with a sample of Qatari students from the target schools and grade levels. Finally, statistical validation of the instrument and its underlying structure were based on data derived from a nationally representative sample of students in Qatar (Abd-El-Khalick et al., 2015). During this process, the reliability of the instrument across the broad age range of respondents was checked through comparisons of instrument performance with students at the younger end of the spectrum (grades 3 and 4) and the older students (grades 11 and 12), with no observable discrepancies detected (Borgers, Leeuw & Hox, 2000; Borgers, Sikkel & Hox, 2004; Kellett & Ding, 2004).

Translation into MSA

To be accessible to all students in Qatar—where the population includes Qatari Arabs, non-Qatari Arabs, and non-Arab residents—the ASSASS instrument was made available in both English and Modern Standard Arabic (MSA), the official language of teaching and learning in Qatar and Arab nations (Abd-El-Khalick et al., 2015). There are three important features of the translation process used to produce the ASSASS instrument in English and MSA that are worth emphasizing. These considerations, which correspond to the concerns raised by Harkness and Schoua-Glusberg (1998), relate to the timing of the translation, the flexibility of the source language, and the measures taken to ensure comprehension. The ASSASS instrument, as mentioned above, draws heavily from existing instruments designed in English for an English-speaking audience. It is important to note that translation into MSA was part of the initial plan and that translational issues were considered at multiple points in the development process. As such, many items were modified, or generated, by the instrument design team (see Abd-El-Khalick et al., 2015) and, thus, allowed for linguistic flexibility if warranted. Sperber et al. (1994) use the term decentering to refer to a situation that allows for an ongoing process of revision to occur during translation, leading to similar and culturally relevant instrument versions. In general, modifications related to translation or readability centered on one of two issues: words that did not translate in a meaningful manner or words inappropriate for a given context. These issues were addressed by bilingual members of the ASSASS design team, including team members who are familiar with the idiosyncrasies of Gulf Arabic and common colloquial language used in Qatar. Additionally, the members of the expert review panel, five of whom were bilingual, provided feedback on the survey translation.
This linguistic flexibility of the source language, in addition to supporting an appropriate translation, also made it possible to capture survey responses from a broad age range of respondents, particularly younger students, by adhering to common recommendations for language, length, and level of abstraction in survey items (Borgers, Leeuw & Hox, 2000; Kellett & Ding, 2004). To help ensure that Qatari students were comfortable with the items as translated, a sub-sample of students was asked to interpret a subset of ASSASS items following the survey administration during the pilot phase of the instrument development process, and reported no major concerns (Abd-El-Khalick et al., 2015).

Finalized Instrument

The ASSASS instrument (Abd-El-Khalick et al., 2015) comprised 32 item-statements. Using a 5-point Likert scale, each statement asked students to indicate a degree of agreement or preference with a number that ranged from “1” (i.e. strong disagreement or low preference) to “5” (i.e. strong agreement or high preference) with a rating of “3” indicating that students were not sure, or neutral, about their choice or preference. The instrument also contained a number of questions to solicit background and demographic information from students. Analysis of the large-scale administration data led to the refinement of a five-factor model, which included the following factors: attitudes toward science and science learning, unfavorable outlook toward science, control beliefs, behavioral beliefs about the benefits of science, and intentions to pursue or engage in science in the future. The ASSASS instrument final model, obtained through confirmatory factor analysis (CFA) and subsequent refinement, had a close fit as judged by a standardized root mean square residual (SRMR) of 0.037, a comparative fit index (CFI) of 0.937, and a Tucker-Lewis index (TLI) of 0.931 (Bentler & Bonett, 1980; Hu & Bentler, 1999). The five ASSASS factors or sub-scales are as follows: (1) “Attitude,” which comprised student attitudes toward science (e.g. “I really like science”) and toward school science learning (e.g. “I really enjoy science lessons”); (2) “Control beliefs,” which addressed respondents’ perceived ability and self-efficacy toward science learning (e.g. “I am sure I can do well on science tests”); (3) “Behavioral beliefs,” which pertain to beliefs about the consequences of engaging with science, including becoming a scientist (e.g. “Scientists do not have enough time for fun”) and beliefs about the social and personal utility of science (e.g. 
“We live in a better world because of science” and “Knowing science can help me make better choices about my health”); (4) “Unfavorable outlook” on science, which represented an amalgam of negative dispositions toward school science, perceived ability to learn science, and the personal and societal utility and contributions of science; and (5) “Intention,” which probed respondents’ intentions to pursue additional science studies (e.g. “I will study science if I get into a university”) or careers in science (e.g. “I will become a scientist in the future”) (see Abd-El-Khalick et al., 2015 for a detailed discussion).
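The scoring scheme just described, with ratings from 1 to 5 aggregated into five sub-scales, can be sketched as follows. The item groupings and item identifiers below are hypothetical placeholders, not the actual ASSASS item-factor assignments.

```python
# Illustrative sub-scale definitions (placeholder items, not ASSASS items).
subscales = {
    "attitude": ["q1", "q2"],
    "intention": ["q3", "q4"],
}

def score_respondent(responses, subscales):
    """responses: dict mapping item id -> rating on the 1-5 Likert scale.
    Returns the mean rating per sub-scale for one respondent."""
    return {
        name: sum(responses[item] for item in items) / len(items)
        for name, items in subscales.items()
    }

ratings = {"q1": 5, "q2": 4, "q3": 3, "q4": 2}
print(score_respondent(ratings, subscales))  # {'attitude': 4.5, 'intention': 2.5}
```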

Study Context

General concerns about low levels of scientific research and production in Arab countries (United Nations Development Programme, 2003) and related concerns about the dismal number of Arab students enrolling in scientific disciplines in higher education (The World Bank, 2008) have, in part, led Qatar to commit to strengthening its national science education pipeline. The Qatar National Vision 2030 (General Secretariat of Development Planning, 2010) stated that in order for Qatar to become a developed nation and move toward a knowledge-based economy, it is necessary to cultivate citizens capable of interacting with science, mathematics, and technology. To achieve this goal, educational changes in Qatar have been initiated that target both the K-12 and post-secondary levels. First, to rejuvenate the K-12 educational system in Qatar, the “Education for a New Era” reform was initiated (Zellman et al., 2007). As part of the reform, new precollege curriculum standards in Arabic, mathematics, science, and English were established for all grade levels. These new curriculum standards are comparable to the highest in the world, and the mathematics and science standards were published in Arabic and English to make them accessible to the largest group of educators (Brewer et al., 2007).

Participants: a Nationally Representative Sample

As part of the QIAS project, the ASSASS was administered to a nationally representative sample of students in grades 3 through 12 (Abd-El-Khalick et al., 2015). In order to draw a nationally representative sample, all schools registered with the Qatari Ministry of Education were contacted to request information about enrollments, including the number of class sections per grade level. A total of 194 schools (65%) provided the requested information, which was used to generate a database of 3241 class sections comprising all sections in grades 3 through 12 across all respondent schools and school types. Next, four sections per grade level (in grades 3 through 12) and school type were randomly selected from this database, resulting in a sample of 200 class sections. Responses to the ASSASS (Table 1) were collected from a total of 3027 students (51.2% female, 45.3% male, 3.4% unreported) in 144 sections (72% sectional response rate) from 79 different schools. Respondents were 31.4% Qatari, 33.2% non-Qatari Arabs, and 29.9% with “other” nationalities, while 5.5% of the respondents did not report their nationality. A total of 1978 respondents (65.3%) completed the survey in Arabic. Of those, 88.2% were Qatari and non-Qatari Arabs, and 7.4% were from other nationalities (6.5% unreported). Of the 1049 students who completed the survey in English, only 9.5% were Qatari and 14.4% non-Qatari Arabs.
Table 1

Representative sample of ASSASS respondents in Qatar (N = 3027)

[Table rows not recoverable from the source. The table reported respondent counts by survey language (Arabic, English, not reported), broken down by grade and school level, through a grand total. Notes: 1 = percent of grand total; 2 = percent of corresponding grade or school level.]
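The stratified draw described in the sampling section above (four randomly selected class sections per grade level and school type) can be sketched as follows. The record fields, the counts in the illustrative database, and the `draw_sections` helper are assumptions for illustration, not the project's actual database schema.

```python
import random
from collections import defaultdict

def draw_sections(sections, per_stratum=4, seed=0):
    """Randomly draw a fixed number of class sections per
    grade-by-school-type stratum, mirroring the QIAS design."""
    strata = defaultdict(list)
    for section in sections:
        strata[(section["grade"], section["school_type"])].append(section)
    rng = random.Random(seed)
    sample = []
    for key in sorted(strata):              # deterministic stratum order
        pool = strata[key]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample

# Illustrative database: grades 3-12 crossed with five hypothetical school
# types, six candidate sections per stratum -> 50 strata * 4 = 200 sections.
db = [{"grade": g, "school_type": t, "id": i}
      for g in range(3, 13) for t in "ABCDE" for i in range(6)]
print(len(draw_sections(db)))  # 200
```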

Data Analysis

The translation of the ASSASS, and related critical considerations, resulted in a favorable scenario in terms of making an instrument available for linguistically diverse populations (Harkness & Schoua-Glusberg, 1998). This study aims to examine the effectiveness of these methodological considerations and the resultant performance of different language versions using responses collected from the linguistically and culturally diverse students residing in Qatar. The research questions allow for the critical examination of the related issue of survey validation with respect to language of survey completion and cultural heritage in tandem (RQ1) and in isolation (RQ2 and RQ3). Specifically, it was important to determine whether the structure and causal relationships that were found in the Arabic version of the ASSASS would be maintained in the English version. To address these questions and to investigate whether or not the ASSASS instrument is valid for studying these different populations simultaneously, multi-group confirmatory factor analysis was employed. Multi-group CFA, akin to multi-group SEM (Wang & Wang, 2012), is designed to examine population heterogeneity and address questions of whether relationships hold across different groups or populations (p. 207). Multi-group CFA can be used to accurately test the invariance of measurement scales (Bollen, 1989; Hayduk, 1987; Sorbom, 1974), and this test is necessary to ensure that scale items measure the same constructs for all groups (Wang & Wang, 2012). Only if measurement invariance holds can findings of differences between groups be unambiguously interpreted (Horn & McArdle, 1992).

Before beginning this testing process, it is essential to establish for each group a baseline CFA model, one that is both parsimonious and theoretically meaningful, and then these baseline models are integrated into a multi-group CFA model (Wang & Wang, 2012). In the presentation of results and related discussion, the establishment of baseline CFA models is termed step 1. This application of CFA tests the fit of a hypothesized model to determine if the factorial structure is valid for the population (Byrne, 2006). However, in this case, the test for factorial validity of the measuring instrument is being applied to multiple versions of the same survey, completed by different groups of the sample. Using the multi-group CFA model, also known as a configural CFA model, the four levels of measurement invariance are tested stepwise in hierarchical fashion for each of the groups involved (Meredith, 1993; Widaman & Reise, 1997). Testing measurement invariance is a process that involves examining (a) invariance of patterns of factor loadings, (b) values of factor loadings, (c) item intercepts, and (d) error variances (Meredith, 1993; Widaman & Reise, 1997). For the purpose of this investigation, should the model fail at a given level, further tests are unwarranted (Wang & Wang, 2012). (Note there are cases of partial invariance, but they do not apply here [see Byrne, 2008].) The four parts of this process, identified as steps 2–5, start by examining if the number of factors, or constructs, and patterns of factor loadings, or clustering thereof, are the same across all groups. This process and associated implications for interpretation are summarized in Table 2.
Table 2

Overview of measurement invariance testing using multiple group CFA

Step 1. Establish baseline CFA models to compare with the multi-group CFA model. If baseline models cannot be created for the groups being compared, it is impossible to establish the multi-group CFA model required for further analysis.

Step 2. Examine invariance of patterns of factor loadings. Failure indicates that the compared groups respond in patterns resulting in a differing number, or dissimilar constitution, of factors.

Step 3. Examine values of individual factor loadings. Failure suggests that individual items contribute differently to their respective factor across groups.

Step 4. Examine individual item intercepts. Failure indicates that participants in at least one group systematically respond differently (e.g. higher or lower) when compared to the other group(s).

Step 5. Test for invariance of error variance values. Satisfying this highest level of scrutiny requires that the groups being compared demonstrate similar error variances.

For a more detailed discussion of step 1, see Wang and Wang (2012). For steps 2–5, refer to Meredith (1993) and Widaman and Reise (1997)
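The stop-on-failure sequence summarized in Table 2 can be sketched as follows. This is an illustrative outline only, not code from the study: the step labels use the conventional names for the four invariance levels (configural, metric, scalar, strict), and in practice each pass/fail entry would come from a model comparison (e.g. a scaled chi-square difference test) rather than a hard-coded value.

```python
# Steps 1-5 from Table 2, in testing order, using conventional labels
# for the four invariance levels.
STEPS = [
    "baseline models",        # step 1: per-group CFA models fit acceptably
    "configural invariance",  # step 2: same pattern of factor loadings
    "metric invariance",      # step 3: equal factor loading values
    "scalar invariance",      # step 4: equal item intercepts
    "strict invariance",      # step 5: equal error variances
]

def highest_level_achieved(passed):
    """Walk the hierarchy; once a level fails, further tests are unwarranted."""
    achieved = None
    for step in STEPS:
        if not passed.get(step, False):
            break
        achieved = step
    return achieved

# E.g. a comparison that satisfies steps 1-3 but fails at item intercepts:
result = highest_level_achieved({
    "baseline models": True,
    "configural invariance": True,
    "metric invariance": True,
    "scalar invariance": False,
})
# result == "metric invariance"
```

This mirrors the decision rule applied throughout the comparisons below: once a level fails, testing stops and the last satisfied level characterizes the comparability of the groups.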

For steps 3 through 5, the hierarchical steps of testing measurement invariance and structural invariance require that different restrictions be imposed on the specific models being compared. At each testing step, comparisons are made between restricted and unrestricted models. Step 3 tests the invariance of factor loadings across all groups by considering the strength of the relationship between individual items and their underlying factors. To investigate potential differences in factor loadings between two models, a scaled likelihood ratio test could be used; however, because the robust maximum likelihood (MLR) estimator was used in Mplus, the likelihood ratio test cannot be performed directly (Wang & Wang, 2012). A scaled difference in chi-square was instead computed using the equation below:
$$ {TR}_{\mathrm{d}}=\left({T}_0-{T}_1\right)/{c}_{\mathrm{d}} $$
The statistic TRd represents the scaled difference in chi-square values between the null (T0) and alternate (T1) models, and cd is the difference-test scaling correction. The scaling correction factor was obtained from Mplus for all warranted comparisons, calculated as represented below:
$$ {c}_{\mathrm{d}}=\left[\left({d}_0\ast {c}_0\right)-\left({d}_1\ast {c}_1\right)\right]/\left({d}_0-{d}_1\right) $$
In this equation, d0 and c0 are, respectively, the degrees of freedom and the scaling correction factor for the null model, and d1 and c1 are the same quantities for the configural model. Substituting the related values from the previous two equations yields:
$$ {TR}_{\mathrm{d}}=\left({T}_0-{T}_1\right)\left({d}_0-{d}_1\right)/\left[\left({d}_0\ast {c}_0\right)-\left({d}_1\ast {c}_1\right)\right] $$

The resultant likelihood ratio test can be used to determine whether two models (instrument versions and/or response groups in this case) differ significantly. Step 4 of the process examines item intercepts as an indicator of whether participants in at least one group tend to respond systematically higher or lower to the items in the scales used. Fulfilling the invariance tests to this point is required to make the case for measurement invariance across multiple groups. The final level of testing, step 5, looks for invariance in error variance, though it is important to note that many consider this level of scrutiny unnecessary (Bentler, 2005; Byrne, 2008).
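The scaled difference test defined by the equations above can be expressed as a short Python function. This is an illustrative sketch, not part of the study's Mplus workflow; the p-value comes from scipy's chi-square survival function.

```python
from scipy.stats import chi2

def scaled_chi2_diff(T0, T1, d0, d1, c0, c1):
    """Scaled chi-square difference test for nested models fit with a
    robust (MLR) estimator.

    T0, d0, c0: chi-square value, degrees of freedom, and scaling
    correction factor for the more restricted (null) model.
    T1, d1, c1: the same quantities for the less restricted
    (configural) model.
    Returns (TRd, df, p).
    """
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)  # difference-test scaling correction
    TRd = (T0 - T1) / cd                  # scaled difference statistic
    df = d0 - d1                          # degrees of freedom for the test
    p = chi2.sf(TRd, df)                  # upper-tail p-value
    return TRd, df, p
```

Plugging in the chi-square values, degrees of freedom, and scaling correction factors reported for any pair of nested models yields the scaled statistic and its significance in one call.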


Comparison 1: Arabic Versus English

In step 1, the Arabic and English versions of the ASSASS were found to have closely fitting models as judged by the fit statistics from the baseline CFAs computed (Table 3, models A and B). In step 2, the configural model, testing the two different language models together, resulted in an acceptable fit with a root mean square error of approximation (RMSEA) of 0.034, SRMR of 0.041, CFI of 0.933, and TLI of 0.927. Note that RMSEA and SRMR values < 0.06 indicate close approximate fit (Hu & Bentler, 1999), and CFI and TLI values > 0.9 indicate reasonably good fit (Bentler & Bonett, 1980). To determine whether the factor loadings were the same for the Arabic- and English-language models, it was necessary to perform a scaled likelihood ratio test in step 3 (see endnote 1). A scaled difference in chi-square was computed as described in the “Data Analysis” section using the scaling correction factor for MLR, 1.23, inserting the relevant values:
$$ {TR}_{\mathrm{d}}=\left(2576.62-2492.78\right)\left(935-908\right)/\left[\left(935\ast 1.23\right)-\left(908\ast 1.23\right)\right]=68.16 $$
Table 3

Baseline CFA models for Arabic and English versions of ASSASS for group sub-samples

[Table body not preserved in this version. For each baseline model (A–F), the table reports the survey language, respondent group, fit statistics (RMSEA, SRMR, CFI, TLI), and N:parameter ratio.]

Groups abbreviated Qatari (Q), Non-Qatari Arab (NQA), and Non-Arab (NA). Fit judged by root mean square error of approximation (RMSEA) and standardized root mean square residual (SRMR) values < 0.06 indicating close approximate fit (Hu & Bentler, 1999). Comparative fit index (CFI) and Tucker-Lewis index (TLI) values > 0.9 indicate reasonably good fit (Bentler & Bonett, 1980)

Considering the difference in degrees of freedom (df = 935 − 908 = 27), the resultant likelihood ratio test revealed that the factor loadings of the Arabic- and English-language instruments differed significantly (p < 0.001). Thus, we conclude that the comparison of the Arabic and English versions of the ASSASS did not satisfy the conditions of step 3. Although steps 1 and 2 were satisfied in the analysis, failing step 3 indicates that individual survey items contribute differently, to a statistically significant degree, to their respective sub-scales across the two language versions, as revealed by comparing MSA and English responses. Note that data collected from both language versions could still be modeled together in an acceptable configural model, as previously presented. Following the tradition of Schreiber, Nora, Stage, Barlow and King (2006), the power of the study was evaluated by calculating the ratio of sample size to number of free parameters. For responses collected in MSA from Qatari and Non-Qatari Arab students (n = 1978), the number of estimated parameters was 106. The N:parameter ratio was 19, exceeding the general threshold for sample size requirements (i.e. 10), indicating that the sample size was adequate.
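The arithmetic for this comparison can be verified directly with the values reported above (an illustrative snippet, using scipy for the p-value):

```python
from scipy.stats import chi2

# Comparison 1: Arabic vs. English versions of the ASSASS
T0, T1 = 2576.62, 2492.78   # null and configural model chi-squares
d0, d1 = 935, 908           # corresponding degrees of freedom
c0 = c1 = 1.23              # MLR scaling correction factors

cd = (d0 * c0 - d1 * c1) / (d0 - d1)   # reduces to 1.23 here since c0 == c1
TRd = (T0 - T1) / cd                   # ~68.16
p = chi2.sf(TRd, d0 - d1)              # df = 27; p < 0.001

# Power check following Schreiber et al. (2006): sample size per free parameter
ratio = 1978 / 106                     # ~18.7, reported as 19; exceeds 10
```

The scaled statistic lands far in the upper tail of a chi-square distribution with 27 degrees of freedom, confirming the reported significance.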

Comparison 2: Arabic Version ASSASS, Qatari Versus Non-Qatari Arab Responses

Step 1 for comparing the responses collected from Qatari and Non-Qatari Arabs on the Arabic version of the instrument yielded closely fitting baseline CFA models for each group of respondents (Table 3, models C and D; see endnote 2). Continuing to step 2 in the analysis, the configural model for testing the two different cultures within the same language of survey completion resulted in an acceptable fit with an RMSEA of 0.034, SRMR of 0.040, CFI of 0.936, and TLI of 0.930. Following the same procedure detailed above for calculating the scaled likelihood statistic in step 3, with scaling correction factors of 1.219 (c0) and 1.224 (c1) for the respective models under the MLR estimator, the factor loadings did not differ significantly between the two groups (p = 0.515). Despite the similarity of factor loadings found in step 3, comparisons involving the configural model for these two groups revealed in step 4 that the item intercepts did significantly differ (p < 0.001). We conclude that the Arabic version of the ASSASS, used with Qatari and Non-Qatari Arabs, did not fulfill the conditions of step 4. By satisfying steps 1 through 3, the overall pattern of item loadings is maintained on each of the established factors. Analysis of responses to the Arabic version of the instrument, at step 4, highlighted that one group of students, either Qatari or Non-Qatari Arabs in this case, responds systematically higher or lower to at least some items on the ASSASS. Generally, it is expected that individual item performance may differ across survey administrations, with some variability in factor loadings, but the pattern of factor loadings should be consistent. Comparing Qatari and Non-Qatari Arab responses seems possible, given the acceptable fit of the configural model and the satisfaction of multi-group CFA requirements through step 3.

Comparison 3: English Version ASSASS, Non-Arab Versus Non-Qatari Arab Responses

Step 1 for comparing the responses collected from Non-Arab students on the English version of the instrument yielded a marginally fitting baseline CFA model (Table 3, model E). The baseline CFA model for Non-Qatari Arabs completing the English version of the ASSASS had an inadequate fit, as indicated by CFI and TLI values below the 0.9 threshold (Table 3, model F). Without satisfactory baseline CFA models, a configural model could not be constructed, thus stopping the comparison at step 1. In this case, it is plausible that the comparatively small sample size of Non-Qatari Arabs who responded to the English version of the ASSASS was responsible for the inability to establish an adequate baseline CFA model, as indicated by the small N:parameter ratio in Table 3. It is important to highlight that the English version of the ASSASS has potential for use with Non-Arab students, as evidenced by the following fit indices: RMSEA of 0.040, SRMR of 0.051, CFI of 0.901, and TLI of 0.892, even if the present study does not extend the comparability of these data to other groups.
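The fit cutoffs applied throughout (RMSEA and SRMR < 0.06 for close approximate fit per Hu & Bentler, 1999; CFI and TLI > 0.9 for reasonably good fit per Bentler & Bonett, 1980) can be wrapped in a small helper. This function is an illustration, not part of the study's analysis:

```python
def judge_fit(rmsea, srmr, cfi, tli):
    """Apply the Hu & Bentler (1999) and Bentler & Bonett (1980) cutoffs."""
    return {
        "close_approximate_fit": rmsea < 0.06 and srmr < 0.06,
        "reasonably_good_fit": cfi > 0.9 and tli > 0.9,
    }

# Model E (Non-Arab students, English ASSASS): the TLI of 0.892 falls just
# below the 0.9 threshold, consistent with a "marginal" fit.
print(judge_fit(rmsea=0.040, srmr=0.051, cfi=0.901, tli=0.892))
# → {'close_approximate_fit': True, 'reasonably_good_fit': False}
```

Framed this way, model E satisfies the residual-based criteria but narrowly misses the incremental-fit criteria, which is why it is described as marginal rather than inadequate.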

Comparison 4: Non-Qatari (Arabic and English Surveys)

Similar to the situation in the previous comparison, efforts to compare Non-Qatari Arabs across both survey languages were halted early. Although a baseline CFA model had already been satisfactorily established for Non-Qatari Arabs who completed the Arabic version of the instrument, the model fit for this group on the English version remained inadequate (Table 3, models D and F). This comparison could not be completed due to complications at step 1. Again, as discussed in reference to the previous comparison, it seems that the small sample size of Non-Qatari Arabs who responded to the English version of the ASSASS was detrimental to the formation of a baseline CFA model, as indicated by the small N:parameter ratio in Table 3.


The present study is unique because it allowed for a structured comparison of instrument performance on the basis of language and culture. The ASSASS instrument used to collect responses from students is distinguished from many prior instruments—and adaptations thereof—because both language versions were developed simultaneously by a research team composed of bilingual experts familiar with the Qatari context. Multi-group confirmatory factor analysis was used to investigate whether the ASSASS instrument is valid for studying Qatari, Non-Qatari Arab, and Non-Arab students simultaneously in two different languages. The five-step process applied in this study to test measurement invariance (Meredith, 1993; Widaman & Reise, 1997), with a particular focus on identified instrument factors, is considerably more rigorous than the comparisons of scale reliabilities made by authors of past publications (e.g. Amer et al., 2009). It is important to note, as a springboard to open a dialog about the level of rigor expected for survey translation in science education research, that fulfilling steps 1 and 2 during the data analysis already exceeds the standards of previously published survey translations.

In this study the comparison of ASSASS instruments on the basis of language, Arabic versus English, and the comparison of groups who responded to the Arabic version, Qatari and Non-Qatari Arabs, both satisfied the criteria for step 2. Examining student responses to the Arabic version of the ASSASS, comparing Qatari and Non-Qatari Arab respondents, revealed a greater similarity in instrument performance across the distinct cultural groups as evidenced by the successful fulfillment of step 3. These results indicate that the ASSASS generates valid, reliable, and similarly interpretable results when used to compare students who completed the survey in the same language. From a previous study examining key predictor variables of students’ scores on the Arabic version of the ASSASS, a general pattern of Non-Qatari Arabs harboring more positive attitudes toward science compared to Qatari Arabs was observed in a multiple indicators multiple causes (MIMIC) model (Said et al., 2016). A MIMIC model is appropriate for examining continuous variables (e.g. age) and capable of examining non-invariance in factor means, but it cannot investigate systematic issues related to non-invariance to the same degree as multi-group CFA. Multi-group CFA also enables testing of non-invariance in all the measurement parameters and structural parameters (Wang & Wang, 2012). The selected methods and applications shed new light on this previous work. Methodologists note that certain observed differences at the interpersonal or sub-group level in cross-cultural survey investigations could be an artifact of the Likert scale measurement (Chen, Lee & Stevenson, 1995; Poortinga, 1989; Van de Vijver & Leung, 1997). 
Given the results of the present study regarding the performance of the MSA version of the survey with Qatari and Non-Qatari Arabs, and considering their similar cultural and linguistic heritage, it seems plausible that societal factors, or other identifiable variables, actually account for the differences reported by those authors. Still, any systematic variation in student responses between sub-groups likely warrants further investigation—both to ensure the reliability of the instrument and to progress toward the overall goal of understanding, and improving, all Qatari students’ attitudes toward science.

Efforts to compare groups of respondents within the English-language survey were largely inconclusive. The nationally representative sample included in the dataset was random at the class (or section) level. Individual classroom teachers, taking into account the normal language of instruction and the atmosphere of the class, were allowed to select the language of the survey administered. There was no intervention on the part of the researchers to ensure equity in group size; instead, the focus was placed on obtaining reliable responses by allowing students to complete the survey in a familiar language, as suggested by Harkness and Schoua-Glusberg (1998). Although the size of the Non-Qatari Arab group on the English version could be judged sufficient for validation purposes by established norms (e.g. a subject-to-variable ratio of 2 [Kline, 1979, p. 40]), it was still smaller than any of the other individual groups. With this limitation, it cannot be determined whether students’ comprehension of the English language, or other cultural differences that coincided with their presence in a class that elected to complete the survey in English, contributed to the inadequate model fit. An alternative explanation, considering that the model fit for Non-Qatari Arabs on the English version of the ASSASS was far poorer than that for Non-Qatari Arabs on the Arabic version, is that some students might have felt compelled to complete the survey in English. Given that the language of instruction can vary according to school type in Qatar (Zellman et al., 2009), it is possible their choice was influenced by their learning environment. Even for students who regularly learn in English, Mourtaga (2004) notes that Arab students learning English as a second language face many problems with reading and comprehension. Beaton and colleagues (2000) reason that inexperienced participants in a multi-linguistic setting may require far more cross-cultural adaptations.

Because flaws in translation are difficult to detect, creating instances where erroneous conclusions can be drawn due to semantic inconsistencies rather than cultural differences (Sperber et al., 1994), there is a great need for guidelines to inform survey translation and the determination of validity. Findings from the present study indicate that, across the survey languages and groups examined, only the Qatari and Non-Qatari Arab respondents to the Arabic ASSASS can be considered comparable. It could be argued that the protocol employed in the present study is excessive, or even unrealistic. We recognize that the procedures and considerations articulated in this study are not appropriate, or even feasible, for every study incorporating surveys in its design. Still, the naïve statistical procedures used to defend the translation of other attitudinal measures based on factor structure (e.g. Gencer & Cakiroglu, 2007) or scale reliability (e.g. Fraser, Aldridge & Adolphe, 2010; Navarro et al., 2016; Telli, 2006) are concerning, especially in the latter case because large studies generally yield good reliability values.

Conclusions and Recommendations

Progressing as an interconnected global community offers an unprecedented opportunity to investigate questions, constructs, and variables of a related nature in many unique settings. As responsible researchers in the social sciences, we are tasked with making fair comparisons, drawing meaningful and defensible claims and recommendations, and disseminating results with confidence and clarity. When planning to conduct survey research between distinct groups, be it on students’ attitudes toward science or in any number of other domains, it must be established that these groups can be meaningfully compared. Some limitations of such comparisons, with respect to students’ attitudes toward science, have been noted, by Shrigley (1990) for example, but the temptation to make cross-cultural comparisons is, and continues to be, great. Other methodologies (e.g. open-ended questionnaires) have a more robust body of literature pertaining to translation and cross-cultural validity issues, but guidelines for survey research are less ubiquitous. To that point, consider that the efforts reported here represent an earnest attempt to navigate the methodological pitfalls associated with translation, taking a number of established considerations into account (Harkness & Schoua-Glusberg, 1998). It is the recommendation of the authors that future efforts report clear details about the translation process and prioritize establishing validity in the context(s) of data collection. We have demonstrated how a systematic method using multi-group CFA can be applied to help make defensible decisions regarding the comparison of groups. Following the example provided for using the ASSASS in Qatar, the next steps in this process would be to further investigate and make judgements (e.g. modify or delete) about misfitting items to improve model fit, in pursuit of equivalent survey performance to support valid cross-cultural investigations (see Squires et al., 2013).


  1. Syntax used to generate the steps involved in comparison 1 is available as a supplement.

  2. Although the sample of Non-Qatari Arabs (NQA) who responded to the MSA version was slightly underpowered, the model still demonstrated a close-fitting baseline CFA.

Supplementary material

ESM 1: 10763_2018_9889_MOESM1_ESM.docx (DOCX 20 kb)


  1. Abd-El-Khalick, F., Summers, R., Said, Z., Wang, S., & Culbertson, M. (2015). Development and large-scale validation of an instrument to assess Arabic-speaking students’ attitudes toward science. International Journal of Science Education, 37(16), 2637–2663.
  2. Adolphe, F. (2002). A cross-national study of classroom environment and attitudes among junior secondary science students in Australia and in Indonesia (Doctoral dissertation). Curtin University, Australia.
  3. Ali, M. S., Mohsin, M. N., & Iqbal, M. Z. (2013). The discriminant validity and reliability for Urdu version of Test of Science-Related Attitudes (TOSRA). International Journal of Humanities and Social Science, 3(2), 29–39.
  4. Amer, S. R., Ingels, S. J., & Mohammed, A. (2009). Validity of borrowed questionnaire items: A cross-cultural perspective. International Journal of Public Opinion Research, 21(3), 368–375.
  5. Beaton, D. E., Bombardier, C., Guillemin, F., & Ferraz, M. B. (2000). Guidelines for the process of cross-cultural adaptation of self-report measures. Spine, 25(24), 3186–3191.
  6. Bentler, P. M. (2005). EQS 6.1: Structural equations program manual. Encino, CA: Multivariate Software, Inc.
  7. Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88(3), 588–606.
  8. Blalock, C. L., Lichtenstein, M. J., Owen, S., Pruski, L., Marshall, C., & Topperwein, M. (2008). In pursuit of validity: A comprehensive review of science attitude instruments. International Journal of Science Education, 30, 961–977.
  9. Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: Wiley & Sons, Inc.
  10. Brewer, D. J., Augustine, C. H., Zellman, G. L., Ryan, G. W., Goldman, C. A., Stasz, C., & Constant, L. (2007). Education for a new era: Design and implementation of K–12 education reform in Qatar. Retrieved from MG548.pdf.
  11. Boone, W. J., Townsend, J. S., & Staver, J. (2011). Using Rasch theory to guide the practice of survey development and survey data analysis in science education and to inform science reform efforts: An exemplar utilizing STEBI self-efficacy data. Science Education, 95, 258–280.
  12. Borgers, N., De Leeuw, E., & Hox, J. (2000). Children as respondents in survey research: Cognitive development and response quality. Bulletin of Sociological Methodology, 66(1), 60–75.
  13. Borgers, N., Sikkel, D., & Hox, J. (2004). Response effects in surveys on children and adolescents: The effect of number of response options, negative wording, and neutral mid-point. Quality and Quantity, 38(1), 17–33.
  14. Brickhouse, N. W., & Potter, J. T. (2001). Young women’s scientific identity formation in an urban context. Journal of Research in Science Teaching, 38, 965–980.
  15. Brislin, R. W., Lonner, W. J., & Thorndike, R. M. (1973). Cross-cultural research methods. New York, NY: Wiley.
  16. Brotman, J., & Moore, F. (2008). Girls and science: A review of four themes in the science education literature. Journal of Research in Science Teaching, 45(9), 971–1002.
  17. Byrne, B. M. (2006). Structural equation modeling with EQS: Basic concepts, applications, and programming (2nd ed.). Mahwah, NJ: Erlbaum.
  18. Byrne, B. M. (2008). Testing for multigroup equivalence of a measuring instrument: A walk through the process. Psicothema, 20(4), 872–882.
  19. Campbell, A. A., & Katona, G. (1953). The sample survey: A technique for social science research. In L. Festinger & D. Katz (Eds.), Research methods in the behavioral sciences (pp. 14–55). New York, NY: Dryden.
  20. Central Intelligence Agency (2013). Qatar. In The world factbook. Retrieved from
  21. Chen, C., Lee, S. Y., & Stevenson, H. W. (1995). Response style and cross-cultural comparisons of rating scales among East Asian and North American students. Psychological Science, 6(3), 170–175.
  22. Cortina, J. (1993). What is coefficient alpha: An examination of theory and applications. Journal of Applied Psychology, 78, 98–104.
  23. Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
  24. Curebal, F. (2004). Gifted students’ attitudes towards science and classroom environment based on gender and grade level (Unpublished master’s thesis). Graduate School of Natural and Applied Sciences, Middle East Technical University, Ankara, Turkey.
  25. Fraser, B. (1981). Test of science related attitudes. Melbourne, Australia: Australian Council for Educational Research.
  26. Fraser, B., Aldridge, J. M., & Adolphe, F. S. G. (2010). A cross-national study of secondary science classroom environments in Australia and Indonesia. Research in Science Education, 40, 551–571.
  27. Gall, M. D., Borg, W. R., & Gall, J. P. (1996). Educational research: An introduction. White Plains, NY: Longman.
  28. Gencer, A. S., & Cakiroglu, J. (2007). Turkish preservice science teachers’ efficacy beliefs regarding science teaching and their beliefs about classroom management. Teaching and Teacher Education, 23(5), 664–675.
  29. General Secretariat for Development Planning (2010). Qatar national vision 2030. Doha, Qatar: Authors.
  30. Gokhale, A., Rabe-Hemp, C., Woeste, L., & Machina, K. (2015). Gender differences in attitudes toward science and technology among majors. Journal of Science Education and Technology, 24(4), 509–516.
  31. Guillemin, F., Bombardier, C., & Beaton, D. (1993). Cross-cultural adaptation of health related quality of life measures: Literature review and proposed guidelines. Journal of Clinical Epidemiology, 46, 1417–1432.
  32. Harkness, J. A., & Schoua-Glusberg, A. (1998). Questionnaires in translation. ZUMA-Nachrichten Spezial, 3(1), 87–127.
  33. Harkness, J. A., Van de Vijver, F. J., & Mohler, P. P. (2003). Cross-cultural survey methods. Hoboken, NJ: Wiley-Interscience.
  34. Hayduk, L. A. (1987). Structural equation modeling with LISREL: Essentials and advances. Baltimore, MD: The Johns Hopkins University Press.
  35. Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18, 117–144.
  36. Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.
  37. Jenkins, E. W., & Pell, G. (2006). The Relevance of Science Education Project (ROSE) in England: A summary of findings. Leeds, England: University of Leeds Centre for Studies in Science and Mathematics Education.
  38. Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (Eds.). (2007). Measuring attitudes cross-nationally: Lessons from the European Social Survey. London, England: Sage.
  39. Kellett, M., & Ding, S. (2004). Middle childhood. In S. Fraser, V. Lewis, S. Ding, M. Kellett, & C. Robinson (Eds.), Doing research with children and young people (pp. 161–174). London, England: Sage.
  40. Khalili, K. Y. (1987). A cross-cultural validation of a test of science related attitudes. Journal of Research in Science Teaching, 24(2), 127–136.
  41. King, G., Murray, C. J., Salomon, J. A., & Tandon, A. (2003). Enhancing the validity and cross-cultural comparability of measurement in survey research. American Political Science Review, 97(04), 567–583.
  42. Kline, P. (1979). Psychometrics and psychology. London, England: Academic Press.
  43. Liaghatdar, M. J., Soltani, A., & Abedi, A. (2011). A validity study of attitudes toward science scale among Iranian secondary school students. International Education Studies, 4(4), 36–46.
  44. Lovelace, M., & Brickman, P. (2013). Best practices for measuring students’ attitudes toward learning science. CBE-Life Sciences Education, 12(4), 606–617.
  45. Lowe, J. P. (2004). The effect of a cooperative group work and assessment on the attitudes of students towards science in New Zealand (Unpublished doctoral dissertation). Curtin University of Technology, Curtin, Australia.
  46. Lyons, T. (2006). Different countries, same science classes: Students’ experiences of school science in their own words. International Journal of Science Education, 28, 591–613.
  47. McKay, R. B., Breslow, M. J., Sangster, R. L., Gabbard, S. M., Reynolds, R. W., Nakamoto, J. M., & Tarnai, J. (1996). Translating survey questionnaires: Lessons learned. New Directions for Evaluation, 70, 93–104.
  48. Meredith, W. (1993). Measurement invariance, factor analysis, and factorial invariance. Psychometrika, 58, 525–542.
  49. Mourtaga, K. (2004). Investigating writing problems among Palestinian students: Studying English as a foreign language. Bloomington, IN: Author House.
  50. Navarro, M., Förster, C., González, C., & González-Pose, P. (2016). Attitudes toward science: Measurement and psychometric properties of the Test of Science-Related Attitudes for its use in Spanish-speaking classrooms. International Journal of Science Education, 38(9), 1459–1482.
  51. Nunnally, J., & Bernstein, L. (1994). Psychometric theory. New York, NY: McGraw-Hill, Inc.
  52. Osborne, J., Simon, S., & Collins, S. (2003). Attitude towards science: A review of the literature and its implications. International Journal of Science Education, 25(9), 1049–1079.
  53. Pell, T., & Jarvis, T. (2001). Developing attitude to science scales for use with children of ages from five to eleven years. International Journal in Science Education, 23, 847–862.
  54. Poortinga, Y. H. (1989). Equivalence of cross-cultural data: An overview of basic issues. International Journal of Psychology, 24(6), 737–756.
  55. Potvin, P., & Hasni, A. (2014). Interest, motivation and attitude towards science and technology at K-12 levels: A systematic review of 12 years of educational research. Studies in Science Education, 50(1), 85–129.
  56. Presser, S., Couper, M. P., Lessler, J. T., Martin, E., Martin, J., Rothgeb, J. M., & Singer, E. (2004). Methods for testing and evaluating survey questions. Public Opinion Quarterly, 68(1), 109–130.
  57. Qatar Foundation (2009). Science and research. Retrieved December 6, 2009 from
  58. Rashed, R. (2003). Report on ROSE project in Egypt. Retrieved from
  59. Said, Z., Summers, R., Abd-El-Khalick, F., & Wang, S. (2016). Attitudes toward science among grades 3 through 12 Arab students in Qatar: Findings from a cross-sectional national study. International Journal of Science Education, 38(4), 621–643.
  60. Santiboon, T. (2013). School environments inventory in primary education in Thailand. Merit Research Journal of Education and Review, 1(10), 250–258.
  61. Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350–353.
  62. Schreiber, J. B., Nora, A., Stage, F. K., Barlow, E. A., & King, J. (2006). Reporting structural equation modeling and confirmatory factor analysis results: A review. The Journal of Educational Research, 99(6), 323–338.
  63. Schreiner, C., & Sjøberg, S. (2004). Sowing the seeds of ROSE: Background, rationale, questionnaire development and data collection for ROSE (the Relevance of Science Education)—A comparative study of students’ views of science and science education. Oslo, Norway: University of Oslo, Department of Teacher Education and School Development.
  64. Sechrest, L., Fay, T. L., & Zaidi, S. H. (1972). Problems of translation in cross-cultural research. Journal of Cross-Cultural Psychology, 3(1), 41–56.
  65. Shrigley, R. L. (1990). Attitude and behavior correlates. Journal of Research in Science Teaching, 27, 97–113.
  66. Sorbom, D. (1974). A general method for studying differences in factor means and factor structure between groups. British Journal of Mathematical and Statistical Psychology, 27, 229–239.
  67. Sperber, A. D., Devellis, R. F., & Boehlecke, B. (1994). Cross-cultural translation. Journal of Cross-Cultural Psychology, 25(4), 501–524.
  68. Squires, A., Aiken, L. H., van den Heede, K., Sermeus, W., Bruyneel, L., Lindqvist, R., . . . Matthews, A. (2013). A systematic survey instrument translation process for multi-country, comparative health workforce studies. International Journal of Nursing Studies, 50(2), 264–273.
  69. Streiner, D. L., & Norman, G. R. (1989). Health measurement scales: A practical guide to their development and use. New York, NY: Oxford University.
  70. Tavakol, M., & Dennick, R. (2011). Making sense of Cronbach's alpha. International Journal of Medical Education, 2, 53–55.
  71. Telli, S. (2006). Students’ perceptions of their science teachers’ interpersonal behaviour in two countries: Turkey and the Netherlands (Unpublished doctoral dissertation). The Graduate School of Natural and Applied Sciences, Middle East Technical University, Ankara, Turkey.
  72. The World Bank. (2008). The road not traveled: Education reform in the Middle East and North Africa. Washington, DC: Author.
  73. Tomas, L., & Ritchie, S. (2015). The challenge of evaluating students’ scientific literacy in a writing-to-learn context. Research in Science Education, 45(1), 41–58.
  74. Turkmen, L., & Bonnstetter, R. (1999). A study of Turkish preservice science teachers’ attitudes toward science and science teaching. Retrieved from ERIC database. (ED444828).
  75. Tytler, R., & Osborne, J. (2012). Student attitudes and aspirations towards science. In B. Fraser, K. Tobin, & C. J. McRobbie (Eds.), Second international handbook of science education (pp. 597–625). Dordrecht, Netherlands: Springer.
  76. United Nations Development Programme. (2003). The Arab human development report: Building a knowledge society. New York, NY: UNDP Regional Program and Arab Fund for Economic and Social Development.
  77. Van de Vijver, F., & Leung, K. (1997). Methods and data analysis of comparative research. In J. W. Berry, Y. H. Poortinga, & J. Pandey (Eds.), Handbook of cross-cultural psychology: Theory and method (Vol. 1, 2nd ed., pp. 257–300). Needham Heights, MA: Allyn & Bacon.
  78. Vaske, J. J. (2008). Survey research and analysis: Applications in parks, recreation and human dimensions. State College, PA: Venture Publishing.
  79. Wang, J., & Wang, X. (2012). Structural equation modeling: Applications using Mplus. Hoboken, NJ: John Wiley & Sons, Inc..CrossRefGoogle Scholar
  80. Watkins, D., & Cheung, S. (1995). Culture, gender, and response bias: An analysis of responses to the self-description questionnaire. Journal of Cross-Cultural Psychology, 26(5), 490–504.CrossRefGoogle Scholar
  81. Webb, A. (2014). A cross-cultural analysis of the Test of Science Related Attitudes (Master’s thesis). The Pennsylvania State University, Pennsylvania, EE.UU. Retrieved from
  82. Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance abuse domain. In K. J. Bryant & M. Windle (Eds.), The science of prevention: Methodological advance from alcohol and substance abuse research (pp. 281–324). Washington, DC: American Psychological Association.CrossRefGoogle Scholar
  83. Zellman, G. L., Ryan, G. W., Karam, R., Constant, L., Salem, H., Gonzalez, G., . . . Al-Obaidli, K. (2007). Implementation of the K-12 education reform in Qatar’s schools. Santa Monica, CA: RAND Corporation.Google Scholar
  84. Zellman, G. L., Ryan, G. W., Karam, R., Constant, L., Salem, H., Gonzalez, G., . . . Al-Obaidli, K. (2009). Implementation of the K-12 education reform in Qatar’s schools. Santa Monica, CA: RAND Corporation. Retrieved from MG880.pdf.

Copyright information

© Ministry of Science and Technology, Taiwan 2018

Authors and Affiliations

  1. Department of Teaching and Learning, University of North Dakota, Grand Forks, USA
  2. SRI International, Washington, DC, USA
  3. School of Education, University of North Carolina at Chapel Hill, Chapel Hill, USA
  4. School of Engineering Technology, College of the North Atlantic, Doha, Qatar