Multiple Group IRT Measurement Invariance Analysis of the Forms of Self-Criticising/Attacking and Self-Reassuring Scale in Thirteen International Samples

The purpose of this study was to examine the measurement invariance of the Forms of Self-Criticising/Attacking & Self-Reassuring Scale (FSCRS) in terms of Item Response Theory differential test functioning in thirteen distinct samples (N = 7714) from twelve different countries. We assessed differential test functioning for the three FSCRS subscales, Inadequate-Self, Hated-Self and Reassured-Self separately. 32 of the 78 pairwise comparisons between samples for Inadequate-Self, 42 of the 78 pairwise comparisons for Reassured-Self and 54 of the 78 pairwise comparisons for Hated-Self demonstrated no differential test functioning, i.e. measurement invariance. Hated-Self was the most invariant of the three subscales, suggesting that self-hatred is similarly perceived across different cultures. Nonetheless, all three subscales of FSCRS are sensitive to cross-cultural differences. Considering the possible cultural and linguistic differences in the expression of self-criticism and self-reassurance, future analyses of the meanings and connotations of these constructs across the world are necessary in order to develop or tailor a scale which allows cross-cultural comparisons of various treatment outcomes related to self-criticism.


Introduction
Excessive self-criticism is a personality vulnerability factor that can cause and sustain various psychological difficulties and disorders (e.g. Blatt and Shichman 1983; Blatt 2004; Falconer et al. 2015; Shahar et al. 2012). Blatt recognized the clinical significance of self-criticism and his work has influenced current understandings of the forms and impacts of self-criticism on mental health (Blatt et al. 1979). Self-criticism is generally viewed as a relatively stable and intractable personality style (Zuroff et al. 2004). Zuroff et al. (2016) demonstrated that self-criticism displays both trait-like stability over time and a degree of variability over time reflecting state influences. In addition, self-criticism is responsible for poor response to psychological treatment (Blatt et al. 1995; Blatt and Zuroff 2005; Horvath and Symonds 1991; Stinckens et al. 2013a, b). Several authors suggest there is a need for closer examination of self-criticism across cultures (Lau et al. 2010; Luyten and Blatt 2013). Developing a thorough understanding of self-criticism and designing sensitive tools to measure it will provide methods to evaluate tailored interventions across cultures.

Self-Criticism and Culture
Although the concept of mutual definition, influence and constitution between culture and self is old and pervasive (Kitayama 2016), research on cross-cultural comparisons of self-criticism is scarce. The majority of cross-cultural research operates within the self-construal theory of Markus and Kitayama (1991). A comparison of Western and Eastern conceptualizations of the self revealed that Western cultures tend to have independent self-construal because they construe the self as separate from its social context, emphasizing autonomy and independence. Importantly, it has been suggested that individuals in Western cultures focus on their abilities, traits, and needs, and they tend to prioritise their individual goals over those of in-groups. In contrast, individuals in Eastern cultures tend to have interdependent self-construal as they usually construe the self as an integral part of a broader social context and their concept of the self involves characteristics of their social environment. They are also suggested to have a sense of connectedness with others and to focus on their role in in-groups, while prioritising group goals over individual goals (Markus and Kitayama 1991).
According to Heine and Hamamura (2007), independent cultures might facilitate more self-enhancement by promoting a focus on inner attributes, while among interdependent cultures, reflection on the same inner attributes may foster self-criticism. A meta-analysis supports this view, with a large cross-cultural effect (d = .84) between East Asians and Westerners (Heine and Hamamura 2007). This cross-cultural difference was most prominent between the USA and Japan, with North Americans presenting as more self-enhancing whilst the Japanese as more self-critical (Heine et al. 2000). These findings are supplemented by research on topics explaining specific conditions in which the self-construal paradigm works. According to Kitayama et al. (1997), self-enhancement is defined as a general sensitivity to positive self-relevant information and self-criticism as a general sensitivity to negative self-relevant information. However, this definition of self-criticism is broad and vague. In comparison, a more precise definition of self-criticism has been offered by Blatt and Zuroff (1992) who characterized it as constant and harsh self-scrutiny and evaluation and feelings of unworthiness, inferiority, failure, and guilt.
The majority of cross-cultural research uses the definition of self-criticism provided by Kitayama et al. (1997), while studies examining psychopathology generally use the definition offered by Blatt and Zuroff (1992). For example, Yamaguchi et al. (2014) found that in a sample of American students, independent self-perception was related to self-criticism but, in a sample of Japanese students, only interdependent self-perception was associated with high levels of self-criticism. Based on these findings, the authors argue that dominance of cultural self-perception is associated with self-criticism (Yamaguchi et al. 2014).
Another example of research on psychopathology is from , who investigated the moderating effect of fear of receiving compassion on the association between self-criticism and depression. This large international study included a large multi-cultural city in Canada and midsized cities in Canada, England, and Portugal. There was a positive association between self-criticism and depression but the effect was more prominent for individuals who reported high rather than low levels of fear of receiving compassion from others.
Nonetheless, these findings must be considered with caution since measurement invariance of the tools used to assess self-criticism has generally not been tested. Measurement invariance means that an instrument measures the same property in different groups, and it is a prerequisite for identifying meaningful cultural differences since it indicates the degree to which participants from different cultures interpret constructs in the same way. Lack of measurement invariance means that a test is biased: respondents with a given level of a latent trait from one group provide systematically lower or higher responses than respondents with the same level of the latent trait from another group. This bias is induced by the test and does not express real differences.
Several different tools measuring self-criticism have been developed including the Depressive Experiences Questionnaire which assesses self-criticism, dependency, and self-efficacy (DEQ; Blatt et al. 1979), the Levels of Self-Criticism Scale (LOSC; Thompson and Zuroff 2004), the Forms of Self-criticising/Attacking & Self-Reassuring Scale (FSCRS; Gilbert et al. 2004), The Self-Critical Rumination Scale (Smart et al. 2016), and a situational measure labelled as The Self-Compassion and Self-Criticism Scales (SCCS; Falconer et al. 2015). Among the listed scales, to date only one study has reported measurement invariance in the LOSC (Thompson and Zuroff 2004) between Japanese and USA students (Yamaguchi et al. 2014). Although the psychometric features of the FSCRS have been thoroughly explored in different languages demonstrating good validity and reliability as well as consistent factor structure (Halamová et al. 2018), to our knowledge, no study to date has tested the cross-cultural measurement invariance of the FSCRS.

Aim of the Current Study
The present study investigates the measurement invariance of the dimensions of the FSCRS by means of Item Response Theory (IRT) differential test functioning in 13 samples from 12 different countries and eight language versions. The main objective of this study is to determine whether comparisons between total scores of the three dimensions of the FSCRS across countries and languages are appropriate and whether these findings about measurement invariance allow further cross-cultural research using the FSCRS.

Measure
The Forms of Self-criticising/Attacking & Self-Reassuring Scale (FSCRS; Gilbert et al. 2004) is a 22-item instrument, which was developed to assess levels of self-criticism and the ability to self-reassure when one faces setbacks and failure. Participants use a 5-point Likert scale to rate the extent to which various statements are true about them (1 = not at all like me; 5 = extremely like me). The scale comprises three subscales: Inadequate Self, which focuses on feelings of personal inadequacy, Hated Self measuring the desire to hurt or punish oneself, and Reassured Self which is an ability to reassure and support the self. Items for the three subscales are given in Table 1.
The construct validity of the FSCRS is evident when it is correlated with a one-dimensional self-criticism measure like the DEQ (Blatt et al. 1979) and a multidimensional measure like the LOSC (Thompson and Zuroff 2004). Correlations are in line with theoretical expectations, which indicates that all subscales of the FSCRS have good validity (Castilho et al. 2015; Gilbert et al. 2004; Halamová et al. 2017).
Some studies have demonstrated structural validity for the original three-factor solution of the FSCRS consisting of Hated self (HS), Inadequate self (IS) and Reassured self (RS) (Baião et al. 2015; Castilho et al. 2015; Kupeli et al. 2013). However, in more recent years research has favoured a two-factor solution consisting of self-criticism (IS + HS) and self-reassurance (RS), which suggests merging the IS and HS subscales into a global measure of self-criticism in non-clinical populations (Gilbert et al. 2006a, b; Halamová et al. 2018; Halamová et al. 2017; Richter et al. 2009; Rockliff et al. 2011).

Sampling Procedure
To collate data from a variety of countries and cultures we used Google Scholar to identify publications which used the terms "the forms of self-criticising/attacking & self-reassuring scale" or "fscrs". We contacted the authors of all relevant publications which reported on samples of at least 220 non-clinical participants so as to enable the planned statistical methods. The planned statistical approach requires at least ten participants per item (Velicer and Fava 1998) and thus for the 22-item FSCRS, we required data from a sample of 220 participants. In addition, we found planned and not yet published research projects from the Compassionate Mind Foundation website (https://compassionatemind.co.uk/uploads/files/research-register-for-website.pdf). Approximately 40 emails with requests for data were sent, from which thirteen data sets were received and included in the current analyses.

Table 1 Dimensions and scale items of The Forms of Self-Criticising/Attacking & Self-Reassuring Scale (Gilbert et al. 2004). Dimension: Self-criticism, Inadequate self. Scale items: 1. I am easily disappointed with myself. 2. There is a part of me that puts me down.

Sample Characteristics and Procedures
Out of eleven existing language versions of FSCRS currently available, this study includes data from eight (Halamová et al. 2018). The complete data set consists of five distinct English language samples from four different countries including Australia ( . In total, we tested thirteen distinct samples with an overall sample size of 7714. Sample characteristics for each of the samples are reported in Table 2. The data collected from these samples was in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Australia Sample
The participants were Australians selected from a larger sample of general population participants from several provinces (Kirby, personal communication). Convenience sampling was used to recruit participants to an online survey. Participants were students, who were recruited online through various university advertisements and the university pool of psychology research participants. Participants were required to be fluent in written English, and they received a small financial incentive or credit toward a course. The dataset comprised data collected from various research studies Zuroff 2016, 2017; Zuroff et al. 2016).

Netherlands Sample
A convenience sample of participants was recruited by various undergraduate students in an online cross-sectional survey conducted by a university in The Netherlands (Sommers-Spijkerman et al. 2018). The accuracy of the Dutch version of FSCRS was verified using back translation.

Israel Sample
The Israeli sample consisted of participants from the general population who were recruited via an online survey platform and by undergraduate students from a private college (Shahar et al. 2015; Shahar, personal communication). The Hebrew version of FSCRS was not back translated.

Italy Sample
This study (Petrocchi and Couyoumdjian 2016) was conducted through an online survey and participants were recruited via both an Italian university students mailing list, and other professional mailing lists and web advertising. The Italian version of FSCRS was back translated.

Japan Sample
The research sample from Japan consisted of students undertaking a degree in Psychology at University (Kenichi, personal communication). The Japanese version of FSCRS was not back translated.

Portugal Sample
The research sample from Portugal included participants recruited through convenience sampling using an online platform from a university setting and from the general community (Gilbert et al. 2017). The Portuguese version of FSCRS was back translated.

Slovakia Sample
Data were collected gradually over 2 years within a research grant focused on self-criticism and self-compassion (Halamová et al. 2017). Data were obtained by convenience sampling; questionnaires were distributed on paper and in an online form via social networks. The Slovak version of FSCRS was back translated.

Switzerland Sample
Participants were recruited in the German-speaking part of Switzerland through a study website and postings on internet forums (Krieger et al. 2016; Krieger, personal communication). The German version of FSCRS was back translated (Wiencke, personal communication).

Taiwan Sample
Participants from Taiwan were recruited from universities through social media and through word of mouth between students; they completed either an online survey or a paper and pencil version (Yu 2013). The Chinese version of FSCRS was back translated.

United Kingdom Sample 1
Participants from the first UK sample were recruited online from a university and the general population through social networking sites and health and well-being forums (Kupeli et al. 2013).

United Kingdom Sample 2
The second UK sample was recruited from an undergraduate course at a university. Participants completed pen and paper questionnaires. The dataset included data collected from various research studies (Baião et al. 2015; Gilbert et al. 2002, 2004, 2005, 2006a, b, 2012; Gilbert and Miles 2000).

USA Sample
The USA sample were students attending university (Gilbert et al. 2017). Participants were recruited via online participant management software. Psychology students received credits for their participation in the research study.

Data Analysis
In testing measurement invariance/equivalence, linear confirmatory factor analysis (CFA) is the common approach (Vandenberg and Lance 2000), in which some parameters (factor loadings, intercepts, residual variances) are constrained and the subsequent loss of fit is compared. Despite the advantages of IRT methods, these models are not used frequently to test measurement invariance. In the psychometric literature, there is an ongoing debate comparing these two approaches (Kankaraš et al. 2011; Kim and Yoon 2011; Meade and Lautenschlager 2004; Raju et al. 2002; Reise et al. 1993). While CFA models assume that the item responses are continuous and linearly related to the latent factor, IRT models assume the item responses are either nominal or ordinal. Unlike CFA models, IRT models are inherently non-linear, typically linking the latent trait to response probabilities through a logistic function. Furthermore, CFA models typically estimate a single intercept per item because they work on the assumption that the data are continuous. In contrast, IRT models typically estimate multiple parameters (thresholds) per item, analogous to item intercepts, because they treat polytomous data as categorical; as a consequence, IRT models are usually more sensitive to nuanced group differences such as differences in central tendency or the presence of extreme scores. Recent research shows that IRT models can detect nonequivalence in the intercept (threshold) and slope parameters relatively accurately, both at the scale and the item level (Kankaraš et al. 2011). On the other hand, CFA performs well only when nonequivalence is located in the slope parameters, and wrongly indicates nonequivalence in the slope parameters when nonequivalence is located in the intercept parameters (Kankaraš et al. 2011).
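To make the role of thresholds concrete, the following minimal sketch shows how one slope and several ordered thresholds produce category probabilities for a single ordinal item. It assumes a graded response model purely for illustration; the function name `grm_category_probs` and all parameter values are hypothetical and are not taken from the analyses reported here.

```python
import math

def grm_category_probs(theta, slope, thresholds):
    """Category probabilities for one ordinal item under a graded response model."""
    # Cumulative probabilities P(X >= k) for k = 1..K-1, each logistic in theta.
    cum = [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds]
    # Pad with P(X >= 0) = 1 and P(X >= K) = 0, then difference adjacent terms.
    bounds = [1.0] + cum + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]

# Hypothetical item: one slope and four ordered thresholds give five categories.
probs = grm_category_probs(theta=0.5, slope=1.4, thresholds=[-1.5, -0.5, 0.6, 1.8])

# Expected item score when categories are scored 0..4 (an assumption of this sketch).
expected_item_score = sum(k * p for k, p in enumerate(probs))

print([round(p, 3) for p in probs], round(expected_item_score, 3))
```

A linear CFA would summarise the same item with one intercept and one loading, whereas here each of the four thresholds can differ between groups.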
Some more advanced methods are available in CFA, especially the WLSMV estimator (weighted least squares means and variance adjusted), which estimates several thresholds instead of a single intercept (Muthén 1993; Beauducel and Herzberg 2006), but comparisons of this method to the IRT approach are sparse (e.g. Kim and Yoon 2011). Recently, a new and promising method for testing measurement invariance has been proposed, the alignment method (Asparouhov and Muthén 2014), and we will use this approach to compare latent means across cultures. FSCRS subscales (Inadequate Self, Hated Self, and Reassured Self) considered individually are unidimensional and moreover they share considerable variance, as shown in previous research by means of non-parametric IRT Mokken scale analysis (Halamová et al. 2018). Previously we performed the analyses for each population separately and therefore these results provide no information about whether the test scores are comparable across different populations. IRT models are better equipped than linear CFA models to explore this issue. The CFA measurement invariance analyses provide insights regarding the relationship between latent factors, so their use is preferable when the goal is to answer questions on the invariance of a multifactorial framework. IRT analyses are suitable when testing the invariance of single, unidimensional scales such as Inadequate Self, Hated Self, and Reassured Self.
In the context of IRT models, measurement equivalence is tested by inspecting differential item functioning (DIF), and/or differential test functioning (DTF). Differential item functioning (DIF) means that an item within the FSCRS questionnaire measures the constructs (Inadequate Self, Hated Self, and Reassured Self) differently for one population when compared with another. As a consequence, the presence of DIF compromises test validity. If this item bias accumulates to the extent that it produces biased overall test scores, a test will also display differential test functioning (DTF). DTF is present when respondents who have the same level of the latent construct, but belong to different groups, obtain different scores on the test.
DIF is routinely tested during scale construction and usually some method of purification is adopted; items with DIF are flagged and removed. However, if a test has many items (e.g., FSCRS has 22 items) and only some of them have DIF (see DeMars 2011), then the impact of these DIFs on the overall test score may be negligible. Moreover, there could be large DIF effects in favour of one population for some items, but these effects could simultaneously be cancelled out by DIF for other items in favour of other populations. Therefore, the presence of DIF for some items does not necessarily imply that the overall test itself is biased. On the other hand, it is also possible to have DTF in a situation where little or no DIF has been detected. Nontrivial DTF can occur when the item parameters systematically favour one group over another. Consequently, the aggregate of these small, nonsignificant differences at the item level can become substantial at the test level (Chalmers et al. 2016). DTF is more relevant for our purpose than DIF; we do not intend to inspect particular items on FSCRS subscales nor do we intend to improve them. Rather, we intend to test the assumption that the (expected) total score of the FSCRS subscales is equivalent across different populations, and therefore that only the latent trait, and not membership of a particular group, has any impact on the (expected) total score. IRT methods are usually used to detect item bias (DIF), but for practical purposes, detecting the construct bias (DTF) is more useful; item bias could be large, with many items with DIF detected, but construct bias could still be negligible, with no DTF detected.
Testing the DTF involves two statistical measures (Chalmers et al. 2016). The first, the signed DTF, tests whether there is any systematic scoring bias, indicating that one group consistently scores higher across a specified range of the latent trait. The second, the unsigned DTF, assesses whether the test curves (plots of expected total score against the latent trait) have a large degree of overall separation on average, suggesting that there may be substantial DTF at particular levels of the latent trait. The signed DTF values can range from −TS to TS (TS stands for the highest possible test score). Negative values of the signed DTF indicate that the reference group scores systematically lower than the focal group on average, while positive values indicate that the reference group scores higher. The unsigned DTF ranges from 0 to TS; it is zero only when the area between the two curves is zero, that is, when the test scoring functions have exactly the same functional form. The absolute value of the signed DTF is always lower than or equal to the unsigned DTF, and the two are equal when the curves do not cross. If there is a small value for the signed DTF and a large value for the unsigned DTF, the test curves intersect at one or more locations to create a balanced overall scoring, but there is substantial bias at particular levels of the latent trait.
If there is substantial (significant) bias in the signed DTF, a FSCRS subscale is not invariant across countries; we cannot meaningfully compare test scores obtained from different countries, since the same values of test scores from different countries correspond to different levels of the latent trait. This has many practical consequences, but the most important lesson is that it is misleading to naively compare test scores from countries where DTF was detected.
The alignment method (Asparouhov and Muthén 2014) searches for invariant item loadings and intercepts, and in turn latent means and standard deviations, using an alignment optimization function (e.g., a quadratic loss function). The advantage of this procedure is that all groups can be compared simultaneously, and it allows aligning and comparing latent means even if some loadings and intercepts are severely non-invariant. Its logic is similar to factor rotation; the function minimizes some non-invariances while leaving some of them large. A configural invariance CFA model is fitted, and its parameter estimates (factor loadings and intercepts) are used as input for the alignment procedure. Asparouhov and Muthén (2014) provide effect sizes of approximate invariance based on R², and also the average correlation of aligned item parameters among groups. All aligned item factor loadings are approximately invariant (metric invariance) if the R² for factor loadings is close to 1 and the average correlation of aligned factor loadings is large. All aligned item intercepts are approximately invariant (scalar invariance) if the R² for intercepts is close to 1 and the average correlation of aligned intercepts is large.
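The core idea of alignment can be sketched in a few lines. The example below is an assumption-laden illustration rather than the procedure as implemented in any package: it rescales configural loadings and intercepts by a candidate group mean and variance, and then evaluates a simple loss over pairwise parameter differences. The component loss (a fourth-root form with a small constant), the omission of sample-size weights, and all parameter values are choices made for this sketch; a real implementation would also optimise over the group means and variances instead of only evaluating two candidates.

```python
import math

def component_loss(x, eps=1e-4):
    # Assumed component loss: rewards many near-zero differences, tolerates a few large ones.
    return math.sqrt(math.sqrt(x * x + eps))

def aligned(loadings0, intercepts0, alpha, psi):
    """Rescale configural parameters for a group with factor mean alpha and variance psi."""
    sd = math.sqrt(psi)
    lam = [l / sd for l in loadings0]
    nu = [n - alpha * l / sd for l, n in zip(loadings0, intercepts0)]
    return lam, nu

def alignment_loss(groups, alphas, psis):
    # Sum the component loss over all item parameters and all pairs of groups.
    params = [aligned(l0, n0, a, p) for (l0, n0), a, p in zip(groups, alphas, psis)]
    loss = 0.0
    for g1 in range(len(params)):
        for g2 in range(g1 + 1, len(params)):
            for item in range(len(params[g1][0])):
                loss += component_loss(params[g1][0][item] - params[g2][0][item])
                loss += component_loss(params[g1][1][item] - params[g2][1][item])
    return loss

# Hypothetical invariant item parameters shared by both groups.
lam_true = [0.8, 1.0, 1.2]
nu_true = [2.0, 2.5, 3.0]

# Group 1 has factor mean 0 and variance 1; group 2 has mean 0.5 and variance 1.44.
# A configural model standardises each factor, so group 2's estimates look different.
group1 = (lam_true, nu_true)
group2 = ([l * 1.2 for l in lam_true], [n + 0.5 * l for l, n in zip(lam_true, nu_true)])

loss_naive = alignment_loss([group1, group2], alphas=[0.0, 0.0], psis=[1.0, 1.0])
loss_true = alignment_loss([group1, group2], alphas=[0.0, 0.5], psis=[1.0, 1.44])
print(round(loss_naive, 3), round(loss_true, 3))
```

The loss is lower at the generating mean and variance, which is why minimising it recovers latent means that make the item parameters as similar as possible across groups.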
Our analysis proceeded as follows:
1. Our procedure started with the identification of DIF, following which two randomly selected items with no DIF were used as anchors for DTF. If all items displayed DIF, two items were randomly selected as anchors for DTF (see Tables 3, 4 and 5). For DIF, we used the statistical program R (R Core Team 2017), package "lordif" (Choi et al. 2011).
2. We performed pairwise tests of DTF for all samples, separately for Inadequate Self, Hated Self, and Reassured Self. The total number of tests was 3 × ((13 × 12)/2) = 234 (see Tables 6, 7 and 8). We used the statistical program R (R Core Team 2017), package "mirt" (Chalmers 2012).
3. For samples with nonsignificant sDTF, we also report latent mean differences and their confidence intervals. Latent means in the reference group (first row) were constrained to zero, and latent means in the focal group (first column) were estimated (slopes and thresholds of items were constrained to be equal across countries). It must be highlighted that the DTF provides no information concerning the differences between countries in total scores; DTF only tests the assumption that these groups could be meaningfully compared, i.e. that their comparison would not be distorted. Only invariant samples, with no DTF present, can be meaningfully compared.
4. For all groups, we performed the alignment method proposed by Asparouhov and Muthén (2014), implemented in the R package "sirt" (Robitzsch 2018). A configural invariance CFA model was fitted, and its parameters (factor loadings and intercepts for each group) were used as input for the alignment procedure. Effect sizes R² for aligned factor loadings and intercepts are reported, as well as average correlations of aligned factor loadings and intercepts. Latent means and standard deviations for each subdimension and country are reported.
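The number of pairwise DTF tests in step 2 can be reproduced by enumerating the sample pairs. This short sketch uses the sample labels as they appear in the Results; the variable names are illustrative.

```python
from itertools import combinations

samples = ["Australia", "Canada", "Israel", "Italy", "Japan", "Netherlands",
           "Portugal", "Slovakia", "Switzerland", "Taiwan", "UK1", "UK2", "USA"]
subscales = ["Inadequate Self", "Hated Self", "Reassured Self"]

# Every unordered pair of distinct samples is one DTF comparison per subscale.
pairs = list(combinations(samples, 2))
n_tests = len(subscales) * len(pairs)

print(len(pairs), n_tests)  # 78 234
```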

Results
DIF testing showed (Tables 3, 4 and 5) that the number of items with DIF varied greatly among the samples, from no DIF detected to all items displaying DIF. The results suggest that the presence or absence of DIF is not a systematic predictor of DTF.
Out of 78 comparisons, 43 showed measurement equivalence (no DTF) for the Inadequate Self subscale (see Table 6), 61 for the Reassured Self subscale (see Table 7), and 65 for the Hated Self subscale (see Table 8). For the Inadequate Self subscale, the Australian sample was equivalent to 10 other samples; the Canadian, Italian, Slovak, UK1 and USA samples were each equivalent to 8 other samples; the Netherlands, Portugal, Switzerland and UK2 samples were each equivalent to 7 other samples; the Japan and Taiwan samples were each equivalent to 3 other samples; and the Israel sample was equivalent to 2 other samples. For the Reassured Self subscale, the Australian and Israel samples were equivalent to all 12 other samples; the Canadian and Portugal samples were each equivalent to 11 other samples; the Netherlands, UK2 and USA samples were each equivalent to 10 other samples; the Slovakia and Taiwan samples were each equivalent to 9 other samples; the Italy and Switzerland samples were each equivalent to 8 other samples; the UK1 sample was equivalent to 7 other samples; and the Japan sample was equivalent to 5 other samples. For the Hated Self subscale, the Italian, UK2 and USA samples were equivalent to all 12 other samples; the Australia, Israel, Netherlands and UK1 samples were each equivalent to 11 other samples; the Taiwan sample was equivalent to 10 other samples; the Canadian and Portugal samples were each equivalent to 9 other samples; the Japan and Slovak samples were each equivalent to 8 other samples; and the Switzerland sample was equivalent to 6 other samples.
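The per-sample counts and the subscale totals reported above can be cross-checked against each other: because every equivalent pair is counted once for each of its two members, the per-sample counts for a subscale must sum to twice the number of equivalent pairs. The sketch below performs this check with the counts from this paragraph; the dictionary names are illustrative.

```python
# Number of other samples each sample was equivalent to, as reported in the text.
inadequate = {"Australia": 10, "Canada": 8, "Italy": 8, "Slovakia": 8, "UK1": 8,
              "USA": 8, "Netherlands": 7, "Portugal": 7, "Switzerland": 7, "UK2": 7,
              "Japan": 3, "Taiwan": 3, "Israel": 2}
reassured = {"Australia": 12, "Israel": 12, "Canada": 11, "Portugal": 11,
             "Netherlands": 10, "UK2": 10, "USA": 10, "Slovakia": 9, "Taiwan": 9,
             "Italy": 8, "Switzerland": 8, "UK1": 7, "Japan": 5}
hated = {"Italy": 12, "UK2": 12, "USA": 12, "Australia": 11, "Israel": 11,
         "Netherlands": 11, "UK1": 11, "Taiwan": 10, "Canada": 9, "Portugal": 9,
         "Japan": 8, "Slovakia": 8, "Switzerland": 6}

def equivalent_pairs(counts):
    # Each equivalent pair contributes to two samples' counts, so halve the sum.
    total = sum(counts.values())
    assert total % 2 == 0, "per-sample counts must sum to an even number"
    return total // 2

totals = {name: equivalent_pairs(counts)
          for name, counts in [("Inadequate Self", inadequate),
                               ("Reassured Self", reassured),
                               ("Hated Self", hated)]}
print(totals)  # {'Inadequate Self': 43, 'Reassured Self': 61, 'Hated Self': 65}
```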
It should be noted that no transitivity can be assumed; for example, for the Hated Self subscale, both the Netherlands and Japan samples were equivalent to the Canadian sample, but they were not equivalent to one another. Therefore, we could not create a single linear rank based on the differences in latent means of equivalent samples, but rather clusters of mutually comparable samples. For example, again in the case of the Hated Self subscale, we could compare the Canada, Australia, UK1, UK2, Israel, Japan and USA samples because all were mutually equivalent. However, adding another sample, for example Switzerland, was not possible: it was equivalent with all other samples, but not with the Canadian sample. Therefore, we can compare the latent means of equivalent samples for each subscale (Tables 9, 10 and 11). These latent mean differences indicate that one population's answers are more or less self-critical than another's. Of course, we cannot exclude the possibility that answers from populations would not be significantly different. We note again that differences in latent means have nothing to do with and are orthogonal to measurement equivalence; rather, measurement equivalence is a necessary prerequisite for comparing latent means. Without measurement equivalence, any comparison between two populations would be distorted by the differential functioning of the test itself and therefore could not represent differences in the latent trait.
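Because equivalence is not transitive, comparable samples form clusters in which every pair is equivalent, not a single chain. The sketch below illustrates this with only the three-sample relation stated at the start of this paragraph (Netherlands and Japan each equivalent to Canada, but not to one another); it is a brute-force illustration with hypothetical function names, not the procedure used to build Tables 9, 10 and 11.

```python
from itertools import combinations

def mutually_equivalent(group, equivalent_pairs):
    # A group is comparable only if every pair within it is equivalent.
    return all(frozenset(p) in equivalent_pairs for p in combinations(group, 2))

def maximal_clusters(samples, equivalent_pairs):
    """Enumerate the largest mutually equivalent groups (brute force, small inputs only)."""
    clusters = []
    for size in range(len(samples), 0, -1):
        for group in combinations(samples, size):
            # Skip groups already contained in a larger cluster found earlier.
            if any(set(group) <= c for c in clusters):
                continue
            if mutually_equivalent(group, equivalent_pairs):
                clusters.append(set(group))
    return clusters

samples = ["Canada", "Netherlands", "Japan"]
equivalent_pairs = {frozenset({"Canada", "Netherlands"}),
                    frozenset({"Canada", "Japan"})}

clusters = maximal_clusters(samples, equivalent_pairs)
print(clusters)
```

Canada belongs to both resulting clusters, while the Netherlands and Japan never appear together, which is why a single ranking of all samples by latent mean is not possible.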
Table 9 Latent mean differences of invariant samples for the Inadequate Self subscale

Figure 1 shows the test score functions (expected total scores plotted against the latent trait θ) of the Israel and Switzerland samples for the Inadequate Self subscale (top) and the Reassured Self subscale (bottom). For the Reassured Self subscale (bottom), one can clearly see large differences between the curves for θ values from −4 to 0, and then from 0 to 4, but in the opposite direction. Although the differences were very large (the unsigned DTF was 0.82, which corresponds to 2.57% distortion), their impact on the difference in expected total score (the signed DTF) was only 0.13 and non-significant at the 0.05 level. We could conclude that no significant DTF was present at the total score level. However, there were differences at particular levels of the latent trait (θ); respondents with very high and very low levels of Reassured Self responded differently in the Israel and Switzerland samples, but in opposite directions, so the effect was cancelled out. With regard to the Inadequate Self subscale (top), the situation was very different; again, we could see a large difference between the curves for θ values from 0 to 4, but this difference was not compensated by a difference in the opposite direction between −4 and 0. The amount of difference was virtually the same as in the Reassured Self subscale (the unsigned DTF was 0.84, which corresponds to 2.34% distortion), but the lack of compensation led to a larger impact on the difference in expected total score; the signed DTF was 0.67 and significant at the 0.001 level. Each curve had a 95% confidence interval envelope.
It is clear after inspection that even very large differences at particular levels of θ might have a negligible effect on differences in expected total scores if they are compensated after the intersection of the test score functions. If the test score functions do not intersect, the unsigned DTF is equal to the absolute value of the signed DTF; this means that the reference group scores are systematically lower (or higher) than the focal group scores across the whole range of the latent trait.
For the Inadequate Self subscale, the effect size R² for aligned factor loadings was 0.985, the R² for aligned intercepts was 0.989, the average correlation of aligned factor loadings was 0.647 and the average correlation of aligned intercepts was 0.576. We can conclude that the alignment procedure successfully recovered approximate invariance. Latent means and their standard deviations are reported in Table 12.
For the Reassured Self subscale, the effect size R² for aligned factor loadings was 0.990, the R² for aligned intercepts was 0.996, the average correlation of aligned factor loadings was 0.492 and the average correlation of aligned intercepts was 0.855. We can conclude that the alignment procedure successfully recovered approximate invariance. Latent means and their standard deviations are reported in Table 12.
For the Hated Self subscale, the effect size R² for aligned factor loadings was 0.991, the R² for aligned intercepts was 0.966, the average correlation of aligned factor loadings was 0.434 and the average correlation of aligned intercepts was 0.370. We can conclude that the alignment procedure successfully recovered approximate invariance. Latent means and their standard deviations are reported in Table 12.

Discussion
The present study used IRT differential test functioning to test the measurement invariance of the dimensions of the FSCRS using 13 samples from 12 different countries and eight language versions. The results demonstrate that in the majority of comparisons there is high measurement equivalence between the different countries suggesting that in general the FSCRS subscales are valid and reliable instruments with substantial cross-cultural potential. Nevertheless, some comparisons resulted in a lack of measurement equivalence and therefore displayed differential test functioning. Additional research would be necessary to determine whether this lack of measurement equivalence was caused by shifts in linguistic meaning, possible translation issues, by real differences in levels of self-criticism/reassurance across countries, by peculiarities in sampling procedures or by differences in gender or age between samples.
We have to stress that the IRT method (DTF) used in this paper detects construct bias, not item bias. If some items are biased (DIF detected), it does not necessarily follow that construct bias (DTF) is present; that would happen only if the items were biased systematically in favour of one group. Conversely, and even more importantly, construct bias (DTF) could be present without any substantive item bias (no items with DIF detected): small differences in the functioning of particular items could be so systematic in favour of one group that they distort the construct and its test score. These situations have clear practical consequences: in the first case, this method can save the validity of the test score even if several items display substantive DIF (item bias); in the second case, this method can detect problems with the test score (construct bias) even if no item displays substantive DIF.
Hated Self was the most invariant of the three subscales, suggesting that self-hatred is described quite similarly across cultures. Reassured Self was also quite high in invariance, which means that it too is quite analogous across cultures. The Inadequate Self subscale was the least invariant across cultures, suggesting that the experience and intensity of inadequacy could be very different across cultures. One possible source of the variance in Inadequate Self across countries and languages, compared to Hated Self and Reassured Self, could be the diversity of the standards prescribed for people in different cultures around the world. According to our results, Israel, Japan and Taiwan are the countries with the most divergent perception of Inadequate Self. Japan and Taiwan scored the highest and Israel the lowest on the Inadequate Self subscale. In contrast, Australia is the country whose perception of Inadequate Self is most similar to that of the other countries assessed in this study.
Japan is the country with the most differing perception of Reassured Self, and Switzerland is the country with the most differing perception of Hated Self among the samples from different countries. In our research, the sample from Japan was the most self-critical (on both IS and HS) of the thirteen samples, which confirms previous research suggesting that the Japanese population is more self-critical than North Americans (Heine et al. 2000). Our findings also support distinctions between Eastern and Western countries (Heine and Hamamura 2007), with countries located in the East (such as Japan and Taiwan) being more self-critical than countries located in the West. It is interesting that these differences between Japan and the USA, or between Eastern and Western countries, are present despite the use of a specific definition of self-criticism. We did not assume that self-criticism is a general sensitivity to negative self-relevant information (Kitayama et al. 1997), but rather that it stems from constant and harsh self-scrutiny and evaluation and from feelings of unworthiness, inferiority, failure, and guilt (Blatt and Zuroff 1992). Interestingly, Taiwan is the second most self-critical country among all the analysed countries, but it is also quite high in self-reassurance. Israel, in contrast, is the most self-reassured and the least self-critical country.
The main limitation of our study was that the FSCRS is a self-report tool, and therefore participants may have been influenced to respond in a socially desirable manner, which may vary between cultures. Also, the samples were recruited mainly online but also in paper-and-pencil form, so the different modes of data collection could have influenced the findings.
As self-criticism is a construct of high clinical importance, improving understanding of its cross-cultural similarities and differences, as measured by the three subscales of the FSCRS, would have great impact on practice. This is because a negative relation to oneself in the form of excessive self-critical inner dialogue is one of the most important psychological processes that influence susceptibility to, and persistence of, psychopathology (Falconer et al. 2015) and stress (Kupeli et al. 2017). Self-reassurance, which is closely related to self-compassion (Kupeli et al. 2013), is of great importance in its own right. It is also the target of many outcome studies conducted worldwide (Kirby 2016), so its measurement needs to be understood as well. A tool which is sensitive to these small but important differences would therefore be very useful in evaluating interventions. Thus, understanding the differences in self-criticism across countries can help to inform more effective practices in both medical treatment and psychotherapy.
We did not attempt to provide a systematic interpretation of these differences, except for those between Eastern and Western countries; far more detailed research is required to do so. However, we could see that the mutually equivalent samples did not follow any discernible cultural, linguistic or geographical continuum that could explain the clusters of mutually equivalent countries.

Conclusion
This study contributes to the growing body of knowledge about the similarities and differences among cultures with respect to the three subscales of the FSCRS: Hated Self, Inadequate Self and Reassured Self. Our study revealed significant cross-cultural similarities and differences in the way these constructs are measured by the subscales of the FSCRS. Interestingly, these differences are far larger for Inadequate Self than for Hated Self and Reassured Self, which seem to be quite invariant across cultures. One reason may be that self-hatred taps into a pathological dimension and self-reassurance taps into a health dimension, both of which are indeed culturally invariant, whereas inadequate self taps into a competitive or social rank dimension that is more culturally bound. Hence, cultures that seem more collective may also be more sensitive to shame and stigma and the negative evaluation of others. This may partly explain why individuals from the Japanese culture report more self-criticism, because they may be more sensitive to social evaluation and social place. Although the items of the FSCRS are not related to specific events, it may be that different types of events are more susceptible to self-criticism in different cultures, and this would need to be explored.
Although the FSCRS subscales are generally valid and reliable instruments with substantial potential for use cross-culturally, the three subscales were not perfectly invariant across all countries and groups. In view of the culturally and linguistically different expressions of self-criticism and self-reassurance that were observed, future cross-cultural testing of the meanings and connotations of these constructs is necessary. An important direction for future research is to investigate the factors responsible for the observed non-equivalences. Cross-cultural researchers must also continue to bear in mind that it is only possible to compare mean scores across countries which were found to be invariant.

Availability of Data and Materials
In order to comply with the ethics approvals of the study protocols, data cannot be made accessible through a public repository. However, data are available upon request for researchers who consent to adhering to the ethical regulations for confidential data.

Compliance with Ethical Standards
Conflict of interest The authors declare that they have no potential conflicts of interest.
Ethical Approval All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Informed Consent
Informed consent was obtained from all individual participants included in the study.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.