Applying Rasch Methodology to Examine and Enhance Precision of the Baby Care Questionnaire

The Baby Care Questionnaire (BCQ) is an established ordinal measure of parenting beliefs about infant care, which includes structure and attunement scales. Rasch analysis is a powerful method to examine and improve psychometric properties of ordinal scales. This study aimed to evaluate the psychometric properties and improve precision of the structure and attunement scales of the BCQ using Rasch methodology. A Partial Credit Rasch model was applied to a sample of 450 mothers from the United Kingdom (n = 225) and New Zealand (n = 225) who completed the BCQ. Initial analyses indicated acceptable reliability of the structure and attunement scales of the BCQ, but some items showed misfit to the Rasch model associated with local dependency issues in both scales. After combining locally dependent items into testlets, both scales of the BCQ met expectations of the unidimensional Rasch model and demonstrated adequate and strong reliability and invariance across countries and person factors such as participants' age and their baby's sex. This permitted the generation of conversion algorithms to transform ordinal scores into interval data to enhance the precision of both scales of the BCQ. In conclusion, using Rasch methodology, this study demonstrated robust psychometric properties of the structure and attunement scales of the BCQ after minor modifications. The ordinal-to-interval conversion tables published here can be used to further enhance the precision of the structure and attunement scales of the BCQ without changing their original response format. These findings contribute to the enhancement of precision in measuring parenting beliefs about infant care.

• The Baby Care Questionnaire (BCQ) assesses parenting beliefs on infant care through structure and attunement scales.
• Rasch methodology assessed BCQ's psychometric properties, refining structure and attunement scales for interval-level precision.
• By creating testlets, both BCQ scales demonstrated strong reliability and invariance across countries and person factors.
• Conversion algorithms converted ordinal scores to interval scores to enhance the precision of both scales of the BCQ.

Measurement of beliefs about parenting and what babies need is important because these beliefs, and how they change during the transition to parenthood, are an important window into understanding parenting behaviours, wellbeing, and their impact on child outcomes (Galbally et al., 2018; Hughes et al., 2012; Smetana, 2017; Tikotzky & Sadeh, 2009). The Baby Care Questionnaire (BCQ; Winstanley & Gattis, 2013) is a valid and reliable psychometric assessment tool to measure parenting beliefs about infant care across fundamental care domains. However, this parenting belief measure is an ordinal scale, which has limitations (Tennant & Conaghan, 2007). Technically, scores collected from ordinal measurement are inappropriate for parametric statistical testing because ordinal data cannot be added, subtracted, multiplied or divided, so the calculation of means and standard deviations is invalid (Linacre, 2004; Merbitz et al., 1989; Norquist et al., 2004). Rasch methodology (Rasch, 1960, 1961) offers a way to enhance the psychometric properties of an ordinal scale and to estimate interval-level measurement, provided the strict assumptions of the unidimensional Rasch model are met (Hobart & Cano, 2009; Merkin et al., 2020; Tennant & Conaghan, 2007). This method can be used to develop ordinal-to-interval conversion algorithms that transform ordinal scores into interval-level scores, which provide a better way to monitor changes in parenting beliefs about infant care and allow parametric statistics to be used more confidently to evaluate the relationships between parenting beliefs and other key variables (Linacre, 2004).

The Baby Care Questionnaire
The BCQ (Winstanley & Gattis, 2013) measures parenting beliefs about infant care using two scales, structure and attunement, across different contexts such as sleeping, feeding, and soothing. The attunement scale measures belief in the value of reading and responding to infant cues and identifying infants' needs and states (e.g., hunger versus satiety, distress versus calm). The structure scale measures belief in the value of regularity and routines in infant care (e.g., time schedules for breastfeeding and sleeping). Parents are asked to rate their agreement versus disagreement using a four-point Likert scale (1 = strongly disagree to 4 = strongly agree) on each individual item of the two scales, such as "When babies cry in the night to check if someone is near, it is best to leave them" (negatively coded attunement item) and "It is important to introduce a sleeping schedule as early as possible" (positively coded structure item). Studies demonstrated good psychometric properties of the BCQ (Gattis et al., 2022; Winstanley & Gattis, 2013). Exploratory and confirmatory factor analyses confirmed the structural validity of the two independent scales of the BCQ, attunement and structure (Winstanley & Gattis, 2013). Studies also demonstrated that each scale achieved strong reliability and validity as measures of parenting beliefs about infant care (Gattis et al., 2022; Mascheroni et al., 2022; Winstanley & Gattis, 2013; Winstanley et al., 2014). Specifically, research indicated that both structure and attunement belief scores captured by the BCQ were related to the frequency and duration of responsive and demanding parenting behaviours (Gattis et al., 2022). In addition, attunement and structure beliefs were related to parenting experience. For attunement, pregnant women who were experienced mothers (i.e., already had at least one child) had higher scale scores compared to pregnant women expecting their first child, while for structure, pregnant women expecting their first child had higher scale scores compared to pregnant women who were experienced mothers (Mascheroni et al., 2022).
Even though both scales of the BCQ are well-validated measures of parenting beliefs about infant care, these scales are still ordinal measures and therefore have some limitations (Tennant & Conaghan, 2007). That is, the distances between ordinal response categories of individual items, such as 1 and 2 versus 2 and 3, are not the same, meaning that an ordinal scale is unable to reflect actual change as accurately as an interval scale does (Masters, 1982; Truong et al., 2021). In addition, ordinal scores cannot be used where parametric statistics (e.g., means and standard deviations) are required, as they do not meet the arithmetic assumptions of those statistics. Studies have shown that using parametric statistics with ordinal scores raises concerns about whether correct inferences can be drawn, which potentially impairs the control of Type I and Type II errors, and statistical power (Zumbo & Zimmerman, 1993; Verhulst & Neale, 2021). The use of interval-level data can minimise these concerns by improving reliability and internal validity of measurement (Jamieson, 2004; Merbitz et al., 1989). Rasch analysis (Rasch, 1960) is a method to resolve such issues and has been increasingly used to investigate psychometric properties of measures and improve their accuracy to approximate an interval-level scale (Hobart & Cano, 2009; Lundgren-Nilsson & Tennant, 2011).

Rasch Analysis
Unlike other statistical methods, such as classical test theory and generalisability theory (Cronbach et al., 1963), Rasch methodology can accurately estimate the unique contributions of individual items to the overall latent variable (e.g., structure or attunement beliefs) based on sample parameters (Fox & Jones, 1998; Rasch, 1960, 1961). A Rasch model is unidimensional and assumes that the response to a specific scale item is a function of that item's difficulty and the respondent's ability (Rasch, 1960, 1961). Rasch analysis also assumes scale invariance, which is tested by investigating Differential Item Functioning (DIF) due to personal characteristics (e.g., mother's age, baby's sex). DIF is useful to test whether an item of the measure works equally well across sub-groups within the population (Hagquist & Andrich, 2004).
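For the simplest, dichotomous case, the dependence of a response on item difficulty and respondent ability can be written compactly. The notation below is an assumption introduced here for illustration and follows the common textbook form: θ_n is the location of person n and b_i the difficulty of item i, both in logits.

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```

When θ_n equals b_i the probability of endorsing the item is 0.5, and it rises as the person location exceeds the item difficulty.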
Importantly, when the Rasch model fit is satisfactory, the ordinal scores of a psychometric measure can be converted into interval-level scores (Barber et al., 2022; Linacre, 2004; Norquist et al., 2004; Truong et al., 2023). This is the key advantage of the Rasch model over classical test theory methods because the interval-transformed data will reflect changes on a latent trait more accurately, similar to other interval-level scales such as blood pressure or height (Hobart & Cano, 2009; Rasch, 1960; Tennant & Conaghan, 2007; Wilson, 2004; Wright & Stone, 1979). Additionally, an item-person threshold distribution plotted from the Rasch analysis is another useful tool. This graph is useful to detect possible significant ceiling or floor effects (Medvedev & Krägeloh, 2022). Thus, Rasch analysis can be considered the most advanced statistical methodology to precisely evaluate the reliability and validity of an ordinal measure, as well as enhance its precision to approximate an interval-level scale.

Our Study
Our study used Rasch methodology to evaluate the psychometric properties of the structure and attunement scales of the BCQ and enhance their precision in assessing parenting beliefs about infant care. Our study also generated conversion tables to transform ordinal structure and attunement scores to approximate interval-level scales.

Participants
The optimal sample size for Rasch analysis using RUMM2030 software is between 250 and 500 cases to minimise both Type I and Type II errors (Hagell & Westergren, 2016). Hagell and Westergren (2016) suggested that sample sizes larger than 500 cases inflate chi-square statistics, leading to Type I error. Conversely, Type II error is likely if samples are below 250 cases due to limited information for item calibration. This means that sample sizes of around n = 250 to n = 500 are effective in balancing the statistical interpretation of RUMM fit statistics, particularly for minimising Type I and Type II errors under the assumption of Rasch model fit.
To achieve an optimal sample size for Rasch analyses in order to investigate scale invariance between countries, we randomly selected 225 participants from each sample. Our study, with a sample size of 450 for Rasch analysis, aligns with this recommendation and is therefore well-positioned to provide reliable Rasch analysis results using the RUMM2030 software. Our sample size estimates, while based on dichotomous scales investigated by Hagell and Westergren (2016), are applicable to polytomous scales as the core principles of Rasch analysis and the chi-square statistics used in RUMM2030 for assessing model fit are fundamentally similar across both scale types, focusing on the relationship between item difficulty and respondent ability irrespective of the number of response options.
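The random selection of an equally sized subsample from each country can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name, the ID format and the seed are hypothetical, and only the UK sample size (656) and the subsample size (225) come from the text.

```python
import random


def draw_subsample(participant_ids, n, seed=None):
    """Return n participant IDs drawn at random without replacement."""
    ids = list(participant_ids)
    if n > len(ids):
        raise ValueError("subsample is larger than the available sample")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(ids, n)


# Hypothetical IDs standing in for the 656 UK mothers.
uk_ids = [f"UK{i:03d}" for i in range(1, 657)]
uk_subsample = draw_subsample(uk_ids, 225, seed=1)
```

Fixing the seed is a design choice that lets the same subsample be reconstructed later; the same call would be repeated for the second country's list of IDs.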
The demographic details of the participants in the Rasch sample and the subsamples of each country, and the results of chi-square and other statistical tests comparing differences in these demographics between subsamples, are presented in Table 1. Some demographic data were missing in the Rasch sample, but the amount was negligible. These included two mothers who did not report their child's birth order, 13 who did not report their child's age, and 4 who did not report their child's gender. Statistical comparison tests indicated that there were no significant differences for mother's age or child gender ratio between the randomly selected UK and NZ subsamples. However, in the UK subsample, there were more first-time parents and more parents of younger infants compared to the NZ subsample.
Figure 1 presents the CONSORT diagram of how participants were selected for Rasch analyses from two studies: one conducted in the United Kingdom (UK) and another in New Zealand (NZ). A CONSORT diagram was used as a standardised visual representation that can display the flow of participants through different samples. It provides a comprehensive overview of participant enrolment, allocation, follow-up, and analysis. The UK sample consisted of 656 mothers who were recruited through advertisements on parenting websites and social media including BabyCentre, Facebook, and Twitter. UK participants completed a Qualtrics survey that included brief demographics and the Baby Care Questionnaire, and they were rewarded with a

Measure
The Baby Care Questionnaire (BCQ; Winstanley & Gattis, 2013) measures parenting beliefs about infant care using two scales: attunement and structure. Each individual item of the BCQ uses a four-point Likert scale from 1 = strongly disagree to 4 = strongly agree. The structure scale consisted of 17 items and the structure score was the mean of the structure items. The structure scale showed strong internal consistency, with McDonald's omega above 0.88 in the UK and NZ samples. The attunement scale consisted of 13 items and the attunement score was the mean of the attunement items. The attunement scale showed strong internal consistency, with McDonald's omega above 0.83 in the UK and NZ samples.

Data Analyses
IBM SPSS v.28 was used to obtain descriptive statistics including mean and standard deviation (SD), as well as to compute McDonald's omega for the structure and attunement scales of the BCQ because McDonald's omega was not available in previous versions of IBM SPSS. The RUMM2030 software package (Andrich et al., 2009) was used to conduct Rasch analyses based on the standardised criteria for the Rasch model fit as recommended elsewhere (Leung et al., 2014; Tennant & Conaghan, 2007).
Prior to Rasch analysis, likelihood-ratio tests were conducted to examine distances between thresholds of individual items because individual items of the BCQ were polytomous. If likelihood-ratio tests showed significant differences between thresholds across individual items of both BCQ scales, the Partial Credit model (Masters, 1982) should be applied (Lundgren-Nilsson & Tennant, 2011). Otherwise, the Rating Scale model (Andrich, 1978) would be more appropriate (Tennant & Conaghan, 2007). The difference between these two polytomous-item Rasch models is that the Partial Credit model is unrestricted and assumes that each item of a scale has its own response category structure, whereas the Rating Scale model assumes that response options across all individual items of a scale have the same rating response category structure (Linacre, 2000).
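The distinction between the two polytomous models can be made concrete with a small sketch of how category probabilities are computed. This is an illustration under stated assumptions, not RUMM2030's implementation: the function names and all numeric inputs are hypothetical, and scores are coded 0 to m, so a four-point item has three thresholds.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Probabilities of scoring 0..m on one item under the Partial Credit model.

    theta: person location in logits.
    thresholds: the item's own m thresholds in logits.
    """
    # Exponent for score x is the sum of (theta - tau_k) for k = 1..x;
    # the score-0 term is an empty sum, i.e. 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def rating_scale_thresholds(item_location, shared_steps):
    """Rating Scale model: every item reuses the same step structure,
    shifted by the item's own location."""
    return [item_location + step for step in shared_steps]
```

Under the Partial Credit model each item is passed its own list of thresholds, whereas under the Rating Scale model the thresholds of every item would be built from one shared list of steps via `rating_scale_thresholds`.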
Rasch analysis was then conducted iteratively until achieving the best model fit. There are several requirements for the overall Rasch model fit that generally describe parameters of an interval scale. The first requires a nonsignificant chi-square goodness of fit estimate of item-trait interaction (p > .05), as this statistic summarises individual item fit (Linacre, 2002). The second requires no item misfit, which can be detected by observing fit residuals for individual items that should be in the range of ±2.50 (Lundgren-Nilsson & Tennant, 2011). The third requires no local dependency between items and can be verified by examining the residual correlations between individual items where expected values should be below 0.20 (Christensen et al., 2013). The fourth requires invariance or no DIF due to personal factors (e.g., age, sex) to ensure that individual items work equally well across different groups of people (Hagquist & Andrich, 2004). The last requirement is unidimensionality, which is typically examined by the principal components analysis (PCA) of the residuals and the equating t-test (Leung et al., 2014). T-tests were conducted to compare person estimates for two sets of items, one with the highest and another with the lowest loadings, on the first principal component of residuals. Unidimensionality is evident when the proportion of significant t-test comparisons does not exceed 5% or the lower bound of the binomial confidence interval calculated for significant t-tests is at or below 5% (Smith, 2002). Additionally, the reliability estimate Person Separation Index (PSI) used in Rasch analysis, although not a criterion for the Rasch model fit, reflects how well the measure discriminates between individuals with different levels of the latent trait (e.g., attunement beliefs). The interpretation of PSI is similar to Cronbach's alpha or omega, with values of 0.70-0.79 indicating acceptable reliability and 0.80 and higher indicating good to excellent reliability (Andrich et al., 2009; Medvedev et al., 2018a, 2018b).
Fig. 1 CONSORT diagram for participants selected for Rasch analysis of the structure and attunement scales of the BCQ
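The unidimensionality criterion described above (no more than 5% significant t-tests, or a binomial confidence interval whose lower bound is at or below 5%) can be sketched as follows. The normal approximation to the binomial interval is an assumption made here for simplicity; the text does not state which interval was computed, and the function name and counts in the usage line are hypothetical.

```python
import math


def unidimensionality_check(n_significant, n_tests, cutoff=0.05, z=1.96):
    """Apply the equating t-test criterion to a count of significant tests.

    Uses a normal-approximation 95% binomial confidence interval (assumed).
    """
    proportion = n_significant / n_tests
    half_width = z * math.sqrt(proportion * (1 - proportion) / n_tests)
    ci_lower = max(0.0, proportion - half_width)
    return {
        "proportion": proportion,
        "ci_lower": ci_lower,
        # Supported if either the proportion or the lower bound is <= cutoff.
        "supported": proportion <= cutoff or ci_lower <= cutoff,
    }


# Hypothetical count: 30 significant t-tests out of 450 person comparisons.
result = unidimensionality_check(30, 450)
```

Here the observed proportion (about 6.7%) exceeds 5%, but the lower bound of the interval falls below 5%, so the criterion would still be met.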
To ensure the consistency of item responses across various personal factors such as country, language version, or gender, we conducted Differential Item Functioning (DIF) analysis. DIF occurs when participants belonging to different sub-groups (such as individuals from different countries), despite having comparable levels of the underlying trait (like structure or attunement), exhibit distinct responses to the same item (Medvedev & Krägeloh, 2022). The process of DIF testing involved the use of group-based ANOVA on standardised residuals and Bonferroni-adjusted t-tests for specific subgroup comparisons, as suggested by Andrich and Marais (2019). Furthermore, we visually inspected individual item plots, namely the item characteristic curves (ICCs). In cases where significant DIF was detected for any item or testlet (as detailed in the following paragraph), a thorough visual analysis of the plot was undertaken to identify whether the DIF was uniform (indicating negligible group overlap across all trait levels) or nonuniform (indicating notable overlap).
In this study, we considered combining individual items into testlets to reduce measurement error due to local dependency, DIF and item misfit and to improve the overall Rasch model fit; this is a well-established methodology (Lundgren-Nilsson et al., 2013; Medvedev et al., 2018a, 2018b; Merkin et al., 2020; Truong et al., 2023; Truong, Numbers, et al., 2023). Testlets were created and combined based on local dependency reflected by residual correlations exceeding 0.20 in magnitude because local dependency produces spurious correlations and artificially affects model fit and dimensionality. We considered the magnitude of residual correlations first when creating testlets. The effect of combining individual items into testlets to reduce undesired error variances is similar to the signal-averaging technique used to study event-related potentials using electroencephalogram (EEG). In this technique, the test is repeated several times until error variances nullify each other, allowing researchers to identify the relevant neural response (Luck, 2014). Similarly, instead of replicating the test, combining multiple items that measure the same construct can eliminate unwanted variances that are not related to the construct being measured (Truong et al., 2023). Therefore, if the best fit to the Rasch model is achieved after creating testlets, it indicates that measurement errors were successfully decreased using this approach.
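The two steps of the testlet approach, flagging locally dependent item pairs and then summing items into testlets, can be sketched as below. This is a simplified illustration: the function names, the residual correlations and the responses are hypothetical, item scores are assumed to be already reverse-coded where needed, and only the 0.20 cutoff and the attunement testlet membership are taken from the text.

```python
def flag_local_dependency(residual_correlations, cutoff=0.20):
    """Return item pairs whose residual correlation exceeds the cutoff in magnitude.

    residual_correlations: dict mapping (item_a, item_b) -> correlation.
    """
    return sorted(pair for pair, r in residual_correlations.items() if abs(r) > cutoff)


def score_testlets(responses, testlets):
    """Sum one respondent's item scores within each testlet.

    responses: dict mapping item label -> ordinal score.
    testlets: dict mapping testlet name -> list of item labels.
    """
    return {name: sum(responses[item] for item in items) for name, items in testlets.items()}


# Attunement testlets as reported in the Results section.
ATTUNEMENT_TESTLETS = {
    "Testlet 1": ["A1", "A4", "A6", "A7", "A8", "A9", "A12", "A13"],
    "Testlet 2": ["A2", "A3", "A5", "A10", "A11"],
}

# Hypothetical respondent who scored 2 on every attunement item.
example_responses = {f"A{i}": 2 for i in range(1, 14)}
example_scores = score_testlets(example_responses, ATTUNEMENT_TESTLETS)
```

Each testlet is then treated as a single polytomous item in the subsequent analysis, which is why dependency within a testlet no longer inflates the fit statistics.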
When the data fit the Rasch model, the person-item threshold distribution was examined to see how well item thresholds of each BCQ scale covered the levels of parenting beliefs about infant care in the sample. Finally, conversion tables for the structure and attunement scales of the BCQ were generated to convert ordinal raw scores into interval-level data.

Results
The Unrestricted Rasch Partial Credit model (Masters, 1982) was used for the data of this study because likelihood-ratio tests showed significant differences between thresholds across individual items of both BCQ scales (p's < 0.001).

Structure Scale
Table 2 displays the overall model fit estimates of the initial and final Rasch analyses of the structure scale of the BCQ. As can be seen, the initial analysis indicated reasonable reliability (PSI = 0.67) for the structure scale but the overall fit to the Rasch model was unacceptable due to significant chi-square, χ²(153) = 626.01, p < 0.001, and lack of evidence for unidimensionality (13.1% of significant t-tests). Table 3 displays individual item statistics including location, item-fit residual, and chi-square values for the initial analysis of the structure scale of the BCQ. Two items showed significant misfit to the model: items S5 and S9.
The residual correlation matrix revealed local dependency for several groups of items as indicated by their  ). This analysis resulted in acceptable unidimensionality (5.0% of significant t-tests) with retained high common variance in the estimate of A = 1.16, supporting unidimensionality of the obtained bifactor solution. This analysis also increased reliability (PSI = 0.78) and showed no local dependency, no item misfit (see Table 4, Structure) and no DIF by personal factors. Therefore, this final analysis achieved the best Rasch model fit for the structure scale of the BCQ. The person-item threshold distribution for the structure scale from the analysis of the best model fit (Final analysis) is presented in Fig. 2. This shows that the structure scale's thresholds satisfactorily cover the levels of parenting beliefs about infant care in the sample and there are no significant ceiling or floor effects. Table 5 displays the ordinal-to-interval conversion table for the structure scale of the BCQ based on person estimates of the final fit to the Rasch model.

Attunement Scale
Table 2 also shows the overall model fit estimates of the initial and final Rasch analyses of the attunement scale of the BCQ. As can be seen, the initial analysis indicated strong reliability of PSI = 0.85 for the attunement scale but the overall fit to the Rasch model was unacceptable due to significant chi-square, χ²(117) = 602.95, p < 0.001, and a lack of unidimensionality (13.1% of t-tests were significant). Table 3 also displays statistics of item location, item-fit residual, and chi-square values for the initial analysis of the individual items in the attunement scale. There were several items with significant misfit to the model: items A1, A2, A4, A7, A10, A12, and A13. In addition, most attunement items showed DIF by country, with the exception of item A12. As in the structure scale analysis, we applied the testlet approach as it can resolve both item misfit and DIF. Two testlets of the attunement scale were thus created by considering residual correlations between items: Testlet 1 (items A1, A4, A6, A7, A8, A9, A12, & A13), and Testlet 2 (A2, A3, A5, A10, & A11). This resulted in the best overall model fit (Table 2, Attunement, Final analysis) with no local dependency, no item misfit (see Table 4, Attunement), no DIF amongst testlets, acceptable unidimensionality (4.9% of significant t-tests) and retained large common variance in the estimate of A = 1.08, supporting unidimensionality of the current bifactor solution with testlets, as well as improved reliability (PSI = 0.92) compared to the initial analysis.
The person-item threshold distribution for the attunement scale from the analysis of the best model fit (Final analysis) is presented in Fig. 3. This shows that the attunement scale's thresholds satisfactorily cover the levels of parenting beliefs about infant care in the sample and there are no significant ceiling or floor effects. Table 6 displays the ordinal-to-interval conversion table for the attunement scale of the BCQ based on person estimates of the final fit to the Rasch model.

Rasch Interval-Transformed Scores
Table 7 includes the score ranges, means and standard errors of the BCQ ordinal raw scores and interval-transformed Rasch scores for the UK and NZ original samples. As can be seen, all standard errors of ordinal raw scores of both structure and attunement scales were consistently higher than those of Rasch-transformed scores in both original samples. In addition, there were significant differences in attunement ordinal raw scores and interval-transformed Rasch scores between the original UK and NZ samples (p's < 0.001). However, attunement interval-transformed scores (Cohen's d = 2.69) showed a lower effect size compared to ordinal raw scores (Cohen's d = 3.44). Together, these findings illustrate that Rasch-transformed scores reduce the magnitude of the difference, which derives from the smaller standard error.
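The two statistics compared above, the standard error of the mean and Cohen's d, can be computed as in the following sketch. The pooled-standard-deviation form of Cohen's d is an assumption, since the text does not state which variant was used, and the function names and input values are hypothetical.

```python
import math
from statistics import mean, stdev, variance


def standard_error(values):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return stdev(values) / math.sqrt(len(values))


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_variance)


# Hypothetical scores for two small groups.
d = cohens_d([1, 2, 3], [2, 3, 4])
```

Because the pooled standard deviation sits in the denominator, any transformation that changes the spread of scores relative to the mean difference will change d, which is relevant when comparing ordinal and interval-transformed versions of the same data.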

Discussion
This study utilised Rasch methodology to evaluate and improve the psychometric properties of the structure and attunement scales of the Baby Care Questionnaire (BCQ), a well-established ordinal measure of parenting beliefs about infant care. The use of Rasch methodology allowed for the identification of locally dependent items that were combined into testlets to enhance the precision of both scales of the BCQ. We demonstrated that these two BCQ scales, reorganised into testlets, met expectations of the Rasch model, which then allowed us to enhance the accuracy of each scale to a level comparable to interval-level measurement using the ordinal-to-interval conversion algorithms presented in Tables 5 and 6 for the structure and attunement scales respectively. The resulting conversion algorithms can be used to transform the ordinal scores obtained from the BCQ into interval data, which can increase the precision of the measure without changing its original response format. This can be particularly useful in longitudinal studies that track changes in parenting beliefs over time, as well as in cross-cultural research that compares parenting beliefs across different populations. These ordinal-to-interval conversions are important because individual items of each scale contribute differently to the total score of each scale, which should be accounted for (Stucki et al., 1996). Sandham et al. (2019) used a metaphor of squeezing Vitamin C from different fruits to explain how the Rasch model works. Each fruit represents an item on the scale, and each fruit contains differing levels of Vitamin C, which represents the latent trait being measured (e.g., parenting beliefs about infant care). Just as different fruits contribute different amounts of Vitamin C to a smoothie, different items on a scale contribute different amounts of the latent trait to the overall score. Rasch analysis allows for the measurement of the latent construct to the extent that it is contained within each item while filtering out other constructs, manifesting as fit residuals in the model. This increases the precision of measurement and allows for comparisons of accuracy between original ordinal-level scores and their Rasch-converted interval equivalents (Norquist et al., 2004).
Literature reviews on Rasch methodology concluded that using Rasch interval-transformed scores can reduce measurement error associated with ordinal scores (Truong et al., 2023). Therefore, interval-transformed scores may reflect the levels of parenting beliefs about infant care more accurately, as well as avoiding the violation of arithmetic assumptions when conducting parametric statistical tests (Leung, 2011). In addition, achieving adequate reliability (PSI = 0.78) for the structure scale and excellent reliability (PSI = 0.92) for the attunement scale adds empirical evidence to support the robust reliability of these BCQ scales in measuring parenting beliefs about infant care. Interval-transformed data can also be appropriate for conducting statistical comparisons with other interval data, such as electrophysiological or neuroimaging data, contributing to improvement of reliability and validity of research results. Therefore, the assumptions of parametric statistical tests can be met using the interval BCQ data (Leung, 2011). Transforming the ordinal scores of both structure and attunement scales of the BCQ into interval scores does not require users to be experts in statistics because we have developed conversion algorithms based on Rasch model estimates. To interpret ordinal-level scores for each scale in our tables (i.e., structure scale in Table 5, attunement scale in Table 6), researchers can use the corresponding interval scale scores found on the right-hand side of each table.
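Applying a conversion table amounts to a simple lookup from each ordinal raw score to its interval-level equivalent, as the sketch below shows. The table values are placeholders invented for illustration only and are NOT taken from Tables 5 or 6; the function names are hypothetical, and whether the published tables are keyed by summed or mean raw scores should be checked against the tables themselves.

```python
# Placeholder values invented for illustration; NOT the published conversion values.
PLACEHOLDER_TABLE = {17: 17.0, 18: 21.3, 19: 23.6}


def to_interval(raw_score, conversion_table):
    """Look up the interval-level score for one ordinal raw score."""
    if raw_score not in conversion_table:
        raise ValueError(f"raw score {raw_score!r} is not in the conversion table")
    return conversion_table[raw_score]


def convert_scores(raw_scores, conversion_table):
    """Convert a list of ordinal raw scores using the same table."""
    return [to_interval(score, conversion_table) for score in raw_scores]


converted = convert_scores([17, 19], PLACEHOLDER_TABLE)
```

Raising an error for a score missing from the table, rather than interpolating, is a deliberate choice in this sketch so that out-of-range or mis-scored values are noticed instead of silently converted.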
The findings from this study have significant implications for research in the field of child development and developmental psychology. Parenting beliefs about infant care can potentially influence parental behaviours, which can in turn impact child development. Therefore, having a reliable and valid measure of parenting beliefs is crucial for understanding the complex interactions between parenting beliefs and behaviours and child outcomes. After establishing scale invariance, we have confirmed that NZ parents score significantly higher on attunement beliefs compared to the UK sample, while there were no significant differences on structure beliefs between countries. This finding may be due to demographic differences in the samples as there were significantly more first-time mothers in the UK sample. Previous research indicates that experienced mothers report stronger beliefs in attunement compared to first-time parents (Mascheroni et al., 2022).
The main strength of this study was the application of robust Rasch methodology to evaluate and enhance the psychometric properties of the structure and attunement scales of the BCQ using randomly selected subsamples from two countries (UK and NZ), with a combined sample size (n = 450) that satisfied the optimal sample size for Rasch analyses to minimise Type I and Type II errors (Hagell & Westergren, 2016). In addition, the initial analyses detected that several items had DIF issues by personal factors (i.e., mother age, infant age, and infant gender), and especially by country. However, this study found that the modified structure and attunement scales of the BCQ are invariant, working equally well across personal factors (i.e., mother age, infant age, and infant gender), as well as across NZ and UK mothers, which had not been investigated in previous studies. Scale invariance (no DIF) refers to the property of a measurement tool that ensures that the relationships between the items on the scale are consistent across different groups or populations. In other words, it ensures that the same construct is being measured in the same way across different groups, regardless of their culture, language, or background. When a measurement tool is invariant across groups, it means that the scores obtained from different groups can be compared and interpreted meaningfully. Moreover, in our analysis, we observed that interval-transformed Rasch scores showed a smaller effect size compared to ordinal raw scores. This observation reflects the impact of Rasch transformation on the standard deviation component of effect size calculation, rather than indicating a direct reduction in measurement error. Rasch transformation, by converting ordinal data to an interval scale, affects the distribution and variability of scores, which in turn influences the effect size. This transformation enhances the interpretability and linearity of the scores but does not inherently imply a reduction in measurement error.
We noted that there were significant differences in both ordinal raw scores and interval-transformed Rasch attunement scores between the original UK and NZ samples, with interval-transformed attunement scores showing a smaller effect size than the ordinal raw scores. However, there were no significant differences in either ordinal raw scores or interval-transformed Rasch structure scores between the original UK and NZ samples, with interval-transformed structure scores showing a slightly lower effect size than the ordinal raw scores. These findings may indicate that structure is more similar across cultures (UK vs NZ) compared to attunement, where cultural differences may affect the accuracy of the ordinal assessment scale. However, invariance was established by Rasch modifications for both scales, which is reflected by differences in effect sizes between ordinal and Rasch scores for attunement but not for structure.
When we compare parenting beliefs about structure and attunement between different countries, we need to ensure that the measurement tool (e.g., the BCQ) is scale-invariant, so that we can trust that any differences between countries are real and not due to measurement bias. Even if the mean scores between countries are significantly different, we cannot conclude that there is a real difference unless we can demonstrate that the measurement tool is invariant across countries, which we achieved in this study. For example, suppose we are comparing the parenting beliefs of parents from two different countries (UK and NZ) using a scale that has been shown to be invariant across these cultures. If the mean score for parents from the UK is significantly higher than that of parents from NZ, we can interpret this difference as reflecting a real difference in parenting beliefs between the participants from the two samples. As acknowledged earlier, there were more experienced mothers in the NZ sample, potentially accounting for the higher attunement scores in the NZ sample. Therefore, the country difference observed in this study may not reflect a real difference between NZ and UK mothers at large. However, if the scale is not invariant across cultures and DIF occurs, then any differences in mean scores may be due to measurement bias rather than actual differences in parenting beliefs between samples or countries. Scale invariance is essential when comparing parenting beliefs or any other constructs across different cultures or populations. Only when we can demonstrate that the measurement tool is invariant across cultures can we reliably compare mean scores between groups and draw meaningful conclusions about differences in parenting beliefs.
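The effect of DIF on group comparisons can be sketched with a toy calculation. The example below uses the dichotomous Rasch model as a simplification (this study applied the Partial Credit model), and all item difficulties and the person location are made-up values. Two respondents with the same underlying level of the trait receive different expected raw scores solely because one item is harder to endorse in one group.

```python
from math import exp


def rasch_probability(theta, difficulty):
    """Probability of endorsing an item under the dichotomous Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - difficulty)))


def expected_raw_score(theta, difficulties):
    """Expected total score: sum of endorsement probabilities across items."""
    return sum(rasch_probability(theta, b) for b in difficulties)


THETA = 0.0  # same trait level in both groups (hypothetical)

# Hypothetical item difficulties; the third item shows DIF in group B.
group_a_items = [-1.0, 0.0, 1.0]
group_b_items = [-1.0, 0.0, 2.0]

score_a = expected_raw_score(THETA, group_a_items)
score_b = expected_raw_score(THETA, group_b_items)
bias = score_a - score_b  # raw-score gap produced by DIF alone
```

Here `bias` is positive even though both respondents share the same trait level, which is the kind of spurious group difference that invariance testing is meant to rule out.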
This study is not without limitations. Participants in this study came from convenience samples in two countries and may not be representative of mothers within those two countries. In addition, there are other dimensions that might be relevant to parenting beliefs, besides parent and infant age and infant gender. Therefore, replications of this study should be conducted in samples from other English-speaking countries (and possibly involving translations to other languages), and should consider other potentially relevant personal factors, such as differences among cultures, parenting experience, levels of educational attainment, partnership status of parents, and gender of parents.
In conclusion, the findings of this study demonstrated the reliability and internal validity of both scales of the BCQ in measuring parenting beliefs about infant care. The minor modifications implemented in the scoring of both scales of the BCQ satisfied expectations of the unidimensional Rasch model and scale invariance across different country samples, infant and parent age, and infant gender, and resolved local dependency issues. This allowed us to enhance the accuracy of each scale to approximate interval-level scores using the ordinal-to-interval conversion tables. Researchers can use the conversion tables published here to enhance the precision of scores on the structure and attunement scales of the BCQ without changing their original questionnaire format. Overall, the findings from this study contribute to the enhancement of precision in measuring parenting beliefs about infant care, which is important for understanding the complex interactions between parents and their infants and for developing interventions to promote positive parenting practices and support healthy child development.
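In practice, applying a conversion table amounts to a lookup from each ordinal total score to its interval-level equivalent. The following is a minimal sketch under that assumption. The values in `EXAMPLE_TABLE` are made-up placeholders, not entries from Table 5 or Table 6, and the function names are hypothetical.

```python
# Placeholder values for illustration only; these are NOT the published
# Table 5 or Table 6 entries and must be replaced before any real use.
EXAMPLE_TABLE = {0: 0.0, 1: 2.1, 2: 3.4, 3: 4.3, 4: 5.2, 5: 6.5}


def ordinal_to_interval(raw_score, table):
    """Look up the interval-level score for one ordinal total score."""
    if raw_score not in table:
        raise ValueError(f"raw score {raw_score!r} is not in the conversion table")
    return table[raw_score]


def convert_all(raw_scores, table):
    """Convert a list of ordinal total scores using the same table."""
    return [ordinal_to_interval(score, table) for score in raw_scores]


converted = convert_all([0, 3, 5], EXAMPLE_TABLE)
```

Raising an error for scores outside the table, rather than interpolating, keeps the sketch from silently producing values the published tables do not define.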
residual correlations above 0.20. In addition, many structure items showed DIF, especially DIF by country (i.e., items S1, S4, S5, S9, S11, S13, S14, and S16). This means that these structure items did not work equally well in the UK and NZ subsamples. Local dependency between items affects the model fit and may impact on DIF or scale invariance (Christensen et al., 2013). Several studies have demonstrated the impact of local dependency on Rasch model fit and DIF. For example, Adams and Wu (2002) showed that local dependency can lead to inflated estimates of item and person parameters, and can result in spurious DIF. Similarly, Linacre (2006) demonstrated that local dependency can lead to biased item difficulty estimates and incorrect classification of persons. To resolve local dependency issues, three testlets were created: Testlet 1 (items S3, S4, S5, S9, S10, & S11); Testlet 2 (items S2, S8, & S16); and Testlet 3 (items S1, S6, S7, S12, S13, S14, S15, & S17).
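The two steps described here, flagging item pairs whose residuals correlate above 0.20 and combining dependent items into testlets, can be sketched as follows. The testlet membership is taken from the text. The function names, the input layout, and the assumption that a testlet score is the simple sum of its member item scores are illustrative and may not match the software used in this study.

```python
from itertools import combinations
from math import sqrt

# Testlet membership for the structure scale, as listed in the text.
STRUCTURE_TESTLETS = {
    "testlet_1": ["S3", "S4", "S5", "S9", "S10", "S11"],
    "testlet_2": ["S2", "S8", "S16"],
    "testlet_3": ["S1", "S6", "S7", "S12", "S13", "S14", "S15", "S17"],
}


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_dependent_pairs(residuals, threshold=0.20):
    """Return item pairs whose residual correlation exceeds the threshold.

    residuals: dict mapping item name -> list of residuals, one per person.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(residuals), 2)
        if pearson(residuals[a], residuals[b]) > threshold
    ]


def score_testlets(responses, testlets=STRUCTURE_TESTLETS):
    """Sum one respondent's item scores within each testlet (assumed scoring)."""
    return {
        name: sum(responses[item] for item in items)
        for name, items in testlets.items()
    }
```

For instance, a respondent who scores 1 on every item from S1 to S17 would receive testlet scores of 6, 3, and 8 under this summing assumption.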

Fig. 2 Person-item threshold distribution of the model fit analysis of the structure scale of the BCQ

Fig. 3 Person-item threshold distribution of the model fit analysis of the attunement scale of the BCQ

Table 1
Demographic details of participants for the Rasch sample and the UK and NZ random subsamples, including their statistical comparisons using Chi-square tests (p-values) for categorical variables and t-tests (p-values) for continuous variables
a Statistical comparison tests conducted between UK and NZ random subsamples only

small gift such as a baby toy or t-shirt. Out of the UK sample, the 468 mothers who completed the BCQ were 18-46 years old with a mean of 31.02 years (SD = 5.36). The NZ sample included 792 individuals who started a large online survey focusing on parenting around infant sleep. NZ participants were primary caregivers of an infant under two years old and were recruited via social media posts and links disseminated in news media stories. Those who were non-NZ residents (n = 40) or non-mothers (fathers, grandparents, and other caregivers, n = 12) were omitted from the sample. Out of the remaining NZ participants, 650 resident mothers who completed the BCQ were aged 19-42 with a mean age of 31.43 (SD = 4.95). All study procedures were reviewed and approved by the authors' institutional research ethics committee, in line with internationally accepted ethical standards. All participants provided informed consent.

Table 3
Rasch model fit statistics of item-fit residuals for the initial analysis of the individual items in the structure and attunement scales of the BCQ

Table 2
Summary of fit statistics for the model fit Rasch analyses of the structure of the BCQ
*Significant misfit to the Rasch model

Table 4
Rasch model fit statistics of item-fit residuals for the final analysis of the testlets in the structure and attunement scales of the BCQ

Table 5
Converting ordinal scores into interval-level scores for the structure scale of the BCQ

Table 6
Converting ordinal scores into interval-level scores for the attunement scale of the BCQ

Table 7
Statistical comparisons between the original UK (n = 468) and NZ (n = 650) samples on ordinal raw scores and interval-transformed scores of the BCQ scales
a Cohen's d