Infants develop socioemotional and cognitive skills through attuned interactions with their caregivers (Evans & Porter, 2009; Kim et al., 2017; Leerkes et al., 2009). Caregiver mental health problems, such as depression and/or anxiety, affect both caregiver-child interactions in terms of increased caregiver intrusiveness (meta-analysis by Lovejoy et al., 2000), infant withdrawal (Braarud et al., 2013; Feldman et al., 2009; Smith-Nielsen et al., 2019), lowered maternal sensitivity (meta-analysis by Bernard et al., 2018), and aspects of long-term child psycho-social development, e.g. cognitive functions and socioemotional development (Murray et al., 2010). This suggests that poorer caregiver-infant interaction quality may be one of the routes for transmission of risk in general from parent to infant (Erickson et al., 2019; Stanley et al., 2004). Early detection of infants at risk with valid instruments for assessing caregiver-child interaction quality is therefore crucial. Although many instruments exist, systematic reviews of their psychometric properties concluded that there was a need for an examination of the theoretical dimensions assumed to underlie the items as the majority of the instruments lacked structural or factorial validity (Gridley et al., 2019; Lotzin et al., 2015). Therefore, the aim of the present study is to examine the construct validity of the Coding Interactive Behaviour (CIB; Feldman, 1998), widely used for assessing caregiver-infant interaction. In addition, we examine the effect of maternal depression and co-morbid anxiety on the interaction quality.

Assessment of Caregiver-Child Interactions

It has been proposed that caregiver-child interaction quality is best assessed using observational tools because they are less susceptible to the biases associated with parental self-report, e.g. understanding, recall/memory bias, and social desirability (Lotzin et al., 2015). Accordingly, attachment and developmental psychology researchers have a long tradition of observing and assessing caregiver-infant interactions (e.g. Ainsworth, 1967; Cohn & Tronick, 1988). For decades, children and their caregivers have been observed in their homes and in the lab, playing freely or in structured tasks, over the course of several minutes up to hours (for reviews see Gridley et al., 2019; Lotzin et al., 2015). Typically, interactions are video recorded and coded using one of 24 available coding systems for interactive quality (Lotzin et al., 2015). Lotzin et al. (2015) conclude in their systematic review that most tools demonstrate a valid rating procedure, reproducibility, and discriminant validity but lack factorial and predictive validity, meaning the internal structure of the measurement and its capacity to predict later outcomes from caregiver-infant interactions are less established. They recommend that future studies improve the quality of caregiver-infant interaction research by examining the theoretically assumed dimensionality of the tools using factor analysis (Lotzin et al., 2015). A more recent systematic review of observational measures of interaction quality commonly used in randomised controlled trials (RCTs) concludes that for younger children (age 0–3 years), the evidence for validity and reliability was scarce and weak (Gridley et al., 2019). The authors argue that this lack of validity evidence is a severe limitation when assessing change in interaction quality in RCTs, as we cannot be sure that the measurement is robust enough to measure the same construct over time.

The assumed theoretical dimensions underlying a range of the widely used observational tools have also been questioned. Mesman and Emmen (2013) systematically reviewed observational tools measuring parental sensitivity and compared these to Ainsworth et al.’s (1978) definition of sensitivity, i.e. the ability to (1) notice the child signals, (2) interpret these signals correctly, and (3) respond promptly and appropriately. While the reviewed tools include scales referring to the most salient behaviours from Ainsworth’s definition, many also include parental positive affect, warmth, and affection in their sensitivity composites. Mesman and Emmen (2013) argue that parental sensitivity and positive affect are two related but distinct dimensions of parenting and that in a subgroup of parents, high levels of positive affect are related to extreme intrusiveness rather than sensitivity. Hence, from a methodological as well as a theoretical point of view, factor analyses examining the theoretical dimensions underlying observational tools are needed.

The CIB is a widely used instrument for assessing caregiver-infant and caregiver-child interaction quality, has been used in many countries (e.g. Denmark, France, United States of America, Germany, Brazil; Feldman, 2012), and it has been shown to capture differences in parent-child interactions related to child biological risk (Feldman et al., 2002), child socio-emotional risk (Dollberg et al., 2006), caregiver psychological risk (Feldman et al., 2009), and change following intervention (Ferber et al., 2005). For a review of the results from studies using the CIB, see Feldman (2012). The CIB consists of 33 items; 18 related to the parent’s behaviour (e.g. parental acknowledging), eight to the child’s behaviour (e.g. child initiation), five to the dyad (e.g. dyadic reciprocity), and two represent the lead-lag of the interaction, i.e. child-led versus parent-led (Feldman, 1998). These items mainly focus on the global nature and flow of the interaction (i.e. the involvement and individual style of the two participants in the dyad). Scores reflect the reciprocity and adaptation as well as the affectivity and attention from the partners (Feldman, 1998). The items are coded and then aggregated into constructs based on theory/research in early social development, e.g. sensitivity.

Despite the frequent use of the CIB, to the best of our knowledge, only two previous studies have examined its factor structure. The factor structure proposed by Feldman (see Table 1; hereafter referred to as the original theoretical model), the developer of the CIB, was confirmed in a study of 483 caregiver-child dyads (cited in Feldman, 2012). A recent but smaller study of 52 mothers, 41 fathers, and their 5-year-old children in Denmark failed to confirm the theoretically assumed factor structure for the parental composites (Steenhoff et al., 2019). Using exploratory factor analysis, this study found different factor structures underlying mother-child and father-child interactions. The factor structure underlying mother-child interaction quality resembled the original theoretical model in the sense that a sensitivity and an intrusiveness composite were identified, but the items included in these composites overlapped with, and also differed from, those of the original theoretical model. These findings might not be surprising since Feldman (2012) argues that while some items of the CIB are to be considered core items (see Table 1) and expected to be included in specific composites across contexts, other items may be sensitive to the specific context, e.g. culture or child age. This stresses the importance of examining the factor structure underlying caregiver-infant interactions across cultures and child ages.

Table 1 Original theoretical model

Studies using the CIB generally refer to the composites proposed in the original theoretical model, yet with differences in relation to the remaining items included in the composites (e.g. Cordes et al., 2017; Dollberg et al., 2010; Egmose et al., 2017; Feldman et al., 2002, 2009; Ferber et al., 2005; van Huisstede et al., 2019). The composites included in the studies using the CIB generally have high Cronbach’s alpha values (≥0.80), suggesting high internal consistency between items in the composites but, apart from this, studies do not typically detail how decisions were made in relation to which items to include in or exclude from the composites. However, forming composites based on measures of internal consistency is problematic as it only informs on the composite’s reliability, i.e. the items measure something consistently, and not its validity, i.e. the items measure what they are intended to (Tavakol & Dennick, 2011). One of the most basic assumptions in scale validation is unidimensionality, i.e. items measure a single latent construct. Though necessary to establish unidimensionality, high levels of internal consistency are not sufficient (Tavakol & Dennick, 2011). Therefore, it is possible that a composite with high levels of internal consistency captures two distinct but related constructs, e.g. parental positive affect and sensitivity. When the multidimensionality of a composite is not recognised, it can potentially limit our understanding of caregiver-child interactions and their effect on child development, especially when two dimensions show differential predictive associations with child development (Mesman & Emmen, 2013). For example, Davidov and Grusec (2006) found that parental sensitivity to distress was associated with child regulation of negative affect, but parental warmth was associated with child regulation of positive affect. 
Thus, examination of the factor structure underlying observational tools, such as the CIB, has important implications for research and theory.
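The point that high internal consistency does not establish unidimensionality can be illustrated with a small numerical sketch. The scores below are invented toy data for eight hypothetical dyads, not CIB data, and the items within each dimension are made identical purely to keep the arithmetic transparent. Two dimensions that correlate only moderately still yield a high Cronbach's alpha when their items are pooled into one composite.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns (one list of scores per item)."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Invented toy scores (1-5) for eight hypothetical dyads on two dimensions.
affect = [1, 1, 2, 2, 4, 4, 5, 5]
sensitivity = [1, 2, 2, 5, 1, 4, 4, 5]

# A pooled composite: three items tapping each dimension (identical by design).
composite = [affect] * 3 + [sensitivity] * 3

r = pearson_r(affect, sensitivity)
alpha = cronbach_alpha(composite)
print(f"correlation between dimensions: {r:.2f}")
print(f"alpha of the pooled composite:  {alpha:.2f}")
```

In this toy case the pooled composite exceeds the conventional 0.80 threshold even though the two dimensions share only a quarter of their variance, which is why alpha alone cannot justify treating a composite as a single construct.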

Effects of Postnatal Depression and Anxiety on Mother-Infant Interactions

Postnatal depression (PND) and anxiety (PNA) are among the most common psychiatric conditions in the postnatal period and in the population in general. In the postnatal period, meta-analytic evidence shows that about 13% of mothers experience depression (Stephens et al., 2016), while 15% experience high levels of anxiety and 9.9% fulfil criteria for an anxiety disorder (Dennis et al., 2017). Another meta-analysis shows that a co-morbid anxiety and depression diagnosis is present in 9.3% of mothers (Falah-Hassani et al., 2017).

Research on the effects of PND and PNA on different aspects of mother-infant interactions has yielded mixed results. To our knowledge, only one study has used the CIB to measure the interaction quality in mothers with PND and PNA. Feldman et al. (2009) used the CIB at 9 months postpartum and compared mothers with PND (n = 22), PNA (n = 19), and matched controls (n = 59). Anxiety and depression status were assessed using self-report questionnaires (the State-Trait Anxiety Inventory; Spielberger et al., 1970, and the Beck Depression Inventory; Beck, 1978, respectively). They found that the PNA group was more intrusive compared to the other two groups. In contrast, mothers suffering from PND were more withdrawn and less sensitive compared to both the anxious mothers and the control group. Studies using other measurements of mother-infant interaction quality have also found a negative effect of PND and PNA on interaction. Nath et al. (2019) found that only PND was associated with lower maternal sensitivity while PNA was not, neither in itself nor as comorbid with depression. In contrast, Neri et al. (2015) found that PNA was associated with lower sensitivity, PND with lower maternal affect, and a combination of the two was associated with lower infant engagement. Finally, Crugnola et al. (2016) found that only anxiety, and not depression, was associated with more negative infant affect and less positive maternal affect. In conclusion, previous research consistently shows that mother-infant interactions may be negatively affected when mothers suffer from PND and/or PNA, with the interactions being characterised as less sensitive, mothers being more withdrawn and intrusive, as well as by more negative affect in either the infant, the mother, or both. At the same time, maternal anxiety and depression have been inconsistently associated with various negative aspects of the mother-child interaction across studies.

The Present Study

The purpose of the present study is twofold: (a) to examine the factor structure of the CIB and (b) to examine the effect of maternal depression and co-morbid anxiety on the derived factors.

We examine the factor structure of the CIB in an at-risk population with mothers and their infants aged 2–6 months. Here, at-risk is defined as a high probability of the mothers having PND (see the Methods section for a further description). We use a combination of exploratory and confirmatory approaches to find the best fitting factor structure. The exploratory approaches are useful when the aim is to uncover complex patterns in the data that can then be hypothesis tested using confirmatory analyses (Yong & Pearce, 2013). We compare the original model developed by Feldman (1998) with an alternative theoretical model (see the Methods section for how we developed it) and a data-driven model that is based on an exploratory factor analysis.

Finally, we examine the extent to which the best fitting model shows measurement invariance. Ensuring invariance is a prerequisite for investigating possible group differences, because otherwise one cannot be sure whether significant differences are due to an actual difference between the populations or to the measurement functioning differently (Putnick & Bornstein, 2016). Measurement invariance is especially important concerning psychological constructs that may have a different meaning in different clinical groups (Putnick & Bornstein, 2016). We examine measurement invariance for mothers with and without a PND diagnosis.

Our second objective is to examine if and how a PND diagnosis and PNA caseness are associated with mother-infant interaction quality as compared with subclinical levels of depression and co-morbid anxiety. We use the composites identified in the best fitting model to measure mother-infant interaction quality. We expect that (a) maternal PND and (b) maternal PNA caseness status are associated with less optimal scores on the CIB composites and (c) co-morbid clinical PND and PNA caseness is associated with less optimal scores on the CIB composites compared to a PND diagnosis alone, PNA caseness alone, and no PND diagnosis or PNA caseness.

Materials and Methods

Participants and Procedures

The current study’s protocol was pre-registered at https://osf.io/uw7pk. The sample for this study is part of a larger RCT, the Copenhagen Infant Mental Health Project (CIMHP; Væver et al., 2016). Participants were recruited by a public health visitor during routine home visits (more than 90% of all Danish families receive home visits from a health visitor during the infant’s first year; Pant & Pedersen, 2019). The health visitors routinely screen the mothers for depression using the Danish translation of the Edinburgh Postnatal Depression Scale (EPDS; Cox et al., 1987) around 2 months postpartum and possibly at later time points depending on the health visitor’s clinical judgement. Mothers were invited to participate in the study if they scored 10 or above on the EPDS. The family was then offered a home visit by a clinical psychologist from the research team. During this home visit, written informed consent was obtained from the mother and, in cases of shared custody of the child, from the partner, approving the infant’s participation. Furthermore, a mother-infant interaction was recorded (i.e. 5 min of free play), and the clinical psychologist conducted a diagnostic interview to assess depression. An online survey was also sent to the mothers to be filled out at the time of the home visit. Inclusion criteria for the current study were: (1) EPDS score ≥ 10, (2) mother’s age ≥ 18, (3) mother spoke and understood Danish, and (4) infant age < 6 months since the instrument used for coding mother-infant interactions changes at 6 months (cf. Feldman, 1998). 
Exclusion criteria were: (1) infant had a severe medical condition and known autism or early retardation, (2) extremely premature birth (gestational age < 28 weeks), (3) maternal bipolar disorder/psychotic disorder and known severe intellectual impairment, (4) mother attempted suicide during pregnancy or postnatally, (5) mother presented with alcohol/substance abuse, and (6) the recording of mother-infant interaction was missing. The sample in this study consists of 419 mother-infant dyads. The project was approved by the local ethical review board (Approval number: 2015-10).

Measures

Maternal depression

Maternal depression was assessed using the Structured Clinical Interview for DSM-5 disorders – Research Version (SCID-5-RV; First et al., 2015), a semi-structured interview used to assess DSM-5 diagnoses. A trained research psychologist, routinely supervised, administered the interview at the home visit. In the present study, we use the interview to assess the presence of a current major depressive episode in the mothers. To ensure inter-rater reliability, a randomly selected subset (22%) was rated by a certified SCID-5 interviewer blind to diagnostic status. Inter-rater reliability was strong (Cohen’s κ = 0.89), as reported by Smith-Nielsen et al. (2018).

Maternal anxiety

Maternal anxiety was measured using the anxiety scale of the Hopkins Symptom Check List (SCL-92; Olsen et al., 2004). The SCL-92 is a self-report symptom inventory consisting of 92 items rated on a 5-point Likert scale, ranging from 0 (“Not at all”) to 4 (“Extremely”). The measure assesses current psychological distress in the timeframe of the past week. In the present study, we use the cut-off score for anxiety caseness (1.15 for the anxiety subscale in a Danish sample; Olsen et al., 2006) that has been established using national norms for adult women. We created a binary variable dividing the mothers into those with clinical levels of anxiety and those without. The internal consistency of this subscale was good in this sample (Cronbach’s α = 0.87).
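The construction of the binary anxiety variable can be sketched as follows. This is a minimal illustration under two assumptions that the text does not state explicitly: that the subscale score is the mean of the item ratings, and that a score equal to the cut-off counts as a case. The function names and the example ratings are hypothetical.

```python
ANXIETY_CUTOFF = 1.15  # cut-off for the anxiety subscale cited in the text


def anxiety_subscale_score(item_ratings):
    """Mean of the anxiety item ratings (assumed scoring rule), each rated 0-4."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    if any(rating not in range(5) for rating in item_ratings):
        raise ValueError("ratings must be integers from 0 to 4")
    return sum(item_ratings) / len(item_ratings)


def is_anxiety_case(item_ratings, cutoff=ANXIETY_CUTOFF):
    """True when the subscale score reaches the cut-off (inclusive by assumption)."""
    return anxiety_subscale_score(item_ratings) >= cutoff


# Hypothetical ratings for two mothers.
mother_a = [1, 1, 2, 1, 1, 1, 2, 1, 1, 1]  # mean 1.2
mother_b = [0, 1, 1, 0, 2, 1, 0, 1, 1, 0]  # mean 0.7
print(is_anxiety_case(mother_a), is_anxiety_case(mother_b))
```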

Mother-infant interaction quality

Mother-infant interaction quality was assessed using the Coding Interactive Behaviour (CIB; Feldman, 1998). CIB is a global rating system for social interactions, coded during 5 min of mother-infant interaction (i.e. a free play situation) assessed during the home visit. The measure consists of 33 items; 18 items relate to the parent’s behaviour (e.g. ‘Acknowledging’), eight to the child (e.g. ‘Initiation’), five to the dyad (e.g. ‘Dyadic Reciprocity’), and two represent the lead-lag of the interaction, i.e. ‘Child-Led’ and ‘Parent-Led’. All items are rated on a scale ranging from 1 (representing a minimal level of the attitude/behaviour) to 5 (representing a high level) with half-point increases.

According to criteria specified by Feldman’s research team (M. Singer, personal communication, December 16, 2014), all coders were trained to attain a minimum of 85% agreement at item level on 13 videos coded by Feldman’s research team. On each item, a score was considered correct when it did not deviate by more than 1 point from the original score. For the present study, six coders were involved in the coding process. One coder (IE) was considered the main coder, and her codings served as the ‘gold standard’, so that 20% of each of the other coders’ codings were compared to her codings of the same videos. The coders were all trained to reliability and afterwards had varying levels of experience in terms of using the CIB and observational methods more broadly, but the main coder was highly experienced both in relation to the CIB and other observational methods. The coders had no prior knowledge of the dyads and were blind to maternal depression and anxiety status. Reliability values were calculated for the entire CIMHP sample. In total, 21.6% of the videos were coded for reliability, with an inter-rater agreement of 91.7%. Inter-rater reliability was calculated based on an absolute agreement one-way random effects model (Koo & Li, 2016). This showed an excellent inter-rater reliability (ICC = 0.97, 95% CI [0.97; 0.97]). Cohen’s kappa indicated almost perfect reliability, κ = 0.91.
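The item-level agreement rule described above (a score counts as correct when it lies within 1 point of the reference score, with 85% agreement required during training) can be sketched as below. The function names and the example scores are hypothetical and serve only to illustrate the rule; this is not the procedure used to compute the ICC or kappa values.

```python
def percent_agreement(coder_scores, reference_scores, tolerance=1.0):
    """Share of items (in %) on which a coder is within `tolerance` of the reference."""
    if len(coder_scores) != len(reference_scores):
        raise ValueError("both score lists must cover the same items")
    hits = sum(
        abs(coder - reference) <= tolerance
        for coder, reference in zip(coder_scores, reference_scores)
    )
    return 100.0 * hits / len(reference_scores)


def meets_training_criterion(coder_scores, reference_scores, minimum=85.0):
    """True when agreement reaches the 85% training threshold named in the text."""
    return percent_agreement(coder_scores, reference_scores) >= minimum


# Hypothetical ratings of four items on the 1-5 scale with half-point steps.
trainee = [1.0, 2.5, 4.0, 5.0]
reference = [1.5, 4.0, 3.0, 5.0]
print(percent_agreement(trainee, reference))
```

Here the second item deviates by 1.5 points, so three of the four items agree and the trainee would not yet meet the criterion on this toy example.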

Statistical Analyses

We developed an alternative theoretical model, as previous studies from related disciplines have shown that it can be difficult to confirm models derived from exploratory approaches in new settings and that testing different theoretical models could provide a more rational route for validation of instruments (Kozinszky et al., 2017). The alternative theoretical model (see Table 2) was adjusted according to the age of the infants in our study (2–6 months), adding items to the composites comparable to previous studies (e.g. Egmose et al., 2017; Feldman et al., 2009) and theoretical considerations (Mesman & Emmen, 2013). In the alternative model, the sensitivity composite is more aligned with Ainsworth’s definition of sensitivity. To accomplish this, we added the items ‘Consistency of Style’, ‘Adaptation-Regulation’, and ‘Infant-Led’ in order to stress the importance of the mother following the child’s signals instead of a fixed set of parenting behaviours (Ainsworth et al., 1978; Mesman & Emmen, 2013). Further, we propose the presence of a composite measuring the degree to which the mother is affectively withdrawn or positively engaged with her infant (Feldman et al., 2009) which is also in line with Mesman and Emmen’s (2013) point about separating sensitivity and affect into distinct parenting dimensions. We also removed the ‘Forcing’ and ‘Parent Anxiety’ scales from the intrusiveness composite. Forcing or physical manipulation is a frequent behaviour when parents interact with young infants and is therefore not necessarily intrusive at this age (e.g. the mother physically moves the infant who is expressing discomfort). According to the CIB manual, ‘Parent Anxiety’ can be expressed in different ways, both as unpredictable enthusiasm, which often is intrusive, and as long periods of silence and/or frequent looks to the observer, which is not necessarily intrusive. 
Finally, we have collapsed the two child composites into one composite reflecting the degree of the child’s involvement in the interaction, since high scores on ‘Negative Emotionality’ and ‘Child Withdrawal’ are two different ways of expressing non-engagement (Feldman, 1998).

Table 2 Alternative theoretical model

To compare the three models (i.e. the two theoretical models and the data-driven model) and test the best-fitting one for measurement invariance, SPSS Statistics 27 (IBM, Chicago, IL) was used for the exploratory factor analyses (EFA), and the lavaan package in R (Rosseel, 2012) was used for the confirmatory factor analyses (CFA) and the multi-group confirmatory factor analyses (MGCFA). First, we conducted the EFA on 30% of our sample (N_EFA = 124), randomly selected within depression strata (i.e. whether or not the mothers had a PND diagnosis). The Kaiser-Meyer-Olkin measure of sampling adequacy (KMO = 0.89) indicated that this sample size was adequate (‘meritorious’; Kaiser, 1970, 1974). The extraction method was maximum likelihood. Following Costello and Osborne’s (2005) recommendation, multiple EFAs were run with a fixed number of factors to be extracted based on the inflection point on the scree plot. The oblique rotation method, direct oblimin, was used to improve interpretation. The EFAs were compared to determine the best-fitting model (i.e. item loadings above 0.30, few item crossloadings with a difference of <0.2 between primary and secondary loading, no factors with fewer than three items) for use in the CFAs. Second, the CFAs were conducted on the remaining 70% of the sample (N_CFA = 294, n_PND = 173, n_non-depressed = 121) using maximum likelihood estimation in order to compare the three models (i.e. the EFA-based, the original theoretical, and the alternative theoretical model). As some of the items were not normally distributed, thus violating the assumptions of maximum likelihood estimation, we also ran the CFAs using maximum likelihood estimation with robust standard errors (MLR). However, as these analyses were not pre-registered, their results can only be interpreted as exploratory. Factors with fewer than three items were excluded due to their weak and unstable nature (cf. Costello & Osborne, 2005), and we expected all composites to correlate with each other. 
The fit indices used for the CFA were the χ2-statistic (χ2/df), the comparative fit index (CFI), and the root mean square error of approximation (RMSEA). For a model to have an acceptable fit, χ2/df should be ≤3, CFI ≥ 0.90, and RMSEA < 0.08 (Brown, 2015; Shek & Yu, 2014). If acceptable model fit was not observed, we would allow for theoretically meaningful error correlations based on similarities or reversed items (cf. Brown & Moore, 2012). When comparing the three models, the model with the lowest Akaike information criterion (AIC) and Bayesian information criterion (BIC) is considered the best-fitting model and will be used in subsequent analyses. To compare the nested models (i.e. models with and without error correlations), we ran likelihood ratio tests to investigate whether the model with error correlations was significantly better than the standard model. However, since these analyses were not pre-registered, the results can only be interpreted as exploratory findings.
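The three fit thresholds stated above can be expressed as a simple check. The sketch below is illustrative only: the class and function names are hypothetical, and the numerical values are invented rather than taken from Table 5.

```python
from dataclasses import dataclass


@dataclass
class FitIndices:
    chi2: float
    df: int
    cfi: float
    rmsea: float


def check_fit(fit):
    """Evaluate each criterion from the text: chi2/df <= 3, CFI >= 0.90, RMSEA < 0.08."""
    return {
        "chi2/df": fit.chi2 / fit.df <= 3,
        "CFI": fit.cfi >= 0.90,
        "RMSEA": fit.rmsea < 0.08,
    }


def has_acceptable_fit(fit):
    """A model is acceptable only when all three criteria are met."""
    return all(check_fit(fit).values())


# Invented values for two hypothetical models.
model_x = FitIndices(chi2=540.0, df=200, cfi=0.91, rmsea=0.076)
model_y = FitIndices(chi2=900.0, df=200, cfi=0.84, rmsea=0.110)
print(check_fit(model_x), has_acceptable_fit(model_y))
```

Returning the per-index result, rather than a single verdict, makes it easy to report a model that fits on the majority of the parameters but not on all of them.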

Third, the MGCFAs were conducted on the entire sample (N = 418) using maximum likelihood estimation in order to investigate measurement invariance for the best-fitting model in the two independent groups (mothers with and without PND). As with the CFAs, we also ran the MGCFA using MLR as an exploratory analysis, as these analyses were not pre-registered. Widaman and Reise (1997) identified four main steps for testing measurement invariance: configural (the model is organised equivalently in the different groups), metric (each item contributes to the latent construct equivalently across groups), scalar (equivalent item intercepts, meaning that mean differences in the latent construct capture all mean differences in the items’ shared variance), and residual (the sum of shared and error variance is equivalent across the groups). Following Putnick and Bornstein (2016), the baseline unconstrained model was compared to a weak-invariance model (constrained factor loadings) to test for metric invariance. This weak-invariance model was then compared to a strong-invariance model (constrained factor loadings and item intercepts) to test for scalar invariance. Finally, this model was compared to a strict-invariance model (constrained factor loadings, intercepts, and residual variances) to test for residual invariance. Invariance was determined based on the criteria that the p value for Δχ2 should be >0.05, and ΔCFI should be equal to or below 0.01 (Cheung & Rensvold, 2002).
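The stepwise decision rule for measurement invariance can be sketched as follows. The sketch assumes that configural invariance has already been judged from the fit of the unconstrained model, that both criteria must hold at each step, and that testing stops at the first step that fails. The function name and the p values and ΔCFI values in the example are hypothetical.

```python
STEPS = ("metric", "scalar", "residual")


def highest_invariance_level(comparisons, alpha=0.05, max_delta_cfi=0.01):
    """Return the highest invariance level supported by successive model comparisons.

    `comparisons` maps each step to a tuple (p value of delta chi2, delta CFI)
    for the comparison with the preceding, less constrained model.
    """
    reached = "configural"  # assumed to be established beforehand
    for step in STEPS:
        p_value, delta_cfi = comparisons[step]
        if p_value > alpha and abs(delta_cfi) <= max_delta_cfi:
            reached = step
        else:
            break  # later steps are not interpreted once a step fails
    return reached


# Hypothetical comparison results for illustration.
example = {
    "metric": (0.40, 0.001),
    "scalar": (0.12, 0.004),
    "residual": (0.01, 0.006),
}
print(highest_invariance_level(example))
```

In the example the residual step fails on the Δχ2 criterion although ΔCFI is below the threshold, so scalar invariance is the highest level supported.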

To test main and interaction effects of PND and PNA on mother-infant interactions as measured with the best-fitting model, factorial ANCOVAs were used. These analyses were performed using SPSS Statistics 27. The dependent variables were the CIB composites obtained in the CFA. The independent variables were PND (yes/no) and PNA (yes/no). In all of the analyses, the covariates infant sex, mother’s educational level, and parity were controlled for, as studies have found these to have an effect on mother-infant interactions (Schiffman et al., 2003; Tronick & Reck, 2009). For maternal educational level, 5% of the data was missing, and 15% was missing for PNA. None of our observed variables were significantly associated with missingness (all ps ≥ 0.277), indicating that data was missing completely at random. We therefore ran complete case analyses. All p values are two-tailed and assessed at a significance level of 0.05.

Results

Descriptive Statistics

Socio-demographic characteristics of the sample are presented in Table 3. There were no significant differences in educational level between the two groups (χ2(2) = 0.26, p = 0.88). In the general population of the municipality, 52.3% of 20–44-year-old women had a middle-to-long formal education (Danmarks Statistik, 2021). In the present study, the majority of the sample (86.2%) had this level of education, and the sample can thus be characterised as highly educated.

Table 3 Sample characteristics

Factor Analyses

Figure 1 shows the scree plot. The inflection point was at four factors, resulting in three EFAs being run with three, four, and five factors being extracted, respectively. In all three, the item ‘Negative Affect’ loaded below 0.3 and was not included. For the three-factor EFA (EFA-3), there were no factors with fewer than three loading items. However, eight items crossloaded, with five of them crossloading with a difference below 0.2. EFA-4 had no factors with fewer than three items, and five items crossloaded, with two of them having a crossloading difference below 0.2. The item ‘Hostility’ had a factor loading below 0.3 and was not included in the EFA-4 model. Finally, EFA-5 had no factors with fewer than three items, and seven items crossloaded, with five of them having a crossloading difference below 0.2. The item ‘Affectionate Touch’ had a factor loading below 0.3. The best-fitting model was thus EFA-4, presented in Table 4, and used in the CFA.
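The screening rules applied to each EFA solution (loadings of at least 0.30, crossloadings with a difference below 0.2 between primary and secondary loading, and factors with fewer than three items) can be sketched as below. This is an illustrative reading of those rules, not the procedure run in SPSS: the function name, the item labels, and the loadings are hypothetical, and treating a loading of exactly 0.30 as sufficient is an assumption.

```python
def screen_items(loadings, min_loading=0.30, max_gap=0.20, min_items=3):
    """Apply the item-retention rules to a solution with at least two factors.

    `loadings` maps each item name to its list of loadings, one per factor.
    """
    n_factors = len(next(iter(loadings.values())))
    dropped, crossloaded, problematic = [], [], []
    items_per_factor = {factor: [] for factor in range(n_factors)}
    for item, row in loadings.items():
        ranked = sorted(
            ((abs(value), factor) for factor, value in enumerate(row)),
            reverse=True,
        )
        (primary, factor), (secondary, _) = ranked[0], ranked[1]
        if primary < min_loading:
            dropped.append(item)  # does not load on any factor
            continue
        items_per_factor[factor].append(item)
        if secondary >= min_loading:
            crossloaded.append(item)
            if primary - secondary < max_gap:
                problematic.append(item)  # crossloading difference too small
    weak_factors = [f for f, items in items_per_factor.items() if len(items) < min_items]
    return {
        "dropped": dropped,
        "crossloaded": crossloaded,
        "problematic": problematic,
        "weak_factors": weak_factors,
    }


# Hypothetical two-factor solution with six items.
toy_loadings = {
    "i1": [0.70, 0.10],
    "i2": [0.65, 0.05],
    "i3": [0.60, 0.45],
    "i4": [0.20, 0.25],
    "i5": [0.10, 0.80],
    "i6": [0.35, 0.70],
}
result = screen_items(toy_loadings)
print(result)
```

Comparing the counts of problematic crossloadings and weak factors across candidate solutions mirrors how the three-, four-, and five-factor solutions were weighed against each other.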

Fig. 1 Scree plot

Table 4 Factor loadings on the 4-factor EFA model (Model 3)

Results of the CFAs can be seen in Table 5 (upper). None of the three models showed a good model fit as χ2/df > 3, CFI < 0.9, and RMSEA > 0.08. Since none of the models had a good fit, we allowed for theoretically meaningful error correlations (see Table 6 for an overview). For instance, some items represented similar features of interactive quality, such as ‘Child Positive Affect’ and ‘Child Vocalisation’ (which is coded from positive vocalisations only), while others represented reversed features, such as ‘Alert’ and ‘Fatigue’. The results of the models with error correlations are also in Table 5, with the new models being noted with an ‘a’. Further, the item ‘Affectionate Touch’ did not load significantly in Model 1 (original theoretical model) or 3 (data-driven model) and was thus excluded from Model 1a and 3a. Model 3a was the only model that showed a good fit on the majority of the parameters. Even though Model 1a and 2a (alternative theoretical model with error correlations) had lower AIC and BIC values (indicating that they may have a better fit when not taking the number of items into account), they did not reach a good model fit on any of the other parameters, and thus Model 3a was considered the best-fitting model of them all and was used in subsequent analyses. As mentioned, some of the items were not normally distributed, meaning that using maximum likelihood estimation may bias the results. We thus ran the CFAs again using MLR. These analyses were not pre-registered and can only be interpreted as exploratory. The results in Table 5 (lower) show that the findings do not differ much from those obtained with the maximum likelihood estimator, apart from the χ2-values being lower, so that Model 3a now shows acceptable model fit on all of the parameters.

Table 5 Fit of the CFA on the three models (N = 290)
Table 6 Correlated errors in the confirmatory factor analyses

As AIC and BIC cannot be used to assess model fit differences in nested models (i.e. comparing fit in models with and without error correlations), we used exploratory likelihood ratio tests to investigate possible significant differences. We found that there were significant differences between Model 1 and 1a (LR = 240.81, p < 0.001), Model 2 and 2a (LR = 229.22, p < 0.001), and Model 3 and 3a (LR = 678.78, p < 0.001) with the model with error correlations fitting significantly better for all three cases as seen in Table 5.

To summarise this best-fitting model, it consists of four composites, three maternal and one child. The maternal sensitivity composite is quite comparable to the two theoretical models, and the maternal affect composite is similar to the one in the alternative theoretical model (i.e. the withdrawal composite as it is a sample with or at-risk of depression). The final maternal composite is how controlling the mother is in her behaviour towards the child (e.g. moving the child around), and the child composite measures how engaged the child is in the interaction.

MGCFA results are presented in Table 7 (upper). The unconstrained model (M1) showed a good fit on χ2/df (χ2/df = 2.70), whereas CFI was just below the threshold for a good fit (CFI = 0.89). Overall, the results were close to the threshold for an acceptable fit, indicating configural invariance (i.e. the same factor structure across the two groups). When comparing M1 and M2, Δχ2 was non-significant, and ΔCFI was below the threshold, indicating metric invariance (i.e. the factor loadings were equivalent in the two groups). Scalar invariance was also indicated when comparing M2 and M3, as again Δχ2 was non-significant and ΔCFI was below the threshold, meaning that factor loadings and item intercepts were equivalent across the two groups. When comparing M3 and M4, ΔCFI was below the threshold, but Δχ2 was significant, indicating residual non-invariance (i.e. residual variances were not equivalent in the two groups). The exploratory MGCFA using MLR is shown in Table 7 (lower). These results do not differ from the confirmatory results, except that when comparing M3 and M4, Δχ2 was no longer significant, indicating residual invariance. However, since these analyses were not pre-registered, they can only be interpreted as exploratory and must be confirmed in a new study.
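The stepwise decision rule applied above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the function name is hypothetical, the significance level of 0.05 and the ΔCFI cut-off of 0.01 are assumed because this section does not state them, and the example values are invented.

```python
def invariance_holds(delta_chi2_p, delta_cfi, alpha=0.05, cfi_cutoff=0.01):
    """Return True when a more constrained model is accepted.

    Both conditions used in the text must hold: a non-significant
    chi-square difference and a change in CFI below the cut-off.
    alpha and cfi_cutoff are assumed values.
    """
    return delta_chi2_p > alpha and abs(delta_cfi) < cfi_cutoff


# Invented values mirroring the pattern described for M3 versus M4:
# a small change in CFI but a significant chi-square difference.
residual_step = invariance_holds(delta_chi2_p=0.01, delta_cfi=0.004)
```

Under this rule, a significant Δχ2 alone is enough to reject the more constrained model, which is how residual non-invariance was concluded in the confirmatory analysis.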

Table 7 Group comparison (MGCFA) of Model 3a (data-driven), N = 418 (n_PND = 235, n_non-depressed = 183)

Associations Between PND, PNA, and CIB

Descriptive statistics for the CIB composites are presented in Table 8. The factorial ANCOVAs showed no significant main or interaction effects of PND or PNA on the four composites (all Fs ≤ 2.52, ps ≥ 0.11).

Table 8 CIB composites as a function of PND and PNA

Discussion

The aims of the present study were to (a) examine the factor structure of the CIB and (b) compare the effects of clinical levels of maternal depression and anxiety on mother-infant interactions with those of subclinical levels.

Factor Analyses of the CIB

The CFAs for the original model as proposed by Feldman (1998), the alternative theoretical model, and the data-driven model all revealed a poor model fit. When theoretically meaningful error correlations were allowed, only the model based on the EFA showed a good model fit across indices. As previously mentioned, only two previous studies have examined Feldman’s proposed factor structure of the CIB (i.e. the original theoretical model), with one study confirming the factor structure (cited in Feldman, 2012) and the other failing to confirm it (Steenhoff et al., 2019). The alternative theoretical model also failed to reach the thresholds for a good model fit, but it did have the lowest AIC and BIC values of the three models. Interestingly, both the original and alternative theoretical models had lower AIC and BIC than the data-driven one. Kozinszky et al. (2017) show that models based on EFAs will have difficulty fitting samples other than the ones they are based on and argue that alternative theoretical models may be a more rational way of testing and validating instruments across contexts and samples. Since the AIC and BIC values were lowest for the alternative theoretical model, this model may actually fit well across samples, and future studies should further investigate how EFA-based models and more theoretical models fit different samples. That said, it can be questioned whether a search for a universal CIB factor structure is reasonable. When an instrument is used across multiple populations, the assumption is that it assesses the same construct(s) on the same scale, i.e. measurement invariance (Widaman & Reise, 1997). However, Feldman (2012) argues that it is to be expected that the items of the CIB composites change across contexts because some parenting behaviours are more salient at one time of development or in one culture compared to another. For example, the item ‘Hostility’ has been shown to be an important item for detecting change in parenting behaviour in a high-risk sample (Steele et al., 2019), but it is probably not the most salient item in our Danish, well-functioning sample (ca. 96% of our sample had the minimum score of 1). As changes are to be expected, it may be difficult to find a universal factor structure with good model fit across all populations.

In the present study, the model based on the EFA was the only one to show a good model fit across the different indices when allowing for theoretically meaningful error correlations. Feldman (2012) argues that it is to be expected that her original model may not be confirmed and recommends conducting an EFA in different samples as some items will vary depending on the context, e.g. cultural differences or child age. Previous studies generally use the same composites of the CIB, but the items included in each composite do differ across studies (e.g. Dollberg et al., 2010; Feldman et al., 2002; Ferber et al., 2005). However, they all seem to agree that composites tapping into important aspects of parenting behaviour, such as ‘Maternal Sensitivity’, ‘Intrusiveness’, or ‘Maternal Controlling Behaviour’, are integral to the instrument. Further, the composite ‘Maternal Social Withdrawal’ was also used in another study including some of the same items as obtained in our analyses, i.e. ‘Parent Depressed Mood’ and ‘Parent Positive Affect’ (reversed) (Feldman et al., 2009). The present study’s proposal of a new factor structure of the CIB only serves to further complicate the comparison of results using the CIB to measure mother-infant interactions. Thus, it may be best to conduct an EFA in each new sample, as Feldman (2012) suggests, rather than try to use composites in a universal way when they seem liable to change due to sample characteristics. In practice, this may complicate the use of the CIB in two ways: comparing findings across populations, and using it in longitudinal studies and as an outcome measure in intervention studies that measure across age groups. Gridley et al. (2019) argue that it may be difficult to use observational measures of interaction quality in RCTs when validity has not been established. When the factor structure (and thus the construct we are interested in measuring) may be liable to change, a detected change in interaction quality may be due not to the intervention but to the changing factor structure. We were not able to investigate the latent growth model of the CIB to assess invariance across time, but future studies should make it a priority to establish how the factor structure may change across age groups.

Group comparison between mothers with and without PND showed configural, metric, and scalar invariance. This indicates that the factor structure was similar in the two groups, and that the maternal and child interactive behaviours co-varied in similar ways irrespective of diagnostic status. Measurement invariance is a precursor to any group comparison, as it ensures that observed response differences depend only on latent trait scores (i.e. true differences in the trait) and not on group membership. This means that statistically significant differences between groups can be considered non-biased and meaningful (Putnick & Bornstein, 2016). Residual invariance was not confirmed, indicating that the items’ residual (unique) variances were unequal in the two groups. This means that the item variance not shared with the latent factors, together with measurement error, differs between mothers with and without PND. However, residual invariance is of debatable importance since strict invariance is rarely achieved in practice (Van De Schoot et al., 2015). Further, residual invariance is not a prerequisite for testing mean differences, as the residuals are not part of the latent factor (Vandenberg & Lance, 2000), and confirming invariance up to the level of scalar/strong invariance is generally considered acceptable to claim that a measure is invariant across groups (Putnick & Bornstein, 2016). Therefore, we accepted residual non-invariance and proceeded to test for mean differences across the two groups.

Effects of PNA and PND on Mother-infant Interactions

Contrary to our study hypotheses, we found no main or interaction effects of clinical PND and PNA caseness on the four CIB composites, indicating no negative effect on mother-infant interaction quality. Previous research has consistently found that the interaction can be negatively affected for mothers with PND and/or PNA, both with the CIB (Feldman et al., 2009) and with other instruments (Crugnola et al., 2016; Nath et al., 2019; Neri et al., 2015). It is therefore somewhat surprising that we did not find any significant effects on the composites. However, there are several possible reasons for these differing results.

The sample in the present study was considered an at-risk sample as all mothers were included based on an EPDS score of 10 or above. While we had a non-PND group, these mothers still had elevated levels of depressive symptoms and could therefore not be considered a traditional healthy control group. Instead, they should be considered a subclinical group, and it is possible that we would have detected differences between our sample and a healthy control group. However, on average, our sample scored relatively well on the CIB composites; e.g. even the mothers with a combination of PND and PNA scored higher than or equivalent to the healthy control groups in previous studies on Maternal Sensitivity (Feldman et al., 2009; Nath et al., 2019; Neri et al., 2015). Additionally, it may be that mothers were able to compensate for potential parenting difficulties during the 5 min of recorded interaction, which is a relatively short period that may not be representative of the interaction during the rest of the day. However, this observation length is comparable to that in other studies. Furthermore, the interaction was a free play situation. There is some empirical evidence that the quality of the interaction may depend on the situation in which it is assessed (i.e. a relatively stress-free free play compared to a more structured, demanding situation) (Joosen et al., 2012; Leerkes et al., 2009). Therefore, different results may have been found if the interactions had been coded in a more distressing situation. Again, however, the previous studies also used a stress-free play situation to assess interaction quality, making it puzzling that we did not find any significant effects. Finally, it has been argued that there can be a self-selection bias in research (i.e. those who participate, and here are willing to be filmed for observation, differ in a systematic way from those who do not participate), leading to a distorted representation and possibly contributing to null findings (cf. Braver & Bay, 1992). However, the previous studies also made use of videotaped observations to rate interaction quality, meaning that this sampling bias should be the same across the studies and therefore cannot fully explain the difference in findings. It may thus be that the non-significant results are due to sample characteristics, i.e. cultural factors. In a Norwegian study investigating how prematurity and PND affect infant behaviour, the authors also found some non-significant results (Braarud et al., 2013). They attributed these to the longer maternity leave that is a stable feature across Scandinavian countries, and it may be that this longer leave gives the mothers in our sample a sense of support and security, meaning that poor mental health may not be such a detriment to interaction quality.

Limitations

When interpreting the results, limitations need to be taken into account. First, two factors in the theoretical models only had two loading items, meaning that we were unable to include them in the CFAs. Second, the non-PND group should be considered a subclinical group and, as such, we are only able to compare the effects of clinical levels of PND with those of subclinical levels. Measurement invariance of the CIB should thus also be examined in relation to healthy control mothers, as it is possible that different results could be obtained between more differentiated groups. Third, while PND status was based on a diagnostic interview, PNA status was derived from a self-report measure investigating current overall psychological distress and was not based on any diagnostic criteria. It may thus be that results would change if we had used different measurements and criteria for clinical status. However, national norms and cut-offs have been established (Olsen et al., 2006), with the cut-off indicating the difference between normal distress and clinical anxiety cases. While it is a strength that we randomly partitioned the sample for EFAs and CFAs, our fourth limitation is the resulting reduction in sample size, which may have led to instability of the models. For the CFA, we are in line with power recommendations from MacCallum et al. (1996), but that may be at the expense of the EFA sample (although the KMO indicated that our reduced sample size was adequate for analysis). However, Costello and Osborne (2005) showed that even with very large sample sizes (a 20:1 subject-to-item ratio), the analyses were still error-prone in optimal data, and they recommended that, since EFAs are exploratory in nature, power be saved for conducting CFAs. Nonetheless, it is important for our findings to be replicated in larger samples. Finally, error correlations were needed to reach a good model fit, but this may lead to problems of bias in parameter estimates, and good model fit may only be due to sampling error (Hermida, 2015). However, we only allowed for theoretically meaningful error correlations between items that depend on each other in their coding (e.g. the child items ‘Alert’ and ‘Fatigue’ are almost the reverse of each other). This may be an indication that the CIB consists of too many interdependent items. For example, in a study by Haltigan et al. (2019), the authors used item response theory to reduce the number of items of the Atypical Maternal Behaviour Instrument for Assessment and Classification (AMBIANCE; Bronfman et al., 2009), which assesses disrupted caregiver communication with the infant. Doing so made the instrument more applicable in clinical practice and increased its validity. We suggest that future studies investigate whether the same approach could benefit the CIB.

Conclusion

In conclusion, the present study confirmed a four-factor model of the CIB consisting of the composites ‘Maternal Sensitivity’, ‘Child Engagement’, ‘Maternal Social Withdrawal’, and ‘Maternal Controlling Behaviour’, and configural, metric, and scalar invariance between mothers with and without PND was reached. These results indicate that the CIB using this factor structure is a valid instrument for measuring mother-infant interaction quality regardless of whether the mother is suffering from PND or not. However, we also argue that these parenting items are liable to change due to sample characteristics such as culture or child age, and we would therefore urge users of the CIB to consider the challenges in using such an instrument, both across different cultural contexts and in longitudinal studies. Future studies should investigate the latent growth model of the CIB in order to assess invariance across time points. We would, thus, argue that the factor structure of the CIB needs to be validated in more culturally diverse samples before arguing that one sensitivity composite, for example, is more valid than another. Further, no cut-off scores exist, so the validity of using the composites of the CIB would perhaps also benefit from future research investigating when scores are clinically relevant and when they are just part of the variation in a normal range. Finally, we think that a next step is using item response theory to shorten the CIB so that it is more applicable in clinical practice. This method would also allow us to investigate whether some individual items are more stable across contexts and more salient for assessing the desired interaction behaviour than the composites are, which would further support the construct validity of the CIB.