Comparing the factor structures and reliabilities of the EPDS and the PHQ-9 for screening antepartum and postpartum depression: a multigroup confirmatory factor analysis

To evaluate and compare the factor structure and reliability of EPDS and PHQ in antepartum and postpartum samples. Parallel analysis and exploratory factor analysis were conducted to determine the structure of both scales in the entire sample as well as in the antepartum and postpartum groups. McDonald’s omega statistics examined the utility of treating items as a single scale versus multiple factors. Multigroup confirmatory factor analysis (MCFA) was utilized to test the measurement invariance between the antepartum and postpartum groups. Two-factor models fit best for the EPDS in both the antepartum and postpartum groups; however, the most reliable score variance was attributable to a general factor for each scale. MCFA provided evidence of weak invariance across groups regarding factor loadings and partial invariance regarding item thresholds. PHQ-9 showed a two-factor model in the antepartum group; however, the same model did not fit well in the postpartum group. EPDS should be preferred to PHQ-9 for measuring depressive symptoms in peripartum populations. Both scales should be used as a single-factor scale. Caution is required when comparing the antepartum and postpartum scores.


Introduction
Perinatal depression is a unipolar, non-psychotic depressive disorder (Howard et al. 2014) characterized by specific feelings and thoughts about the parental role (Langan & Goodbred 2016). It is one of the leading complications for people during pregnancy (antepartum depression) or following childbirth (postpartum depression) (Howard et al. 2014; Howard & Khalifeh 2020). Recent meta-analyses of maternal depression observed pooled prevalences of 15% and 14% for prenatal and postnatal depression, respectively (Liu Wang & Wang 2022; Yin et al. 2021).
Both new-onset and preexisting depression in pregnant or postpartum people are associated with increased maternal mortality, suicide, and self-harm, as well as adverse obstetrical, neonatal, and long-term outcomes for children (Howard & Khalifeh 2020). All of these consequences lead to substantially increased costs for healthcare systems (Knapp & Wong 2020; Luca et al. 2020). However, growing evidence has identified early screening and prompt management as crucial factors in reducing symptoms and preventing relapses in perinatal people and their families (Austin et al. 2017; O'Connor et al. 2019; Cena et al. 2021).
In line with the World Health Organization's (2020) recommendations, routine screening for perinatal depression through valid, reliable, and economical screening tools is probably the most widely accepted suggestion (Accortt & Wong 2017; ACOG Committee 2018). However, a consensus has not been reached on what scale can be considered the gold standard. The most frequently validated and utilized screening tools are the Edinburgh Postnatal Depression Scale (EPDS), Beck's Depression Inventory (BDI), and the Patient Health Questionnaire (PHQ-9) (Sambrook Smith et al. 2022a,b). Unlike the BDI, both the EPDS and PHQ-9 are free to use as public domain measures.
Antonella Gigantesco and Fiorino Mirabella contributed to this work and should be considered co-last authors.
On a theoretical level, a well-developed and validated disease-specific questionnaire should measure the same construct across different settings and patient populations. Based on this premise and given the clinical and research needs, the PHQ-9 and EPDS are widely used to evaluate levels of depressive symptomatology in both pregnant and postpartum people, examine their developmental trajectories, and compare the results among different groups. However, previous studies have found inconsistent factor structures for both the PHQ-9 and EPDS depending on the perinatal period (i.e., antepartum versus postpartum), ranging from one-factor (e.g., Berle et al. 2003; Woldetensay et al. 2018) to three-factor solutions (e.g., Marcos-Nájera et al. 2018; Matsumura et al. 2020). It should be noted that very few studies have investigated the factor structure of the PHQ-9 with pregnant people and, as far as we know, no study has investigated it in a postpartum sample or tested the measurement invariance of this measure across the perinatal period. Furthermore, of the Italian versions of the two scales, only the EPDS has been validated in a perinatal sample (more specifically, a postpartum sample).
Therefore, the aim of the present study is to evaluate and compare the factor structure and reliability of both the EPDS and the PHQ-9 in antepartum versus postpartum samples and to test for measurement invariance across the perinatal period.

Study design and sample
The data presented here were collected as baseline data for a longitudinal study (March 2017-June 2018) on screening and early intervention for maternal perinatal anxiety and depressive disorders. Eleven publicly funded primary or obstetrics-gynecology secondary care centers located throughout Italy were involved in the study as recruitment sites. The inclusion criteria were being pregnant regardless of the trimester of pregnancy (antepartum group) or having a biological newborn aged ≤6 months (postpartum group), and being able to speak and read Italian. The exclusion criteria were having issues with drug or substance misuse and/or having ongoing psychotic symptoms. All participants signed informed consent forms after being provided oral and written explanations of the aims and protocol of this study. This study was approved by the Ethical Committee of the Healthcare Centre of Bologna Hospital. The rationale and full methodology of the larger study have been described in the study protocol.

Data collection
Each participant was interviewed in a private room inside the healthcare center by a clinical psychologist trained in perinatal clinical psychology and associated with the healthcare center. The aim of the interview was to gather information on the participants' current and past psychiatric conditions and the use of psychotropic drugs, as well as their current experience with symptoms of stress, anxiety, and depression. At the end of that interview session, all participants completed the EPDS and PHQ-9 on their own as self-report questionnaires. Information on the demographic, economic, psychosocial, and reproductive characteristics of participants was also collected.

Edinburgh Postnatal Depression Scale
The EPDS (Cox et al. 1987) is the most widely used self-administered instrument to screen for perinatal depression (Sambrook Smith et al. 2022a,b). It can be used to assess depression according to the DSM-5 (American Psychiatric Association 2013) criteria (Smith-Nielsen et al. 2018). The EPDS was originally designed to assess the severity of depressive symptoms in new mothers and was subsequently used to screen for antepartum depression. It assesses the frequency of each of the following depressive symptoms as experienced in the previous 7 days: anhedonia (two items), guilt, anxiety, panic attack, feeling overwhelmed, sleep disturbance, sadness, tearfulness, and suicidal thoughts. The validated Italian translation of the EPDS showed a Cronbach alpha coefficient of 0.79 and a Guttman split-half coefficient of 0.81 (Benvenuti et al. 1999).
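As an illustration of the internal-consistency statistic reported above, Cronbach's alpha can be computed directly from item-level responses. The following is a minimal Python sketch, not the code used in the validation studies; the function name `cronbach_alpha` and the toy responses are hypothetical and are not study data.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents.
    item_vars = [pvariance([r[j] for r in rows]) for j in range(k)]
    # Variance of the summed scale score.
    total_var = pvariance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of five people to four items scored 0-3 (not study data).
toy = [
    [0, 1, 0, 1],
    [1, 1, 1, 2],
    [2, 2, 1, 2],
    [3, 2, 3, 3],
    [1, 0, 1, 1],
]
alpha = cronbach_alpha(toy)
print(round(alpha, 3))
```

Because the same variance estimator is used in the numerator and the denominator, the choice between population and sample variance does not change the result.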

Patient Health Questionnaire-9
The PHQ-9 (Kroenke et al. 2001) is a self-administered depression screening scale containing nine items corresponding to the DSM-IV (American Psychiatric Association 1994) criteria for depression. Furthermore, it can measure depression severity based on the DSM-5 (American Psychiatric Association 2013) criteria (Spitzer et al. 2014). The PHQ-9 is the most widely used depression measure across clinical practice settings worldwide (Hirschtritt & Kroenke 2017; Kroenke 2021) and has been identified as the most reliable depression screening tool (El-Den et al. 2018; Negeri et al. 2021). It assesses the frequency of each of the following depressive symptoms as experienced in the previous 2 weeks: anhedonia, depressed mood, insomnia or hypersomnia, fatigue or loss of energy, appetite disturbances, feelings of worthlessness or excessive guilt, diminished ability to think or concentrate, psychomotor agitation or retardation, and suicidal thoughts. The internal consistency (Cronbach's alpha) of the PHQ-9 administered to an obstetric-gynecology sample was 0.86 (Kroenke et al. 2001). The Italian translation of the PHQ-9 showed sensitivity, specificity, and positive predictive values of 39, 29, and 93%, respectively, for any depressive syndrome (Mazzotti et al. 2003).
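For readers less familiar with the screening metrics quoted above, the sketch below shows how sensitivity, specificity, and positive predictive value follow from a two-by-two table of screening results against a reference diagnosis. It is a Python illustration only; the function name `screening_metrics` and the counts are hypothetical and do not reproduce any cited study.

```python
def screening_metrics(tp, fp, fn, tn):
    """Return basic screening accuracy metrics from confusion-matrix counts."""
    return {
        # Proportion of true cases that screen positive.
        "sensitivity": tp / (tp + fn),
        # Proportion of non-cases that screen negative.
        "specificity": tn / (tn + fp),
        # Proportion of positive screens that are true cases.
        "ppv": tp / (tp + fp),
    }


# Hypothetical counts (not study data).
metrics = screening_metrics(tp=30, fp=10, fn=20, tn=140)
print(metrics)
```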

Statistical analyses
Descriptive statistics were computed for each variable, including means and standard deviations (SDs) for continuous variables and frequencies and percentages for categorical variables. Parallel analysis using the R package EFAtools v0.4.1 (Steiner & Grieder 2020) was performed on a polychoric correlation matrix using the mean eigenvalues and 95th percentile eigenvalues of 5,000 simulated random datasets. The factor structures of both the EPDS and PHQ-9 were explored separately through exploratory factor analysis (EFA) and multiple-group confirmatory factor analysis (CFA) using the R packages EFAtools v0.4.1 (Steiner & Grieder 2020) and lavaan v0.6-11 (Rosseel 2012). First, parallel analysis evaluated the number of factors that may be supported by the data in the entire sample as well as in the antepartum and postpartum subgroups by comparing actual eigenvalues to random eigenvalues sampled at the 95th percentile (Glorfeld 1995). The scree plot and the eigenvalues associated with each factor were also examined to identify the number of meaningful factors. Next, a series of EFA models with maximum likelihood extraction and oblique rotation was performed to evaluate item loadings. These analyses were repeated three times, setting the number of extracted factors to three, two, and one, given the results of the parallel analyses and because no studies have indicated structures of four or more factors for either the EPDS or the PHQ-9 (for a review of various factor models of the EPDS see Matsumura et al. 2020; for the PHQ-9 see Barthel et al. 2015; Smith et al. 2022a,b; and Marcos-Nájera et al. 2018). Factor loadings ≥ 0.32 were used in the factor designation (Tabachnick & Fidell 2019). Next, the model with the best fit was tested by the multiple-group CFA method in order to assess measurement invariance between the prenatal and postnatal groups.
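The retention rule used in parallel analysis can be sketched as follows. This Python illustration is not the EFAtools implementation used in the study; it assumes the observed and simulated eigenvalues have already been computed elsewhere, it uses only four simulated datasets instead of 5,000, and the function names and all numbers are hypothetical.

```python
def percentile(values, q):
    """Linearly interpolated q-th percentile (0-100) of a list of numbers."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def n_factors_parallel(observed, simulated, q=95):
    """Count leading factors whose observed eigenvalue exceeds the q-th
    percentile of the eigenvalues at the same position in random data."""
    retained = 0
    for i, obs in enumerate(observed):
        reference = percentile([sim[i] for sim in simulated], q)
        if obs <= reference:
            break  # stop at the first factor that fails the comparison
        retained += 1
    return retained


# Hypothetical eigenvalues (not study data).
observed = [4.10, 1.35, 0.90, 0.75]
simulated = [
    [1.20, 1.10, 1.02, 0.95],
    [1.25, 1.12, 1.00, 0.96],
    [1.18, 1.08, 1.03, 0.97],
    [1.22, 1.11, 1.01, 0.94],
]
k = n_factors_parallel(observed, simulated)
print(k)
```

In this toy example the first two observed eigenvalues exceed their random reference values and the third does not, so two factors would be retained.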
A well-fitting baseline model was established, and the effects of equality constraints across groups were evaluated by likelihood ratio tests. Evidence for reasonably good fit was assessed using standard fit indices, including the root mean square error of approximation (RMSEA; values close to 0.06 or below are considered good) and comparative fit index (CFI; close to 0.95 or greater). All tests were two-tailed, with the statistical significance level set at α = 0.05. Lastly, omega reliability coefficients were calculated using the R package psych v2.2.9 (Revelle 2022). Omega total measures the total reliable variance for each scale, and omega hierarchical indexes the variance attributable to a single general factor. High values of omega total indicate an overall reliable scale, and high omega hierarchical values support interpreting item scores as a single scale.
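To make the two fit indices concrete, the sketch below computes RMSEA and CFI from model and baseline chi-square statistics using their common textbook single-group forms. It is a Python illustration under stated assumptions: software packages differ in details (for example, using N rather than N − 1, multigroup corrections, or scaled statistics for ordinal estimators), so these functions will not necessarily reproduce lavaan output, and all input values are hypothetical.

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation (single-group textbook form)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, 0.0)
    denom = max(d_model, d_base)
    return 1.0 if denom == 0 else 1.0 - d_model / denom


# Hypothetical chi-square values (not study results).
fit_rmsea = rmsea(chi2=120.0, df=34, n=1477)
fit_cfi = cfi(chi2_model=120.0, df_model=34, chi2_base=3000.0, df_base=45)
print(round(fit_rmsea, 3), round(fit_cfi, 3))
```

With these made-up inputs, both indices fall on the acceptable side of the cutoffs stated above (RMSEA near 0.06 or below, CFI near 0.95 or above).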
All statistical analyses were performed with R version 4.2.0 (R Core Team 2022).
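Although the analyses were run in R, the logic of the two omega coefficients can be illustrated in a few lines of Python. The sketch assumes standardized items and orthogonal general and group factors (as in a bifactor or Schmid-Leiman solution); it is not the psych package's implementation, and the function name and loadings are hypothetical.

```python
def omega_coefficients(general, groups):
    """Omega hierarchical and omega total from factor loadings.

    general: general-factor loading of each item.
    groups:  one list per group factor; groups[k][i] is item i's loading on
             group factor k (0.0 where the item does not load).
    """
    n_items = len(general)
    # Variance in the total score due to the general factor.
    gen_var = sum(general) ** 2
    # Variance in the total score due to the group factors.
    grp_var = sum(sum(g) ** 2 for g in groups)
    # Unique variance of each standardized item is 1 minus its communality.
    uniq = sum(
        1.0 - general[i] ** 2 - sum(g[i] ** 2 for g in groups)
        for i in range(n_items)
    )
    total_var = gen_var + grp_var + uniq
    return {
        "omega_hierarchical": gen_var / total_var,
        "omega_total": (gen_var + grp_var) / total_var,
    }


# Hypothetical loadings for four items and two group factors (not study data).
general = [0.6, 0.6, 0.6, 0.6]
groups = [[0.4, 0.4, 0.0, 0.0], [0.0, 0.0, 0.4, 0.4]]
om = omega_coefficients(general, groups)
print(om)
```

When omega hierarchical is close to omega total, most of the reliable variance is carried by the general factor, which is the pattern that supports scoring a scale as a single total.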

Sample characteristics
Approximately 30% of the subjects approached refused to participate in the study, and one subject was not eligible to participate due to ongoing psychotic symptoms. No participants dropped out during the baseline evaluation. The overall sample included 1477 people: 1166 pregnant people and 311 new mothers. The two groups did not differ in nationality, marital status, educational level, working status, economic status, having planned the pregnancy or not, resorting to assisted reproductive technology or not, or history of past abortions. Compared to pregnant people, new mothers were older (p < 0.01), were more likely to have had previous pregnancies (p < 0.01), and were more likely to have children living at the time of this pregnancy/birth (p < 0.01). The sociodemographic and reproductive information is shown in Table 1.

Parallel analysis
The number of factors identified by the parallel analyses with principal component analysis (PCA), exploratory factor analysis (EFA), and squared multiple correlation (SMC) was as follows: EPDS whole group: one, five, and six; EPDS antepartum group: two, six, and six; EPDS postpartum group: one, four, and NA; PHQ-9 whole group: one, three, and four; PHQ-9 antepartum group: two, four, and five; PHQ-9 postpartum group: one, five, and six.

Exploratory factor analysis (EFA)
For both the EPDS and PHQ-9, we ran EFAs comparing the two models suggested by parallel analyses (i.e., two-factor and three-factor models) using the entire sample, the antepartum sample, and the postpartum sample (see Table 2).
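The loading cutoff of 0.32 stated in the statistical analyses can be applied mechanically to an EFA loading matrix to designate items to factors. The Python sketch below is an illustration only: the use of absolute loadings is an assumption not stated in the text, and the function name and loadings are hypothetical rather than values from Table 2.

```python
CUTOFF = 0.32  # loading threshold stated in the statistical analyses


def assign_items(loadings, cutoff=CUTOFF):
    """Map each item to the indices of the factors it loads on.

    loadings: dict of item name -> list of loadings, one per factor.
    An item with several indices cross-loads; an empty list means the item
    does not reach the cutoff on any factor.
    """
    return {
        item: [f for f, lam in enumerate(lams) if abs(lam) >= cutoff]
        for item, lams in loadings.items()
    }


# Hypothetical two-factor loadings (not study data).
toy_loadings = {
    "item1": [0.71, 0.05],
    "item2": [0.10, 0.64],
    "item3": [0.40, 0.35],
    "item4": [0.20, 0.15],
}
assigned = assign_items(toy_loadings)
print(assigned)
```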

Reliability
Both scales performed similarly across measures of reliability and internal consistency, though the EPDS showed slightly higher values across all metrics. Scores on both scales had adequate alphas (.80 and .84 for PHQ-9 and EPDS, respectively) and similarly high overall reliable variance (omega total) based on a two-factor hierarchical model (Revelle & Condon 2019). Compared to the PHQ-9,
Table 3 Confirmatory factor analysis indices of the two-factor and three-factor models of the Edinburgh Postnatal Depression Scale (EPDS) and Patient Health Questionnaire-9 (PHQ-9)
The antepartum sample consists of 1166 pregnant people, while the postpartum sample consists of 311 people who gave birth to one or more children in the 6 months prior to the time of data collection. The entire sample includes both antepartum and postpartum samples. The items' scale assignments are those indicated in Table 2.

Comparison with previous studies
The results presented in this study supported a two-factor solution for both scales across perinatal samples. However, while the EPDS performs well in both the antepartum and postpartum groups in terms of factor model fit and reliability (alpha, omega, and average item correlation), the PHQ-9 shows adequate performance only in the antepartum group and has inconsistent factor loadings and poor model fit in the postpartum group. Therefore, our findings indicate that the PHQ-9 may not be well suited for measuring depressive symptoms in the postpartum Italian-speaking population and that the EPDS should be preferred. For both scales, however, caution is required when comparing antepartum to postpartum scores, as discussed below. Lastly, given that the general factor heavily saturates the individual factors in both scales, the EPDS and PHQ-9 should probably be used as single-factor scales.
The two-factor structure model of the EPDS was consistently observed in the whole sample (without using residual covariances) as well as separately in the antepartum and postpartum samples. The two factors detected were related to depression and anxiety symptoms, respectively. Invariance testing revealed that the factor loadings, but not the item thresholds, can be equated across the antepartum and postpartum groups. This suggests that although the EPDS items are related to the construct of depressive symptomatology in a similar way, one should exercise caution in interpreting mean differences across antepartum and postpartum groups. On a practical level, this means that a score of X in the antepartum period does not necessarily indicate the same level of depressive symptoms as a score of X in the postpartum period, but a change of ±Y points likely indicates the same change in both groups.
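Why unequal thresholds complicate score comparisons can be illustrated with a simplified ordinal probit item model in which the probability of scoring at least k is Φ(λθ − τ_k). The Python sketch below is not the fitted model from this study: the parameterization is simplified, and the loading, the thresholds, the group labels, and the direction of the difference are all hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_item_score(theta, loading, thresholds):
    """Expected score of an ordinal item scored 0..len(thresholds), where
    P(score >= k) = Phi(loading * theta - tau_k) for each threshold tau_k."""
    return sum(norm_cdf(loading * theta - tau) for tau in thresholds)


# Same loading in both groups (weak invariance), different thresholds.
loading = 0.8
tau_group_a = [0.0, 1.0, 2.0]  # hypothetical thresholds, group A
tau_group_b = [0.3, 1.3, 2.3]  # hypothetical thresholds, group B
theta = 1.0                    # identical latent symptom level

score_a = expected_item_score(theta, loading, tau_group_a)
score_b = expected_item_score(theta, loading, tau_group_b)
print(round(score_a, 3), round(score_b, 3))
```

In this toy case two people with the same latent symptom level have different expected item scores solely because the thresholds differ between groups, which is why raw scores cannot be assumed to be directly comparable across groups.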
Our results concerning the factor structure of the EPDS are in line with the only previous Italian study on the topic (Della Vedova et al. 2022). However, they are inconsistent with most of the international literature, which has found a three-factor solution (e.g., Coates et al. 2017; Kubota et al. 2018; Long et al. 2020). Differences in factor number and composition may plausibly depend on differences in cultural and/or language features. In fact, culturally sensitive cutoff values for the EPDS have been recommended, and they vary considerably, ranging from nine to fourteen for different populations (Halbreich & Karkun 2006; Smith-Nielsen et al. 2018). Such differences are likely owing to cultural variations in the attributions and expressions of depressive symptoms and the language used to describe them (Haroz et al. 2017; Lara-Cinisomo et al. 2020).
Regarding the PHQ-9, our findings suggest a two-factor structure model in the antepartum group. Unlike the EPDS, only very few studies have thus far investigated the factor structure of the PHQ-9 in perinatal samples. Different factor structures were found during the antepartum period, and it seems plausible that these differences stem from cultural differences. Two studies involving Peruvian pregnant women agreed on indicating the same two-factor solution with the same items assigned to each scale (Smith et al. 2022a,b; Zhong et al. 2014). Similarly, a Japanese study found a two-factor model but with different assignments of items to scales (Wakamatsu et al. 2021). Two further studies, one involving Ivorian and Ghanaian pregnant women and the other Ethiopian pregnant women, suggested a one-factor structure (Barthel et al. 2015; Woldetensay et al. 2018). Finally, a three-factor model (cognitive-affective, somatic, and pregnancy-related) was considered adequate to screen depression in Spanish pregnant women (Marcos-Nájera et al. 2018). To our knowledge, no studies except ours have examined the factor structure of the PHQ-9 in postpartum samples.
A recent systematic review and meta-analysis on screening for perinatal depression identified 15 studies providing psychometric comparisons between the EPDS and PHQ-9 and found that their operating characteristics of sensitivity, specificity, and area under the curve were remarkably similar (Wang et al. 2021). However, this study focused on the diagnostic accuracy of these scales rather than their psychometric properties. The present study offers important new evidence about the measurement invariance of these scales across the perinatal period which can inform the choice of which scale to use in clinical practice and research.
The different performances observed between the PHQ-9 and EPDS, especially in the postpartum group, may be partly explained by the two scales capturing partially distinct features of depressive symptomatology. In fact, growing evidence indicates that genetic etiologies for perinatal depression overlap only partially with those for non-perinatal depression (Viktorin et al. 2016) and that there exist different types and severities of perinatal depression (Putnam et al. 2017). Only depression occurring in the later postpartum period (i.e., after the 8th week postpartum) seems to be more similar to a major depressive disorder occurring outside of the perinatal period (Batt et al. 2020). It is therefore possible that the main differences are related to the specific development of the two scales. The EPDS was specifically devised for postpartum depression using items drawn from three scales for anxiety and depression [i.e., the Irritability, Depression, and Anxiety Scale (Snaith et al. 1978), the Hospital Anxiety and Depression Scale (Zigmond & Snaith 1983), and the Anxiety and Depression Scale (Bedford et al. 1976)], and deemphasizing the somatic symptoms that might overlap with depressive symptoms even when they should be considered normative during postpartum. The PHQ-9 was instead developed specifically to identify depressive disorders based on DSM-IV criteria and was derived from the Primary Care Evaluation of Mental Disorders (PRIME-MD; Spitzer et al. 1994), which was originally devised to identify mood, anxiety, somatoform, alcohol, and eating disorders in the general population. As a result, in both scales, some items are not entirely consistent with the depressive dimension; the PHQ-9 includes items addressing somatic symptoms, whereas the EPDS includes items addressing anxiety. This is a key difference because, on the one hand, somatic symptoms are strongly experienced by perinatal women, even if they are not clinically depressed (Pereira et al. 2014), and the presence of somatic symptoms during antenatal depression predicts postpartum depressive symptoms even if these symptoms have subsided (Roomruangwong et al. 2017). On the other hand, besides depressive symptoms, anxiety is the most common psychological symptom observed in both pregnant people and new mothers (Cena et al. 2021a, 2021b; Nakić Radoš et al. 2018).

Strengths and limitations
The strengths of the present study include the use of a large perinatal sample and several clinical centers located throughout Italy. Furthermore, this study used multigroup confirmatory factor analysis to assess measurement invariance across the perinatal period; to our knowledge, it is the first paper to apply this modern psychometric approach to compare the EPDS and PHQ-9. Finally, this is the first study to examine the factor structure of the Italian version of the EPDS in an antepartum sample, as well as the first to examine the factor structure of the Italian version of the PHQ-9 in a perinatal sample. However, there are also some noteworthy limitations. Firstly, the cross-sectional design precludes the evaluation of the test-retest reliability of the scales. Secondly, the factor structure of both the EPDS and the PHQ-9 across trimesters was not examined. Lastly, because our sample was entirely composed of people living in Italy, it may not be representative of populations in other countries.

Conclusion
In conclusion, in the present study, the Italian version of the EPDS demonstrated reliability but only weak measurement invariance (i.e., factor loadings could be equated) across antepartum and postpartum groups. In contrast, the Italian version of the PHQ-9 showed adequate performance with pregnant people but had inconsistent factor loadings and poor model fit with postpartum people. Therefore, we conclude that the EPDS should be preferred to the PHQ-9 for measuring depressive symptoms in the perinatal population but should be used with caution when comparing antepartum to postpartum scores. Lastly, we recommend that both the EPDS and PHQ-9 be used as single-factor scales.
Author contribution Alberto Stefana: conceptualization, formal analysis, writing the original draft, and writing the review and editing. Joshua A. Langfus: formal analysis, writing the original draft, and writing the review and editing. Gabriella Palumbo: writing the review and editing. Loredana Cena: project administration and writing the review and editing. Alice Trainini: data curation and writing the review and editing. Antonella Gigantesco: supervision and writing the review and editing. Fiorino Mirabella: formal analysis, supervision, and writing the review and editing.
Funding Open access funding provided by Università degli Studi di Pavia within the CRUI-CARE Agreement. The work of the first author was supported by a Marie Sklodowska-Curie global fellowship from the European Union's Horizon 2020 research and innovation programme (grant agreement no. 101030608).

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflict of interest The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.