Introduction

Postpartum depression (PPD) is one of the most common medical complications during and after pregnancy, resulting in a substantial health-related quality of life burden for mothers, children, and partners (Callaghan et al. 2010; Centers for Disease Control and Prevention 2018; Centers for Disease Control and Prevention 2019; De Sisto et al. 2014; Ko et al. 2017; Martin et al. 2019; Moore Simas et al. 2019; Roberts et al. 2013; Vismara et al. 2016). In the USA, an estimated 11.5% of new mothers experience symptoms of PPD, with global estimates of 17.7% (Hahn-Holbrook et al. 2017; Ko et al. 2017).

Clinicians treating perinatal women commonly use the Edinburgh Postnatal Depression Scale (EPDS) or Patient Health Questionnaire (PHQ)-9 as PPD screening tools. Both the EPDS and PHQ-9 are patient-reported instruments and therefore provide a unique perspective on the patient’s experience of their symptoms. Given the increasing focus on patient-centred care, understanding and incorporating the patient’s voice into the assessment of health outcomes can facilitate shared decision-making, early diagnosis, individualized care, and patient satisfaction with treatment. The EPDS is a validated screening tool for PPD widely used to assess the likelihood of clinical depression and recently recommended by the American College of Obstetricians and Gynecologists (ACOG Committee Opinion No. 757 2018; Cox et al. 1987; Smith et al. 2016). Although the EPDS was developed as a screening tool, it is often used as an outcome measure in PPD studies and clinical practice (Appleby et al. 1997; McCabe-Beane et al. 2016; Sharp et al. 2010). The EPDS has been shown to perform well when compared to clinician-reported tools such as structured clinical interviews and to perform more consistently than other patient-reported tools such as the Beck Depression Inventory (Beck et al. 1988; Bennett et al. 2004; Ji et al. 2011; Spinelli et al. 2013). The PHQ-9 is a general depression screening questionnaire that is also commonly used to screen for PPD, particularly in primary care settings; it has been found to be highly specific for identifying PPD and has psychometric properties, sensitivity, and specificity comparable to those of the EPDS (Flynn et al. 2011; Gjerdingen et al. 2009; Kroenke et al. 2001; O'Connor et al. 2016; Sidebottom et al. 2012; Sit and Wisner 2009; Zhong et al. 2014).

The 17-item Hamilton Rating Scale for Depression (HAMD-17) is widely used as a primary endpoint in major depressive disorder (MDD) and PPD clinical trials and is commonly required by regulatory bodies (e.g. Clayton et al. 2015; Hantsoo et al. 2014; Papakostas et al. 2012). As an established clinician-reported outcome measure, it is regarded as the gold standard assessment in clinical trials, demonstrating stability over time and allowing for comparison of treatment effects between studies (Ji et al. 2011; Khan et al. 2002; Kim et al. 2014). However, given its complexity and length of time to complete, the HAMD-17 is less commonly used to assess depression in clinical practice. While EPDS and PHQ-9 severity group cut-off scores have been reported (Ji et al. 2011; Kroenke et al. 2001; McCabe-Beane et al. 2016), no mapping of EPDS or PHQ-9 scores to predict HAMD-17 scores has previously been reported. Understanding the direct relationship between patient-reported outcomes as measured either by the EPDS or by the PHQ-9 and the clinician-reported HAMD-17, based on evidence from a PPD-specific population, could further validate these screening tools in the identification of patients in need of treatment given symptom severity and enhance early diagnosis in this vulnerable population.

The present post hoc analysis aimed to bridge the existing interpretation gap and examine the relationship between the EPDS, PHQ-9, and HAMD-17 instruments. These post hoc analyses used an integrated dataset combining the data from three randomized, double-blind trials of intravenous brexanolone compared to intravenous placebo in women with moderate and severe PPD (Kanes et al. 2017; Meltzer-Brody et al. 2018a). More specifically, using the HAMD-17 as the criterion measure, this study aimed to (1) assess the association between the current EPDS and PHQ-9 definitions of remission and the HAMD-17 definition of remission, (2) assess the association of the EPDS and PHQ-9 item and total scores with the HAMD-17 total score, and (3) develop a mapping equation to permit the estimation of HAMD-17 scores from EPDS and PHQ-9 scores.

Methods

Study design and participants

The current report utilizes data from phase 2 and phase 3 clinical trials, which examined the safety and efficacy of intravenous brexanolone compared with intravenous placebo in patients with moderate and severe PPD (clinicaltrials.gov identifiers NCT02614547, NCT02942004, and NCT02942017). Full descriptions of the trial designs and inclusion and exclusion criteria have been published previously (Kanes et al. 2017; Meltzer-Brody et al. 2018a). Briefly, one phase 2 (study A) and two phase 3 (studies B and C) multicentre, randomized, double-blind, placebo-controlled trials were conducted across the USA under an umbrella protocol. All trials enrolled female participants who were aged 18–45 years; were in good physical health; had a major depressive episode with onset no earlier than the third trimester and no later than 4 weeks after delivery, as determined by the Structured Clinical Interview for Diagnostic and Statistical Manual of Mental Disorders-IV Axis I Disorders (First et al. 1996); had a HAMD-17 total score ≥ 26 (severe PPD, studies A and B) or 20–25 (moderate PPD, study C); and were 6 months postpartum or less at screening. Patients were randomized either 1:1:1 to receive a 60-h infusion of brexanolone 90 μg/kg/h, brexanolone 60 μg/kg/h, or placebo (study B), or 1:1 to receive a 60-h infusion of brexanolone 90 μg/kg/h or placebo (studies A and C). Patients who received the recommended target dose of brexanolone, 90 μg/kg/h, or placebo were included in the integrated efficacy dataset used for these analyses.

Patients were treated in a medically supervised setting for 72 h: 60 h of study drug infusion and 12 h for completion of assessments. Patients were followed until day 30, with clinical and safety assessments at days 7 and 30. The EPDS, PHQ-9, and HAMD-17 were administered at baseline, hour 60, day 7, and day 30 for all patients. The primary endpoint in each trial was the least-squares mean difference in change from baseline in HAMD-17 total score at hour 60.

Measures

The present analysis included three clinical outcome assessments: the patient-reported EPDS, patient-reported PHQ-9, and clinician-reported HAMD-17. The psychometric properties of these scales have been extensively described in the literature (Bagby et al. 2004; Flynn et al. 2011; Hamilton 1960; Kroenke et al. 2001; McCabe-Beane et al. 2016; Zhong et al. 2014).

The EPDS is a 10-item patient-reported outcome measure of depressive symptom severity specific to the perinatal period, originally developed as a screening tool. The EPDS has a recall period of 7 days. Items are scored on a 0–3 scale and summed to compute a total score, which ranges from 0 to 30, with higher scores indicating more severe depression. A score of 10 has been recommended as a cut-off point for the presence of minor PPD, and a score of 13 and above for major PPD in English-speaking women (Cox et al. 1987; Harris et al. 1989; Murray and Carothers 1990). Remission was defined as an EPDS score of < 10 in the present analysis.

The PHQ-9 is a 9-item patient-reported outcome and depression screening tool. The PHQ-9 has a recall period of 14 days. Items are scored on a 0 (not at all) to 3 (nearly every day) scale and summed for a total score ranging from 0 to 27. Remission or minimal symptom severity has been defined as a score of < 5 (Kroenke et al. 2001).

The HAMD-17 is a clinician-reported scale that evaluates core symptoms of depression. Items are scored on a 0 (none/absent) to 4 (most severe) or 0 (none/absent) to 2 (severe) scale (Hamilton 1960). Individual item scores are summed to compute the total score, which ranges from 0 to 52, with higher scores indicating more severe depression. Clinical remission is defined as a HAMD-17 total score of ≤ 7.
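The scoring ranges and remission thresholds described above can be written down compactly. The following is a minimal Python sketch under stated assumptions: the names `REMISSION_RULES` and `is_remitter` are hypothetical and are not taken from the trial analysis code; only the ranges and thresholds come from the text.

```python
# Illustrative sketch only: names are hypothetical. The thresholds are those
# stated in the Measures section (EPDS < 10, PHQ-9 < 5, HAMD-17 <= 7).
REMISSION_RULES = {
    # instrument: (maximum total score, remission test applied to the total)
    "EPDS": (30, lambda total: total < 10),
    "PHQ-9": (27, lambda total: total < 5),
    "HAMD-17": (52, lambda total: total <= 7),
}


def is_remitter(instrument: str, total_score: int) -> bool:
    """Return True if the total score meets the remission definition."""
    max_total, rule = REMISSION_RULES[instrument]
    if not 0 <= total_score <= max_total:
        raise ValueError(f"{instrument} total must lie in 0..{max_total}")
    return rule(total_score)


# Example: the same patient can be classified differently by each instrument.
print(is_remitter("EPDS", 9), is_remitter("PHQ-9", 5), is_remitter("HAMD-17", 7))
```

Note that the EPDS and PHQ-9 definitions use a strict inequality, whereas the HAMD-17 definition includes the threshold value itself.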

Statistical analysis

The analysis was conducted on the integrated efficacy dataset. The efficacy dataset included all randomized patients who started the infusion of study drug and who had a valid baseline HAMD-17 assessment and at least one post-baseline HAMD-17 assessment. As the objective of this analysis was to examine the relationship between the EPDS, PHQ-9, and HAMD-17, data were pooled across treatment arms. Furthermore, only the day 30 assessment point data (end of study) were utilized. Due to study inclusion criteria, the data at baseline and earlier assessment points showed less variability, and variability is required when trying to establish a relationship between measures (Salkind 2010).

Patients with day 30 EPDS, PHQ-9, and HAMD-17 data were included in the analytical sample for the present analysis. The sample characteristics were summarized descriptively.

To examine the association of EPDS remission (score < 10) and PHQ-9 remission (score < 5) with HAMD-17 remission (score ≤ 7), the proportions of EPDS and PHQ-9 remitters were compared with the proportion of HAMD-17 remitters using Fisher’s exact test; concordance was tested using the Kappa coefficient. Kappa values were interpreted as poor (< 0), slight (0–0.2), fair (> 0.2–0.4), moderate (> 0.4–0.6), or substantial (> 0.6–0.8) (Landis and Koch 1977).
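To make the two statistics concrete, the following Python sketch computes a two-sided Fisher’s exact test p value and Cohen’s Kappa for a 2 × 2 remission table. It is an illustration under stated assumptions, not the SAS code used in the study: the function names and the cell layout (a = remitter on both instruments, b = remitter on the screening tool only, c = remitter on the HAMD-17 only, d = non-remitter on both) are hypothetical, and the example counts are invented.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Integer weights avoid floating-point ties when comparing probabilities.
    weights = {
        x: comb(row1, x) * comb(row2, col1 - x)
        for x in range(lowest, highest + 1)
    }
    extreme = sum(w for w in weights.values() if w <= weights[a])
    return extreme / comb(row1 + row2, col1)


def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's Kappa: agreement beyond chance for a 2x2 table."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa: float) -> str:
    """Label a Kappa value using the bands quoted in the text."""
    if kappa < 0:
        return "poor"
    bands = [(0.2, "slight"), (0.4, "fair"), (0.6, "moderate"), (0.8, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    # Top band of Landis and Koch (1977); it is not listed in the text above.
    return "almost perfect"


# Hypothetical counts, for illustration only (not the study data).
counts = (35, 15, 10, 40)
kappa = cohen_kappa(*counts)
print(fisher_exact_two_sided(*counts), kappa, interpret_kappa(kappa))
```

The Fisher’s exact test addresses whether the two classifications are associated at all, whereas Kappa quantifies how far agreement exceeds what chance alone would produce, which is why both are reported.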

Pearson correlations were calculated to examine the association of EPDS and PHQ-9 total scores with HAMD-17 total score. As a validity check, as well as to inform multivariable modelling, the correlations between the EPDS and PHQ-9 items and the HAMD-17 total score were assessed for inconsistent or disproportionate values. The correlations were interpreted as small (0.3–< 0.5), moderate (0.5–< 0.7), or large (≥ 0.7) (Hinkle et al. 2003).
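As an illustration of the correlation step, the following Python sketch computes a Pearson correlation coefficient and labels it with the bands quoted above. The function names are hypothetical, the example scores are invented, and applying the bands to the absolute value of r is an assumption; coefficients below 0.3 are given a neutral label because the text does not name that band.

```python
from math import sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def interpret_r(r: float) -> str:
    """Label a correlation using the bands quoted in the text."""
    size = abs(r)  # assumption: the bands apply to the magnitude of r
    if size >= 0.7:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.3:
        return "small"
    return "below the bands listed in the text"


# Hypothetical paired total scores, for illustration only.
epds_totals = [2, 5, 9, 14, 20, 25]
hamd_totals = [4, 6, 12, 13, 22, 24]
r = pearson_r(epds_totals, hamd_totals)
print(round(r, 2), interpret_r(r))
```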

Ordinary least squares regression models were used to develop equations to estimate HAMD-17 total score from EPDS and PHQ-9 total score, respectively. For each screening tool, three models were applied: (1) a simple regression of the HAMD-17 total score on the EPDS/PHQ-9 total score (model 1), (2) model 1 with a quadratic term added (EPDS/PHQ-9 total score squared; model 2), and (3) a stepwise fivefold cross-validation predicted residual error sum of squares (CV PRESS) model using EPDS/PHQ-9 items to predict HAMD-17 total score (model 3). For model 3, which used cross-validation as a means of preventing overfitting, covariates were selected from the 10 EPDS items and 9 PHQ-9 items, using CV PRESS as the stepwise selection and stop criterion. Age and a combined PPD history/nulliparity variable (first pregnancy, pregnancy history with PPD, or pregnancy history without PPD) were initially included as covariates but were found to be non-significant and were removed to keep the models as simple as possible for potential use in a clinical practice setting.
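The core of model 1 and the cross-validated error criterion can be sketched as follows. This Python example is a simplified illustration, not a reproduction of the SAS procedure: it fits a simple least-squares line and computes a k-fold PRESS for that single model, and it does not implement the stepwise item selection of model 3. The function names, the deterministic fold assignment by index, and the example data are all assumptions.

```python
def fit_simple_ols(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variability, so the slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def kfold_press(x: list[float], y: list[float], k: int = 5) -> float:
    """Sum of squared out-of-fold prediction errors for the simple model.

    Folds are assigned by index modulo k for reproducibility; a real
    analysis would normally assign folds at random.
    """
    press = 0.0
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        held_out = [i for i in range(len(x)) if i % k == fold]
        intercept, slope = fit_simple_ols(
            [x[i] for i in train], [y[i] for i in train]
        )
        press += sum((y[i] - (intercept + slope * x[i])) ** 2 for i in held_out)
    return press


# Hypothetical screening-tool and HAMD-17 totals, for illustration only.
screen_totals = [0, 3, 5, 8, 10, 12, 15, 18, 21, 25]
hamd_totals = [2, 6, 7, 9, 12, 13, 17, 18, 22, 24]
print(fit_simple_ols(screen_totals, hamd_totals))
print(kfold_press(screen_totals, hamd_totals, k=5))
```

A lower cross-validated PRESS indicates better out-of-sample prediction, so comparing this quantity between candidate models is one way to judge whether added terms earn their complexity.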

Two-sided tests were used and p values of ≤ 0.05 were considered statistically significant. Statistical significance was not adjusted for multiple comparisons. SAS version 9.4 (SAS Institute Inc., Cary, NC, USA) was used for all statistical analysis.

Results

In total, 199 patients provided EPDS, PHQ-9, and HAMD-17 data at day 30 and were thus eligible for the present analysis. While all analyses were conducted on the pooled sample, the patient demographics and baseline characteristics are presented overall and by treatment arm for information (Table 1). Of the 199 patients included in these analyses, most were white (121, 61%), 74 (37%) were black or African-American, and 35 (18%) were Hispanic or Latino. A higher proportion of patients had onset of PPD within 4 weeks of delivery than in the third trimester. The mean EPDS total score at day 30 was 9.04 (SD = 6.84), the mean PHQ-9 total score was 6.76 (SD = 6.52), and the mean HAMD-17 total score was 10.56 (SD = 8.47), with minimum and maximum scores of 0–27, 0–25, and 0–31, respectively.

Table 1 Baseline demographics and characteristics

Association of EPDS and PHQ-9 remission with HAMD-17 remission

There were statistically significant associations of the EPDS and PHQ-9 definitions of remission with HAMD-17 remission (Table 2), both with moderate Kappa agreement. Among patients classified as remitters according to the HAMD-17 definition, 79% and 76% were also classified as remitters by the EPDS and PHQ-9 definitions, respectively (sensitivity). Among patients classified as non-remitters on the HAMD-17, 67% and 78% were also classified as such by the EPDS and PHQ-9, respectively (specificity).
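For readers less familiar with these terms, the following Python sketch shows how sensitivity and specificity are derived when HAMD-17 remission is treated as the reference classification. The function names and the counts are hypothetical and chosen only for illustration; they are not the counts underlying Table 2.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of HAMD-17 remitters also classified as remitters by the tool."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of HAMD-17 non-remitters also classified as non-remitters."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts (not study data): 10 HAMD-17 remitters, of whom 8 are
# also remitters on the screening tool; 10 HAMD-17 non-remitters, of whom 6
# are also non-remitters on the screening tool.
print(sensitivity(true_pos=8, false_neg=2))
print(specificity(true_neg=6, false_pos=4))
```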

Table 2 Associations between HAMD-17 remitters and EPDS and PHQ-9 remitters at day 30

Association of EPDS and PHQ-9 total and item scores with HAMD-17 total score

All correlations of EPDS and PHQ-9 total and item scores with HAMD-17 total score were statistically significant (all p < 0.001; Table 3). The total scores showed large correlations (EPDS/HAMD-17: r = 0.71; PHQ-9/HAMD-17: r = 0.75). Item scores were moderately correlated with HAMD-17 total score, except for ‘thought of self-harm’ for both the EPDS and PHQ-9 and ‘able to laugh’ and ‘looked forward with enjoyment’ for the EPDS.

Table 3 Pearson correlations between the HAMD-17 Total and EPDS and PHQ-9 total and items at day 30 (all p’s < 0.001)

Estimating HAMD-17 total score from EPDS and PHQ-9

As shown in Table 4, the additional complexity of the quadratic term (model 2) and CV PRESS item-level model (model 3) did little to improve the variance explained statistics or reduce the prediction error, making model 1 the most parsimonious and recommended model to estimate HAMD-17 total score for both the EPDS and PHQ-9. Age and PPD/nulliparity were not significant predictors in any model (p > 0.05) and were thus removed. Based on model 1, the following equation can be used to estimate HAMD-17 total score from EPDS total score: HAMD-17 total = 2.66 + (EPDS total × 0.87). HAMD-17 total score can be estimated from PHQ-9 total score with a similar equation: HAMD-17 total = 3.99 + (PHQ-9 total × 0.97). This results in an integer range of estimated HAMD-17 total scores of 3–29 from the EPDS [2.66 + (0 × 0.87) = 2.66 at an EPDS total of 0; 2.66 + (30 × 0.87) = 28.76 at an EPDS total of 30] and 4–30 from the PHQ-9 [3.99 + (0 × 0.97) = 3.99 at a PHQ-9 total of 0; 3.99 + (27 × 0.97) = 30.18 at a PHQ-9 total of 27], closely aligned with the observed range of HAMD-17 total score (0–31).
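The two model 1 equations can be applied directly. The following Python sketch encodes the reported intercepts and slopes; the names `MAPPING` and `estimate_hamd17` are hypothetical, and rounding the estimate to the nearest integer is an assumption made to match the integer ranges quoted above.

```python
# Intercepts and slopes are the model 1 coefficients reported in the text;
# the maximum totals are the instrument ranges given in the Measures section.
MAPPING = {
    # instrument: (intercept, slope, maximum total score)
    "EPDS": (2.66, 0.87, 30),
    "PHQ-9": (3.99, 0.97, 27),
}


def estimate_hamd17(instrument: str, total_score: int) -> float:
    """Estimate the HAMD-17 total score from an EPDS or PHQ-9 total score."""
    intercept, slope, max_total = MAPPING[instrument]
    if not 0 <= total_score <= max_total:
        raise ValueError(f"{instrument} total must lie in 0..{max_total}")
    return intercept + slope * total_score


# Estimated HAMD-17 totals at the lowest and highest possible input scores.
for name, (_, _, max_total) in MAPPING.items():
    low = round(estimate_hamd17(name, 0))
    high = round(estimate_hamd17(name, max_total))
    print(name, low, high)
```

Because both equations have a positive intercept, the lowest possible estimate is above zero, which is why the estimated ranges start at 3 and 4 rather than at the HAMD-17 minimum of 0.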

Table 4 Regression models to predict HAMD-17 total scores from EPDS and PHQ-9 scores

Discussion

PPD is frequently unrecognized, undiagnosed, and inadequately treated (Cox et al. 2016). A more nuanced understanding of how commonly used screening tools such as the EPDS and PHQ-9 relate to clinical trial outcomes, such as the HAMD-17, may help address failures in diagnosis and treatment. The observed large associations between EPDS and PHQ-9 total score and HAMD-17 total score, as well as moderate associations of remission definitions, suggest that patients’ self-reports of their symptoms generally align with clinician ratings, such as the HAMD-17, that may be prohibitively time- and training-intensive to obtain. These associations further validate the ability of these screening tools to identify patients with PPD symptoms. Once a patient is diagnosed, management generally follows a stepped care approach, wherein patients with mild symptoms are treated through low-intensity interventions and those with symptoms that do not respond, are more severe, or present acutely are typically offered pharmacologic interventions, either alone or adjunctive to low-intensity interventions (Meltzer-Brody et al. 2018b). The observed pattern of results in these analyses, wherein higher scores on EPDS and PHQ-9 are associated with more severe symptoms as assessed by HAMD-17, may help clinicians who have identified and diagnosed PPD patients to understand and discuss with the patients what treatments are likely appropriate. Primary care physicians, obstetricians, gynaecologists, and even paediatricians who administer an EPDS or PHQ-9 may have a better ability to recognize which women are likely to benefit from low-intensity interventions, such as psychosocial support, and which may warrant prompt referral for more intensive care. Clinicians seeking to practice shared decision-making regularly need to draw on and interpret a high volume of new clinical evidence in order to describe and discuss available options, alongside possible benefits or harms (Price 2005). Being able to understand and translate the relationship between the HAMD-17 data commonly reported as evidence of treatment efficacy in clinical trials and the EPDS or PHQ-9 data being captured in clinic will hopefully assist clinicians in more confidently and readily interpreting trial evidence to present during shared decision discussions. Given limitations in healthcare resources and availability of specialists, confidence in the use of these tools for valid screening and treatment decision-making may also help ameliorate stresses to the healthcare system.

The large associations observed between total scores are notable given that the measures differ in terms of patient versus clinician rating, as well as in content, with the EPDS designed specifically for the perinatal period, the PHQ-9 initially designed for general depression, and the HAMD-17 designed to assess changes in symptom severity. The moderate associations observed between EPDS and PHQ-9 item scores and HAMD-17 total score are in line with expectations, as individual item scores are expected to show more variability than summed total scores. Furthermore, as an advance on the current reliance upon individual outcome measure cut-off values to interpret scores, the current study also conducted regression analysis to provide two simple mapping equations that can easily be applied in any clinical setting.

Limitations

While the present analysis advances current understanding and interpretation of EPDS, PHQ-9, and HAMD-17, these analyses are based on a clinical trial population, and the generalizability to the broader population should be examined through validation of this work in an external dataset. For example, it may be of interest to validate the mapping equations on a sample including a broader range of HAMD-17 total score. While the current sample included EPDS and PHQ-9 total scores from almost the full range, 0–27/30 and 0–25/27, respectively, the HAMD-17 total score range was 0–31/52. However, as any HAMD-17 total score higher than 25 indicates severe depression, the full range of depression severity required for clinical decision-making was covered in the present analysis. Indeed, the maximum baseline HAMD-17 total score in the severe PPD patients included in this analysis was 38, as has been reported in severe PPD elsewhere (Nonacs et al. 2005), suggesting that the higher values may be less common in PPD. The baseline data were not included in the analysis reported here due to the restriction in range of PHQ-9, EPDS, and HAMD-17 total score as a result of the trial inclusion criteria for moderate/severe PPD.

Conclusions

The present post hoc analysis demonstrates large and statistically significant associations between patient-reported EPDS and PHQ-9 scores and the clinician-reported HAMD-17 in an integrated efficacy dataset of PPD trials and provides the first mapping equations to permit the estimation of HAMD-17 total score from known EPDS or PHQ-9 scores. Given the role of the HAMD-17 as the gold standard criterion for evaluating the efficacy of treatments in clinical trials, the results presented here provide tools to aid interpretation of clinical trial data in clinical practice, thus aiding informed decision-making for multiple stakeholders in regulatory and clinical settings and shared decision-making in clinical practice for this critical population of women with PPD.