1 Introduction

1.1 High rates of symptoms of mental health problems in Japan

Mental health symptoms such as depression and anxiety are the leading causes of health burden worldwide [1]. The estimated number of cases of mental health symptoms worldwide rose from 655 million in 1990 to 970 million in 2019, an increase of 48% [2]. For decades, high rates of mental health symptoms have been consistently reported in Japan [3], with over half of Japanese people experiencing mental health symptoms [4].

Healthcare workers (HCWs) suffer from high rates of mental health symptoms. A systematic review and meta-analysis reported that the prevalence of burnout, depression, and anxiety among HCWs across 86 studies was 32%, 26%, and 25% respectively [5]. Mental health symptoms in HCWs can lead to diverse negative outcomes including poor patient care, treatment errors, and turnover intentions [6,7,8]. These negative outcomes often have financial implications such as increased treatment costs and staff recruitment/development costs [9, 10].

The coronavirus disease (COVID-19) pandemic has worsened the already-challenged mental health of HCWs worldwide, including in Japan [6, 11]. Studies conducted during the pandemic show that a wide range of mental health symptoms worsened [12], including elevated levels of stress [13, 14], depressive symptoms [15], insomnia [6], burnout [12], suicidal ideation [16], and loneliness [17]. For example, 50% of frontline HCWs experienced burnout [18]. Moreover, 62% of public health officials (e.g., public health nurses) suffered from job-related stress [15].

1.2 The Patient Health Questionnaire-4

The Patient Health Questionnaire-4 (PHQ-4) is an ultra-brief scale used to screen for anxiety and depression. It combines two items from the Patient Health Questionnaire-9 (PHQ-9; for symptoms of depression) and two items from the Generalized Anxiety Disorder-7 (GAD-7; for nervousness and uncontrollable worry) [19]. The PHQ-4 is a widely used and highly reliable mental health measure that has been validated in a wide range of general population groups, including people in Colombia [20], Germany [21], Greece [22], and the Philippines [23], and English- and Spanish-speaking Hispanic Americans [24]. Additionally, the scale has performed well across age, gender, partnership status, and socioeconomic groups [25], and in different languages for refugees and migrants [26, 27].

The PHQ-4 has also been validated in diverse clinical samples including pregnant women [28], pre-operative surgical patients [29], out-of-school adolescent girls and young women in Tanzania [30], and primary care patients [31]. The scale also correlates well with other mental health outcomes such as well-being [32], self-efficacy, and life satisfaction [20]. Overall, evidence consistently supports the use of the PHQ-4 for screening symptoms of depression and anxiety in a variety of global settings. However, to the best of our knowledge, the PHQ-4 has not been validated in the Japanese language with a sample in Japan. Given the high rates of mental health symptoms in Japan, detecting symptoms and assessing them regularly are essential, so a validated Japanese version of the PHQ-4 (PHQ-4-J) is needed.

1.3 Study aim

This study aimed to validate the PHQ-4-J. To achieve this, we investigated the reliability and validity of the PHQ-4-J and its subscales, the PHQ-2 and GAD-2, in a sample of Japanese people. Specifically, we evaluated the item characteristics, internal consistency, and factorial validity of the PHQ-4-J.

2 Material and methods

2.1 Study design and sample

This is a post-hoc analysis of a longitudinal study evaluating changes in mental health status in Japan during the pandemic [33]. HCWs and members of the general population were approached through Facebook groups in June 2020. Facebook is one of the most widely accessed social media platforms globally. In Japan there are 26 million active users, accounting for roughly 21% of the population [34]. Two groups (> 1000 followers in June 2020) were chosen for recruitment because (a) they were active and well managed (e.g., no abusive or discriminatory language was used), and (b) none of the authors was a well-known figure in either group, which limited potential bias. A survey link was embedded in the message we shared in the two groups. After consent was gained, participants were asked to complete the mental health scales. The survey was open for four weeks. Ethical approval was obtained from the University of Derby Research Ethics Committee (ETH1920-2929). Informed consent was obtained from all participants. No financial incentive was offered for participation.

2.2 Measure

The original English version of the PHQ-4 is a validated measure that asks respondents how often they have been bothered by the following problems in the past two weeks: (1) Feeling nervous, anxious or on edge, (2) Not being able to stop or control worrying, (3) Feeling down, depressed or hopeless, and (4) Little interest or pleasure in doing things. The first two items assess anxiety, and the last two items assess depression. Each item is rated on a four-point Likert scale (0 = “Not at all” to 3 = “Nearly every day”). The total score is the sum of the four items (range 0–12). Scores are interpreted as “normal” for 0–2, “mild” for 3–5, “moderate” for 6–8, and “severe” for 9–12 [19].
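The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure used in the study; the function name `score_phq4` and the dictionary layout are assumptions introduced here, while the item order, the response range, and the severity bands follow the description in the text.

```python
from typing import Sequence

# Severity bands from the text: 0-2 normal, 3-5 mild, 6-8 moderate, 9-12 severe.
# Each tuple holds the upper bound of a band and its label.
SEVERITY_BANDS = [(2, "normal"), (5, "mild"), (8, "moderate"), (12, "severe")]


def score_phq4(responses: Sequence[int]) -> dict:
    """Score one set of PHQ-4 responses (hypothetical helper).

    `responses` lists the four items in questionnaire order:
    items 1-2 are the anxiety items, items 3-4 the depression items.
    """
    if len(responses) != 4:
        raise ValueError("PHQ-4 requires exactly four item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be an integer from 0 to 3")

    anxiety = responses[0] + responses[1]      # two-item anxiety subscale
    depression = responses[2] + responses[3]   # two-item depression subscale
    total = anxiety + depression               # total score, range 0-12

    # Pick the first band whose upper bound is not exceeded by the total.
    severity = next(label for upper, label in SEVERITY_BANDS if total <= upper)
    return {
        "anxiety": anxiety,
        "depression": depression,
        "total": total,
        "severity": severity,
    }


if __name__ == "__main__":
    # Illustrative responses, not data from the study.
    print(score_phq4([1, 2, 0, 1]))
```

Rejecting out-of-range responses explicitly, rather than silently summing them, keeps a data-entry error from being read as a severity category.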

The PHQ-4 was first translated into Japanese by YK, a professional English-Japanese translator. The initial translation was back-translated into English by another bilingual researcher, and no major differences were detected. The final version of the PHQ-4-J (see Electronic Supplementary Material 1) was reviewed by the authors AO and HM, who are also English-Japanese bilinguals, to ensure that the wording was easy for many Japanese people to understand [35] and that content equivalence was maintained [36].

2.3 Data analysis

The item characteristics of the PHQ-4-J were calculated as measures of scale validity. These included the corrected item-total correlations, inter-correlations of items from the same subscale (J-PHQ-2, J-GAD-2), item-inter-correlations with items from the other subscale, and the intercorrelations between subscales and between each subscale and total PHQ-4-J scores. Since PHQ-4-J scores were not normally distributed, correlations were calculated using Spearman’s rho (ρ). Internal consistency was evaluated using Cronbach’s α.
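For readers who wish to reproduce these statistics outside Stata, the sketch below shows one way to compute a corrected item-total correlation with Spearman's ρ and Cronbach's α using only the Python standard library. It is an illustration under stated assumptions, not the analysis code used in this study: the function names and the small `toy_data` matrix are hypothetical, and ties are handled with average ranks.

```python
from statistics import mean, variance


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def corrected_item_total(data, item):
    """rho between one item and the sum of the remaining items."""
    item_scores = [row[item] for row in data]
    rest_scores = [sum(row) - row[item] for row in data]
    return spearman_rho(item_scores, rest_scores)


def cronbach_alpha(data):
    """Cronbach's alpha from sample variances of the items and of the sum score."""
    k = len(data[0])
    item_vars = [variance([row[j] for row in data]) for j in range(k)]
    total_var = variance([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical responses (rows = respondents, columns = four items).
    toy_data = [
        [0, 0, 0, 0], [1, 1, 0, 1], [2, 1, 1, 1],
        [3, 2, 2, 3], [1, 0, 1, 0], [2, 3, 3, 2],
    ]
    for j in range(4):
        print(f"item {j + 1}: corrected item-total rho = {corrected_item_total(toy_data, j):.2f}")
    print(f"Cronbach's alpha = {cronbach_alpha(toy_data):.2f}")
```

Removing the item from the sum before correlating is what makes the item-total correlation "corrected": it keeps the item from inflating its own correlation with the total.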

We used confirmatory factor analysis (CFA) to verify the known one- and two-factor structures of the PHQ-4. The comparative fit index (CFI), the Tucker-Lewis index (TLI), and the root mean square error of approximation (RMSEA) were used to assess the model fit of the CFA. CFI and TLI values > 0.95 [37], and RMSEA values < 0.10, were taken to indicate a good model fit [38]. Multiple Indicators Multiple Causes (MIMIC) models were used to examine associations between observed variables and latent variables [39]. Specifically, MIMIC models were used to assess the associations of age group (≥ 34 years versus < 34 years, based on a median split), gender, and group (general population versus HCWs) with scores on the PHQ-4-J (one-factor model), and on the J-PHQ-2 and J-GAD-2 (two-factor model).
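The fit criteria stated above can be written as a simple check. The sketch below is illustrative only: the function name `assess_fit` is an assumption introduced here, and the index values passed to it are hypothetical rather than results from this study. The thresholds are those given in the text (CFI and TLI > 0.95, RMSEA < 0.10).

```python
# Thresholds as stated in the text.
CFI_TLI_MIN = 0.95
RMSEA_MAX = 0.10


def assess_fit(cfi: float, tli: float, rmsea: float) -> dict:
    """Report, index by index, whether each fit criterion is met (hypothetical helper)."""
    return {
        "CFI": cfi > CFI_TLI_MIN,
        "TLI": tli > CFI_TLI_MIN,
        "RMSEA": rmsea < RMSEA_MAX,
    }


if __name__ == "__main__":
    # Hypothetical index values for two competing models.
    for name, indices in {
        "model A": (0.98, 0.96, 0.05),
        "model B": (0.96, 0.90, 0.15),
    }.items():
        result = assess_fit(*indices)
        verdict = "all criteria met" if all(result.values()) else "not all criteria met"
        print(name, result, verdict)
```

Reporting each index separately, instead of a single pass/fail flag, makes it visible when indices disagree about the same model.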

Data analyses were conducted in Stata 17.0 (StataCorp LLC, College Station, TX).

3 Results

3.1 Sample characteristics

The sample used to evaluate the item characteristics, the internal consistency, and the factorial validity of the PHQ-4-J comprised 280 people. The mean age of the sample was 34.7 years (SD = 12.4), and 66.8% identified as female; 50.7% of the sample were HCWs. Median, mean, standard deviation, skewness, and kurtosis of PHQ-4-J, J-PHQ-2, and J-GAD-2 scores for the sample are provided in Table 1. The mean age of this subsample was 33.9 years (SD = 11.7), and 60.4% were female; 46.3% of this subsample were HCWs. Table 2 presents the frequency of use of each response per item.

Table 1 PHQ-4-J, J-PHQ-2, and J-GAD-2 scores
Table 2 Frequency of use of each response per item

3.2 Item characteristics

Corrected item-total correlations (i.e., correlations between a given item and the total PHQ-4-J sum score with that item removed) ranged from ρ = 0.57 to 0.63. Inter-correlations of items from the same subscale were ρ = 0.70 for the J-PHQ-2 and ρ = 0.48 for the J-GAD-2. The item-inter-correlations with items from the other subscale ranged from ρ = 0.43 to 0.55. The intercorrelation between subscales was ρ = 0.61, and the correlations with total PHQ-4-J scores were ρ = 0.90 for the J-PHQ-2 and ρ = 0.89 for the J-GAD-2. All correlations were significant at p < 0.001. Tables 3 and 4 present these item characteristics.

Table 3 Corrected item total correlations
Table 4 Inter-correlations of items from the same subscale, item intercorrelations with items from the other subscale, intercorrelations between subscales and between subscales and total scores

3.3 Internal consistency

Cronbach’s α for the PHQ-4-J was 0.84. Cronbach’s α if an item was dropped varied from 0.75 to 0.83. Cronbach’s α for the J-PHQ-2 and the J-GAD-2 was 0.86 and 0.70, respectively.

3.4 Factorial validity

We used CFA to test the one- and two-dimensional structures of the PHQ-4-J. Factor loadings for the one-factor model were high (0.62–0.92). For the one-factor model, the CFI indicated a good fit, whereas the TLI and the RMSEA did not meet the thresholds (CFI = 0.97, TLI = 0.90, RMSEA = 0.17, 95% CI 0.11–0.25). The two-factor model showed an improved model fit (CFI = 0.99, TLI = 0.99, RMSEA = 0.04, 95% CI 0.00–0.17), with factor loadings ranging from 0.72 to 0.95. The MIMIC models for both the one- and two-factor structures are displayed in Fig. 1 (age range scores are in Electronic Supplementary Material 2). Age and gender were not associated with scores on the PHQ-4-J, J-PHQ-2, or J-GAD-2. Group (i.e., general population versus HCWs) was associated with the PHQ-4-J (0.16, p = 0.012), indicating that HCWs had higher total scores, that is, more symptoms of depression and anxiety. This suggests that PHQ-4-J scores differed between the general population and HCWs in Japan. Looking at the J-PHQ-2 and J-GAD-2 separately revealed that HCWs had higher scores on both, but this was more pronounced for the J-GAD-2 (J-PHQ-2: 0.12, p = 0.046; J-GAD-2: 0.27, p < 0.001).

Fig. 1 Multi-group MIMIC model for one- and two-dimensional structures. *p < 0.05; **p < 0.001

4 Discussion

Our study aimed to validate the PHQ-4-J. The results for its reliability and validity were promising, indicating that the PHQ-4-J is a participant-friendly and reliable mental health measure in the Japanese language.

4.1 Implications for research

The PHQ-4-J can address some of the existing problems in mental health research in Japan. One major area of concern is low response rates. For example, a nationwide survey, the World Mental Health Japan Survey, suffered from poor response rates: 55% for the 1st survey, and 43% for the 2nd survey [40]. This survey lasts an average of two hours [41]. Low response rates may lead to sample biases, because people who complete a measure may differ systematically, in ways related to the measured symptoms, from those who do not [42]. That is, completers of a lengthy mental health survey tend to be those who have experienced mental health symptoms, which may in turn make them more willing to invest their time in the survey [43, 44]. The PHQ-4-J, at only four items long, imposes little participant burden and can therefore help address this problem.

The PHQ-4-J can also contribute to disaster research. Japan has experienced many types of natural disasters including earthquakes, floods, and typhoons. However, the mental health of people in Japan during these events, including those who offer care, remains under-researched [45]. This is important to address, because although the psychological impacts of these events are long-lasting, research has not been able to assess them [46]. An ultra-brief scale such as the PHQ-4-J can help rectify this problem. People affected by a natural disaster are more likely to complete and repeat this four-item scale than longer, more burdensome scales.

The proportion of older people (≥ 65 years old) has been rapidly increasing in Japan. In 2022, 29% of the national population were 65 years old or older, and this is set to increase [47]. Therefore, assessing older people’s mental health will become increasingly important. Gerontology research recommends the use of short scales for older people to reduce burden and assist their participation [48]. The PHQ-4-J is a short and easy-to-understand scale, and would therefore likely be appropriate for use with this growing population group.

The mean scores of our general population sample were similar to those of the US sample (primary-care patients) in the original PHQ-4 development paper, whereas those of our HCW sample were higher (US sample: anxiety, two items, 1.4 ± 1.7; depression, two items, 1.0 ± 1.4; total 2.5 ± 2.8) [49]. Consistent with other COVID studies [50,51,52], this may highlight the heightened mental distress among HCWs during the COVID pandemic. Additionally, the mean scores of our general population sample were lower than those of a US general population sample during COVID (mean ± SD: anxiety, two items, 1.67 ± 1.97; depression, two items, 1.60 ± 1.89; all four items 3.28 ± 3.67) [53]. The difference may be partly explained by the resilience of Japanese people to emergencies, including natural disasters (e.g., as seen in the “Bosai Culture” of Japan, referring to the ingrained attitudes and systems focused on disaster preparedness and resilience) [54, 55]. Many Japanese residents experience various natural disasters such as earthquakes, tsunamis, and floods, which may have made them mentally less affected by the pandemic compared to US residents [56]. Another explanation may be response biases derived from cultural differences [57]. Self-report measures can be susceptible to response biases [58]. People in the USA tend to give more extreme responses than people in Japan [59]. Moreover, shame towards mental health problems tends to be strong among Japanese people [60], which relates to their perspectives on mental health [61]. For the global use of the PHQ-4, these differences need to be further evaluated [62].

Lastly, as the PHQ-4-J is a screening tool, its sensitivity and specificity for detecting mental health problems need to be evaluated in future research. Sensitivity refers to the proportion of individuals who actually have mental health problems that the tool correctly identifies, and specificity refers to the proportion of individuals who do not have mental health problems that the tool correctly identifies as such [63]. Both are essential for the effectiveness of screening tools [64].
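As a worked illustration of these two quantities, the sketch below computes them from the four cells of a screening-versus-diagnosis cross-tabulation. The function names and the counts are hypothetical and are not data from this study, which did not collect diagnostic information.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Proportion of people with the condition whom the screen flags."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Proportion of people without the condition whom the screen does not flag."""
    return true_negatives / (true_negatives + false_positives)


if __name__ == "__main__":
    # Hypothetical counts: 50 people with a diagnosis, 100 without.
    tp, fn = 40, 10   # diagnosed: screened positive / screened negative
    tn, fp = 90, 10   # not diagnosed: screened negative / screened positive
    print(f"sensitivity = {sensitivity(tp, fn):.2f}")
    print(f"specificity = {specificity(tn, fp):.2f}")
```

Both quantities depend on the cut-off score chosen for a positive screen, so in practice they would be reported for each candidate cut-off.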

4.2 Implications for practice

The PHQ-4-J can help mental health practice in four ways. Firstly, as a concise and non-intrusive questionnaire, the PHQ-4-J provides an easy way for an HCW to open up a dialogue with a patient about their mental health. This is critical as mental health is a sensitive subject in Japan [57]. The PHQ-4-J can offer a safer way to discuss mental health with patients, especially those who are not being seen for mental health concerns, as they may be more reluctant to talk about these issues [65].

Secondly, the low patient burden of completing the scale can benefit HCWs too. HCWs often support patients in completing an assessment by responding to questions that arise from the scale. The ultra-brief PHQ-4-J would require less HCW support for patients. Healthcare services in Japan are chronically understaffed [66]. For example, the number of nursing staff per bed is markedly lower in Japan than in other countries: Japan 0.9, the UK 3.1, Canada 3.9, and the US 4.1 per bed [67]. Since the outbreak of the COVID-19 pandemic in particular, the number of people applying to healthcare roles has decreased, for reasons such as high workload. At the current pace, the healthcare workforce in Japan is expected to lack about two million workers by 2030 (12 million employed for 14 million needed) [68]. Reducing the support that HCWs must give patients to complete the scale therefore also helps in practice by reducing HCW workload.

Thirdly, the PHQ-4-J can be used at an intake session for early detection of mental health symptoms. Early detection of mental health symptoms is associated with improved patient outcomes, as appropriate treatments or referrals can happen sooner [69]. This can prevent symptoms from worsening and developing into more severe mental health problems [70].

Lastly, the PHQ-4-J is conducive to personalised treatment. The need for personalised treatment is increasing in many countries including Japan [71, 72]. To assess the impact of personalised treatment, regular assessments of patient mental health are needed [73]. Regular assessments also help identify possible adverse effects of the treatment. The PHQ-4-J is well suited to regular, repeated assessments, which can inform the development of personalised treatment.

4.3 Limitations

Several limitations need to be noted. First, the use of Facebook groups for recruitment may have caused sample selection bias, such as judgement and/or convenience sampling bias. Second, although our sample size is regarded as a “good” size [74], a larger sample would yield more generalisable results. Moreover, more specific populations instead of the general population could be explored in future research. Third, we were unable to assess the construct validity of the PHQ-4-J or potential clinical cut-offs for depression and anxiety. Future research should seek to co-administer other measures of depression and anxiety alongside the PHQ-4-J, as well as gather information about mental health diagnoses, in order to assess these aspects of the scale.

5 Conclusion

We validated the PHQ-4-J in a sample of HCWs and the general population in Japan. The PHQ-4-J has several research and practice implications, suggesting potential for high utility of the scale in Japan. In research, the PHQ-4-J can help improve response rates by reducing participant burden, thereby minimising sample bias, and might also lend itself to use in disaster research. In practice, the PHQ-4-J can help facilitate conversations about mental health with patients, reduce HCW workload, and support early detection and personalised treatment. Though the PHQ-4-J still needs to be tested in more specific populations, our findings provide evidence that the PHQ-4-J is a reliable ultra-brief scale for depression and anxiety in the Japanese language, which can be used to address current problems and needs in mental health research and practice in Japan.