Depression is a serious public health problem and a common mental disorder that adversely affects quality of life (Klemanski et al. 2017; Paus et al. 2008). A meta-analysis indicated that the prevalence of depression among Iranian students was 33% (95% CI: 32–34), which is high compared with the general population of Iran (Sarokhani et al. 2013). Depression increases the risk of substance abuse, suicide attempts, sleep disturbance, lack of self-care, poor concentration, anxiety, and lack of interest in everyday experiences (Ibrahim et al. 2013). The cost of affective disorders can be particularly high in young people, since they represent the future of any community, its hope, and its potential leaders (Ibrahim et al. 2013).

Undergraduate students, who are in transition from adolescence to young adulthood, must complete demanding coursework and strengthen their résumés in order to find a suitable job in a competitive environment. Therefore, they are more prone to a variety of mental health problems such as depression (Lei et al. 2016; Nezam et al. 2020). Because universities lack healthcare providers and students have limited access to diagnostic resources such as psychiatrists and psychologists, these problems may go undetected, their severity may increase, and depression may become chronic (Farabaugh et al. 2019; Hill et al. 2015). The use of highly sensitive and specific screening tools can reduce the workload of healthcare providers and allow them to spend more time with people who have depression (Hill et al. 2015).

Screening is one way to improve depression care (Gilbody et al. 2008; Thombs et al. 2012). Among the variety of tools for screening depression, there is a need for a brief and valid tool that can identify depression and its severity (Ferreira et al. 2019). The Patient Health Questionnaire-9 (PHQ-9), its ultra-brief version (PHQ-2), and the Well-being Index (WHO-5) are among the briefest and most widely used instruments; they have been translated into many languages and validated in various settings (Wu 2014). The initial validation studies recommended a certain factor structure and certain cut-off scores for each instrument. However, recent studies performed in different settings revealed that the cut-off scores for these instruments cannot be generalized (Manea et al. 2012). For this reason, there is a need to validate each instrument and find its best cut-off score in each setting (Beard et al. 2016; Kohrt et al. 2016; Smith Fawzi et al. 2019). Accordingly, we investigated the validity, reliability, and optimal cut-off points of the PHQ-2, PHQ-9, and WHO-5 for screening mild depression among Iranian medical university students.

Method

Sample

The study included 400 undergraduate students from Tehran, Iran, Mazandaran, and Kashan medical universities. A sample size of 250 has been recommended as adequate for cut-off point studies (Manea et al. 2012). This study was conducted with approval from the ethics committee of Tehran University of Medical Sciences. The purpose of the study was explained to the participants, and informed consent was obtained from each participant. Detailed sample characteristics are provided in the Results section.

Procedure

The participants completed the self-administered PHQ-9, PHQ-2, WHO-5, and Beck Depression Inventory (BDI-13). In addition, a psychiatrist with 15 years of diagnostic experience interviewed each student to determine the student's depression level; the interviews were conducted over 7 weeks. The paper questionnaires were completed, and the face-to-face interviews were conducted, in classrooms and libraries.

Instruments

The Patient Health Questionnaire-9 (PHQ-9) is a nine-item scale that was developed and validated as a depression screening tool for primary care and non-psychiatric settings (Kroenke and Spitzer 2002; Liu et al. 2011; Zuithoff et al. 2010). It is quicker to administer than the BDI because of its brevity and ease of scoring. Each item is rated from 0 to 3 (0, not at all; 1, several days; 2, more than half of all the days; 3, nearly every day), and total scores range from 0 to 27, with 27 indicating the greatest severity of depressive symptoms. The optimal cut-off score of the PHQ-9 can differ between patients and the general community and between screening and diagnostic purposes (Kendrick et al. 2009; Kroenke et al. 2010). The PHQ-9 was previously translated and validated in Iran for patients (Dadfar et al. 2018a; Khamseh et al. 2011; Omani-Samani et al. 2018); however, it has not been validated for university students.

The Patient Health Questionnaire-2 (PHQ-2) includes the first two items of the PHQ-9 and is usually used as the initial screening instrument for major depressive disorder (MDD). Total scores range from 0 to 6, with 6 indicating the greatest severity of depressive symptoms. The accuracy of the PHQ-2 has been examined in different studies (Dadfar et al. 2019; Jafari et al. 2014).
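The scoring rules described for the PHQ-9 and PHQ-2 can be sketched as follows. This is only an illustration of the arithmetic, not the scoring procedure used in the study; the function names and the example responses are hypothetical.

```python
from typing import Sequence


def phq9_total(items: Sequence[int]) -> int:
    """Sum nine item responses, each coded 0-3, into a total of 0-27."""
    if len(items) != 9:
        raise ValueError("the PHQ-9 needs exactly nine item responses")
    if any(r not in (0, 1, 2, 3) for r in items):
        raise ValueError("each response must be 0, 1, 2, or 3")
    return sum(items)


def phq2_total(items: Sequence[int]) -> int:
    """PHQ-2 total (0-6): the sum of the first two PHQ-9 items."""
    phq9_total(items)  # reuse the validation above
    return items[0] + items[1]


# Made-up responses of one hypothetical respondent.
responses = [2, 1, 0, 3, 1, 0, 0, 1, 0]
print(phq9_total(responses), phq2_total(responses))
```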

The WHO-5 Well-being Index is a widely used tool for depression screening consisting of five items rated on a 6-point Likert scale as follows: at no time (0), some of the time (1), less than half of the time (2), more than half of the time (3), most of the time (4), and all of the time (5). The raw total score ranges from 0 to 25 and is conventionally multiplied by 4 to give a score from 0 (worst well-being) to 100 (best well-being). The validity of the WHO-5 was verified among Iranian outpatients (Dadfar et al. 2018b).
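A minimal sketch of WHO-5 scoring, assuming the conventional approach in which the five responses (0 to 5) are summed into a raw score of 0 to 25 and multiplied by 4 to obtain the 0 to 100 scale. The function name and the example responses are hypothetical.

```python
from typing import Sequence, Tuple


def who5_scores(items: Sequence[int]) -> Tuple[int, int]:
    """Return (raw score 0-25, percentage score 0-100) for five responses."""
    if len(items) != 5:
        raise ValueError("the WHO-5 needs exactly five item responses")
    if any(r not in range(6) for r in items):
        raise ValueError("each response must be an integer from 0 to 5")
    raw = sum(items)
    return raw, raw * 4  # assumption: conventional x4 rescaling


# Made-up responses of one hypothetical respondent.
raw, percentage = who5_scores([3, 2, 4, 1, 2])
print(raw, percentage)
```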

The BDI-13 was developed to assess the severity of depression. It has 13 items that are rated on a four-point Likert scale from 0 to 3 in terms of intensity, and total scores range from 0 to 39, with 39 indicating the greatest severity of depressive symptoms. The BDI-13 has been validated and is widely used in Iran (Dadfar and Kalibatseva 2016).

Statistical Analysis

SPSS 19, STATA, and AMOS were used for the statistical analysis. The characteristics of the participants and the mean depression score for each tool were determined. Normality of the distributions was checked with the Shapiro-Wilk test. The independent t test and the non-parametric Mann-Whitney test were used to compare mean depression scores between groups defined by gender, marital status, and academic grade. Concurrent validity was tested using Pearson's correlation (r) between the BDI-13 and the other tools. Internal consistency was measured using Cronbach's α coefficient. Construct validity was evaluated using confirmatory factor analysis (CFA) and fit indices including chi square/df (χ2/df), Root Mean Square Error of Approximation (RMSEA), Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), Goodness of Fit Index (GFI), Incremental Fit Index (IFI), and Root Mean Square Residual (RMR).
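For readers unfamiliar with Cronbach's α, the coefficient can be sketched as follows. The study computed it with statistical software; this stand-alone version only illustrates the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores), and the function name and data are hypothetical.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items table of scores."""
    k = len(rows[0])  # number of items
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up two-item data for four hypothetical respondents.
toy = [[0, 1], [1, 0], [2, 3], [3, 2]]
print(round(cronbach_alpha(toy), 2))
```

With perfectly consistent items (every respondent gives the same answer to all items) the coefficient equals 1; the noisier toy data above give a lower value.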

The accuracy of the questionnaires was compared against the psychiatrist's diagnosis using the receiver operating characteristic (ROC) curve and the area under the curve (AUC). Sensitivity, specificity, positive and negative predictive values, and optimal cut-off points were calculated for each screening tool.
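The accuracy measures named above can be sketched for a single cut-off point as follows. Two points are assumptions rather than details taken from the study: a respondent is treated as screen-positive when the score is at or above the cut-off, and the optimal cut-off is chosen by maximizing Youden's J (sensitivity + specificity − 1), since the selection criterion is not stated. Function names and data are hypothetical.

```python
from typing import Dict, Iterable, Sequence


def accuracy_at_cutoff(scores: Sequence[float],
                       diagnosed: Sequence[bool],
                       cutoff: float) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, and NPV when score >= cutoff is positive.

    Assumes at least one diagnosed and one non-diagnosed respondent.
    """
    tp = fp = tn = fn = 0
    for score, is_case in zip(scores, diagnosed):
        positive = score >= cutoff  # assumption: higher score = more symptoms
        if positive and is_case:
            tp += 1
        elif positive:
            fp += 1
        elif is_case:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
    }


def best_cutoff(scores: Sequence[float],
                diagnosed: Sequence[bool],
                cutoffs: Iterable[float]) -> float:
    """Cut-off with the largest Youden's J (an assumed criterion)."""
    def youden_j(c: float) -> float:
        acc = accuracy_at_cutoff(scores, diagnosed, c)
        return acc["sensitivity"] + acc["specificity"] - 1
    return max(cutoffs, key=youden_j)


# Made-up screening scores and reference diagnoses for eight respondents.
scores = [1, 2, 3, 4, 5, 6, 7, 8]
diagnosed = [False, False, False, True, True, False, True, True]
chosen = best_cutoff(scores, diagnosed, range(1, 9))
print(chosen, accuracy_at_cutoff(scores, diagnosed, chosen))
```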

Results

The study recruited a total of 442 students, of whom 400 (90.49%) completed the questionnaires and the Structured Clinical Interview for DSM (SCID). Among the participants, 206 (51.5%) were male and 194 (48.5%) were female; the mean age was 23.67 ± 5.37 years, with a range of 18 to 47 years. Demographic characteristics of the participants and mean differences between groups are presented in Table 1.

Table 1 Demographic characteristics of the participants and mean differences between groups

The scores of the PHQ-2, WHO-5, PHQ-9, and BDI-13 in males (1.92, SD = 1.6; 9.46, SD = 6.3; 6.72, SD = 5.5; 6.92, SD = 8.5, respectively) and females (1.64, SD = 1.5; 8.85, SD = 6.5; 5.54, SD = 5.1; 4.04, SD = 5.8, respectively) differed significantly for the BDI-13 (P = 0.028) and the PHQ-9 (P < 0.001). The internal consistency (Cronbach's α) of the PHQ-2, WHO-5, PHQ-9, and BDI-13 was 0.73, 0.94, 0.88, and 0.94, respectively. Factor loadings were greater than the threshold value of 0.40. Descriptive statistics and the internal consistency of each tool are presented in Table 2.

Table 2 Descriptive statistics and internal consistency

A CFA was conducted for each tool to test its construct validity. The goodness of fit indices, including the normed fit index, relative fit index, incremental fit index, Tucker-Lewis index, comparative fit index, and root mean square error of approximation, were satisfactory. Table 3 shows the CFA results.

Table 3 Comparison of the goodness of fit

Concurrent validity of the scales was assessed using Pearson correlation analysis. The total scores of the PHQ-2 (r = 0.53), WHO-5 (r = 0.54), and PHQ-9 (r = 0.60) were significantly correlated with the BDI-13 total score (P < 0.001). Correlations between the other tools were also statistically significant (PHQ-9 and PHQ-2, r = 0.86; PHQ-9 and WHO-5, r = 0.68; PHQ-2 and WHO-5, r = 0.66; P < 0.001).

The areas under the ROC curve of the PHQ-9 (AUC: 0.851, 95% CI: 0.814–0.888), WHO-5 (AUC: 0.823, 95% CI: 0.782–0.863), and PHQ-2 (AUC: 0.809, 95% CI: 0.767–0.851) indicate that the PHQ-9 provided the highest level of discrimination for mild depression (see Fig. 1).

Fig. 1 The ROC curve

The accuracy, including the sensitivity and specificity, of different cut-off points for the PHQ-2, PHQ-9, and WHO-5 is presented in Table 4. The best cut-off points for mild depression were 2 for the PHQ-2 (sensitivity: 80.22%, specificity: 66.51%), 5 for the PHQ-9 (sensitivity: 84.62%, specificity: 70.18%), and 9 for the WHO-5 (sensitivity: 79.12%, specificity: 70.64%). The results shown in Table 4 indicate that the PHQ-9 has the highest accuracy and could effectively discriminate between students with and without mild depression.

Table 4 Optimal cut-off points for mild depression in the PHQ-2, PHQ-9, and WHO-5

Discussion and Conclusion

We aimed to validate the WHO-5, PHQ-9, and PHQ-2 as screening tools for depression among Iranian medical sciences students. Consistent with previous studies in different populations (Arroll et al. 2010; Kroenke et al. 2001), the internal consistency of the tools was satisfactory. The concurrent validity results indicated that these tools are significantly correlated with the BDI-13, thereby confirming the results of previous studies (Cameron et al. 2011; Dum et al. 2008). We also examined the goodness of fit for the unidimensional structure of the PHQ-9 and WHO-5 and for the three dimensions of the BDI-13 (cognitive, somatic, and affective symptoms). In line with prior studies (Al-Turkait and Ohaeri 2010; Guðmundsdóttir et al. 2014; Keum et al. 2018), the results indicated satisfactory goodness of fit for the tools.

Incorporating sensitivity and specificity, the AUC was calculated for each tool to estimate the probability that the tool correctly classifies students as depressed or non-depressed (Hanley and McNeil 1982). The AUC values were greater than 0.80, indicating that the screening tools performed well (Holmes 1998). The validity of the WHO-5 (0.823), PHQ-9 (0.851), and PHQ-2 (0.809) was thus supported by excellent discrimination. A cut-off point of five has been recommended for mild depression on the PHQ-9 (Kroenke et al. 2001). The current study confirmed that this cut-off point was optimal for screening mild depression among the participants, with a sensitivity of 84.62% and a specificity of 70.18%. These findings suggest that the PHQ-9 is a successful tool for screening depression among students. The original cut-off point recommended for the PHQ-2 was three (Kroenke et al. 2003). However, our findings indicated that a cut-off point of two gave the optimal discrimination, with a sensitivity of 80.22% and a specificity of 66.51%. For the WHO-5, a cut-off point of nine has been recommended for depression screening among adolescents (Allgaier et al. 2012). The present study confirmed this result, with a sensitivity of 79.12% and a specificity of 70.64% at this cut-off point.
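The probabilistic reading of the AUC (Hanley and McNeil 1982) can be illustrated with a short sketch: the AUC equals the proportion of all depressed/non-depressed pairs in which the depressed student has the higher screening score, with ties counted as one half. The function name and the scores below are hypothetical, and higher scores are assumed to indicate more depressive symptoms.

```python
from typing import Sequence


def auc_pairwise(depressed: Sequence[float],
                 non_depressed: Sequence[float]) -> float:
    """AUC as the share of pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for d in depressed:
        for n in non_depressed:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(depressed) * len(non_depressed))


# Made-up screening scores for two small hypothetical groups.
print(auc_pairwise([4, 5, 7, 8], [1, 2, 3, 6]))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.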

In conclusion, the PHQ-9, PHQ-2, and WHO-5 are brief, easy-to-use, valid, and reliable tools for screening depression among Iranian medical university students. Cut-off points of two, five, and nine are recommended to identify students with mild depression using the PHQ-2, PHQ-9, and WHO-5, respectively. The PHQ-9 had the highest AUC value, and it is therefore recommended for screening and follow-up assessment.

One strength of the study is that the participants received the SCID reference standard assessment after the screening tests. However, the study has certain limitations. First, test-retest reliability could not be assessed, because follow-up data were not collected after face-to-face interactions were stopped owing to the COVID-19 pandemic. Second, the participants were recruited by convenience sampling, and medical students may not be representative of the entire student population in Iran. Therefore, further studies should be conducted with larger samples recruited by random sampling methods.