Introduction

Depression is a deterioration of a person’s mental health that lowers the individual’s ability to function and is characterized by sadness, emptiness, or irritability1. By impairing the ability to perform tasks, depression can reduce productivity and cause impairment amounting to disability2,3. The age of onset for depressive disorders is typically between 30 and 35 years4. Factors associated with a tendency to suffer from depression include job instability and low social support5,6,7.

Depressive symptoms constitute one of the main mental health problems; thus, most psychiatric care is devoted to treating depression8. Worldwide, an estimated 12.26% of the population suffers from mental disorders, with depression at 3.44%; in Central America, the prevalence of depressive disorders stands at 3.19%9. In Costa Rica, the prevalence of symptoms consistent with major depressive disorder was 9.5% before the COVID-19 pandemic (n = 797) and 61.0% during it (n = 6786)7. In addition, the country has a rate of 7.6 (95% CI 5.29–10.53) deaths by suicide per 100,000 inhabitants, ranking above most countries in the Americas10. Although suicidal ideation is one of the characteristics of depression, imminent suicide is difficult to predict11.

There are sex differences in the comorbidity of mental disorders: women tend to develop internalizing disorders, while men tend to develop externalizing disorders such as substance misuse6. As in the rest of the world, women in Costa Rica have a higher prevalence of depression than men7,12, although the reasons for these differences are still being studied4,13.

Depression has consequences for the economy, through the inability to carry out work activities; for quality of life, through the disability it produces; and for society more broadly, through loss of life and reduced educational opportunities, work, and social connections, all of which affect future generations3,14,15. In Costa Rica, 8.0% of disability is due exclusively to depressive disorders, excluding substance-related disorders, self-harm, and suicide, and it mainly affects working-age people3. This implies that mental health depends in part on the social and economic conditions in which a person lives.

Because depression is a treatable and preventable disease, its early identification avoids health complications. Prevention, early detection, and treatment efforts are therefore essential15. With adequate detection of depression, $11,134 per year could be saved per woman and $34,065 per man15. Thus, routine detection of depression in populations is cost-effective regardless of gender or age15. Population screening through self-administered surveys has been effective, including during the COVID-19 pandemic16.

The benefits of detecting depression early can only be achieved with proper studies, since erroneous diagnoses not only make detection more difficult but can also lead to great emotional distress, a worse prognosis, and discrimination based on social characteristics or group membership17,18,19,20.

Therefore, it is important to screen the population properly. If a tool is used in circumstances other than those for which it was designed, such as a different culture or language, it must be adapted so that the versions are equivalent21. Equivalence across populations is closely linked to validity, especially when a tool is administered across cultures22. Several kinds of equivalence are important considerations23: conceptual equivalence (the meaning attributed to a given concept), item equivalence (guaranteeing relevance and clarity), semantic equivalence (cross-cultural meaning), operational equivalence (administration, scoring, and interpretation under the same conditions for quality assessment), measurement equivalence (identical psychometric properties between cultures), and functional equivalence (fulfillment of the test's purpose: adequate measurement of the construct while respecting group independence).

The PHQ-9 is a test used to assess the severity of depression in adults. It is self-administered and can be completed within 5 min. It can be used to monitor the evolution of depression in response to treatment24,25. It can be administered to several patients at once, with results available in a short time. It is often used in primary care to diagnose depression and its severity quickly and efficiently26. It has been studied for its strong relationship with the GAD-7 anxiety test27,28 and the Fear of COVID-19 Scale (FCV-19S)29, and for its inverse relationship with resilience as measured by the RS-14 scale30,31.

The PHQ-9 has been translated into many languages32,33,34,35. It has been validated in Latin American countries such as Peru36,37 and Chile38,39, and has been widely used to identify symptoms associated with depression during the COVID-19 pandemic throughout the world40,41,42,43. However, at the time of writing, no validation in Costa Rica has been reported.

There have been only a few studies of depression in Costa Rica, and their prevalence data are not close to those of other populations44. Prevalence studies place depression at very different values: 1.08%45, 12.6%46, and 27.3%7. Likewise, depression manifests in women at a ratio of 3 to 1 relative to men. Culturally, men do not express their feelings, and it is only after 30 years of age that the first diagnoses of depression are given47.

The main goal of this article is to evaluate the psychometric properties of the Spanish version of the PHQ-9 questionnaire and its variants (PHQ-8, PHQ-4, and PHQ-2) in a sample of Costa Rican adults by testing the factorial structure48. We expect a unifactorial structure, as in the original validation of the tool in Spanish49. The second objective is to assess the cross-sectional multigroup invariance (MGCFA) of the PHQ-9 test and its variants between female and male participants.

To do this, the descriptive statistics of the tests will be verified, their psychometric properties evaluated, new cut-off points established for both the general population and differentiated by sex, MGCFA multigroup invariance verified, and relationships between the tests established through multivariate analysis.

Methods

Study population

Costa Rica is listed as the happiest country in Latin America50. However, the country lacks robust health coverage and mental health research is not advanced, which means that data on the characteristics of depression are unknown51. This situation is more pronounced in rural areas, where mental health professionals are scarcer51. Additionally, addresses are often described in relation to local landmarks rather than street names or house numbers, which complicates the provision of mental healthcare.

In Costa Rica, 84.7% of the population has internet access (87.0% urban and 78.7% rural), 96% have phones, 47% have a computer, and 16.7% have a tablet52. Costa Rica has a low level of upper secondary and post-secondary educational attainment (18%) and the lowest percentage of doctoral attainment (0.1%) among the OECD countries53.

Its ethnic distribution is as follows: Caucasians constitute 83.6%, mixed race 6.72%, and South American Indians 2.42%; the rest are Undeclared or Others54.

Costa Rican Spanish has twelve recognized linguistic features spread across five large regions, which produce internal sociolinguistic differences55. It also has a complex set of unique social characteristics, including gypsyisms such as achará56, which make the language difficult for people from other cultures to understand and use. The questionnaire had previously been validated where other varieties of Spanish are spoken57, such as Spain and Mexico.

The inclusion criteria were being an adult residing in Costa Rica at the time of evaluation, being able to read and understand Spanish, and having given explicit consent to participate in the study.

A total of 14,702 responses were obtained. To facilitate responses, completing the survey was not made mandatory. Of the responses obtained, 13,506 had some missing data and were eliminated; missing data such as age or sex, or incomplete responses to the tool, were the criteria for elimination. Given that sufficient data remained, a decision was made to perform data cleaning of this magnitude. An additional 34 responses were removed because they were completed too quickly. The final sample consisted of 1162 completed surveys. The age, gender, civil status, education level, province, and employment status of each participant were recorded (Table 1).
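The sample flow reported above can be checked with simple arithmetic. The sketch below only restates the counts given in the text; the variable names are illustrative and not part of the study materials.

```python
# Counts reported in the text (variable names are illustrative).
total_responses = 14_702
removed_missing_data = 13_506   # responses with some missing data
removed_too_fast = 34           # responses completed too quickly

final_sample = total_responses - removed_missing_data - removed_too_fast
retention_rate = final_sample / total_responses

print(final_sample)             # 1162
print(f"{retention_rate:.1%}")  # 7.9%
```

Roughly 8% of the responses received were therefore retained for analysis.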

Table 1 Sociodemographic characteristics of the sample.

A prior analysis indicated that a minimum sample size of 1100 subjects was required to estimate structural equation models58.

Instruments

The use of the PHQ-9 during the pandemic allows adequate conclusions to be reached59. The PHQ-9, which was administered here, is a screening tool used to detect symptomatology consistent with depression. It has excellent psychometric properties, both in internal consistency (α = 0.89) and in test–retest reliability (ICC = 0.84)26,60. Based on Classical Test Theory, its items have scores ranging from 0 to 3. The total score is obtained by adding the score of each item and ranges from 0 to 27. Scoring uses cut-off points of 5, 10, 15, and 20, resulting in the following classification: minimal depression (0–4); mild depression (5–9); moderate depression (10–14); moderate to severe depression (15–19); and severe depression (20–27). The PHQ-9 test consists of nine main items plus three additional items that ask whether the respondent has experienced discomfort due to any of the problems mentioned in the main items. These additional items refer to the difficulty caused by these problems in three contexts: work, household tasks, or social relations. They have response options ranging from 0 (no difficulty) to 3 (extreme difficulty).
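The scoring rule described above can be sketched in a few lines. This is a minimal illustration of the original cut-off points (5, 10, 15, 20) stated in the text, not the scoring code used in the study; the function names are hypothetical.

```python
def phq9_total(items):
    """Sum the nine main PHQ-9 items, each scored 0 to 3."""
    if len(items) != 9:
        raise ValueError("the PHQ-9 has nine main items")
    if any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("each item must be scored 0, 1, 2 or 3")
    return sum(items)


def phq9_severity(total):
    """Map a total score (0 to 27) to the original severity bands."""
    if not 0 <= total <= 27:
        raise ValueError("total score must lie between 0 and 27")
    if total <= 4:
        return "minimal"
    if total <= 9:
        return "mild"
    if total <= 14:
        return "moderate"
    if total <= 19:
        return "moderate to severe"
    return "severe"


# Hypothetical respondent answering "1" to every item.
total = phq9_total([1] * 9)
print(total, phq9_severity(total))  # 9 mild
```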

For this research, the team made minor modifications, taking various versions of the PHQ-9 test as examples. We selected items from versions of the test that had already been validated in other Spanish-speaking countries. These cultural modifications were made to improve comprehension while preserving the original meaning of the items (see Supplementary material).

The PHQ-9 has several variants. The PHQ-2 is contained within the PHQ-9: it corresponds to the first two items and forms a reduced version of the main questionnaire. These two items are considered the core criterion for depressive disorder. Its scores range from 0 to 6, with scores greater than or equal to 3 indicating depression61,62,63,64. The PHQ-8 includes all items of the PHQ-9 except the ninth item, which relates to suicide, and indicates current depression at a cut-off point of 10 or higher65. The PHQ-4 consists of the first two items of the PHQ-9 combined with the first two items of the GAD-7 (Generalized Anxiety Disorder-7), thus examining both depression and anxiety66.
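Because every variant is a subset of the same item responses, all four scores can be derived from one administration of the PHQ-9 and GAD-7. The following is a minimal sketch under that assumption; the function name and the example responses are hypothetical, and only the PHQ-2 (≥ 3) and PHQ-8 (≥ 10) original cut-offs mentioned above are applied.

```python
def phq_variant_scores(phq9_items, gad7_items):
    """Derive PHQ-2, PHQ-4, PHQ-8 and PHQ-9 totals from item responses (0 to 3)."""
    if len(phq9_items) != 9 or len(gad7_items) != 7:
        raise ValueError("expected 9 PHQ-9 items and 7 GAD-7 items")
    phq2 = sum(phq9_items[:2])   # first two PHQ-9 items
    gad2 = sum(gad7_items[:2])   # first two GAD-7 items
    return {
        "PHQ-2": phq2,
        "PHQ-4": phq2 + gad2,        # depression core plus anxiety core
        "PHQ-8": sum(phq9_items[:8]),  # drops item 9 (suicide)
        "PHQ-9": sum(phq9_items),
    }


# Hypothetical respondent.
scores = phq_variant_scores([2, 2, 1, 1, 1, 1, 1, 1, 0], [1, 2, 0, 0, 0, 0, 0])
print(scores)                   # {'PHQ-2': 4, 'PHQ-4': 7, 'PHQ-8': 10, 'PHQ-9': 10}
print(scores["PHQ-2"] >= 3)     # True: at or above the original PHQ-2 cut-off
print(scores["PHQ-8"] >= 10)    # True: at or above the original PHQ-8 cut-off
```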

For external validity, we used tests that assess anxiety, resilience, and fear of COVID-19. The GAD-7 is a screening scale for measuring generalized anxiety disorder67. It has strong internal consistency (α = 0.94) and good test–retest reliability (ICC = 0.83) and uses seven items with response options ranging from 0 to 3. The total score is the sum of its items and ranges from 0 to 21. The thresholds proposed for the original scale are: minimal anxiety (0–4), mild anxiety (5–9), moderate anxiety (10–14), and severe anxiety (15–21). The GAD-2 questionnaire is contained within the GAD-7 and corresponds to its first two items. A score ≥ 3 indicates a clinically relevant anxiety disorder62,67,68,69. The RS-14 resilience test measures an individual’s degree of adaptation to adverse situations. This scale consists of 14 items with theoretical scores ranging from 14 to 98, which indicate the following levels of resilience: very low (14–30), low (31–48), fair (49–63), high (64–81), and very high (82–98)70. The Fear of COVID-19 Scale (FCV-19S)71,72 measures fear of coronavirus disease. Its theoretical scores range from 7 to 35, with higher scores indicating greater fear.
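The band definitions for the GAD-7 and the RS-14 can be expressed as lookup tables. The sketch below is one possible table-driven encoding of the ranges listed above; the names `GAD7_BANDS`, `RS14_BANDS`, and `classify` are illustrative assumptions rather than part of either instrument.

```python
from bisect import bisect_right

# (lower bounds of every band after the first, band labels, valid score range)
GAD7_BANDS = ([5, 10, 15], ["minimal", "mild", "moderate", "severe"], (0, 21))
RS14_BANDS = (
    [31, 49, 64, 82],
    ["very low", "low", "fair", "high", "very high"],
    (14, 98),
)


def classify(score, bands):
    """Return the label of the band that contains `score`."""
    lower_bounds, labels, (lowest, highest) = bands
    if not lowest <= score <= highest:
        raise ValueError(f"score must lie between {lowest} and {highest}")
    # bisect_right counts how many lower bounds are <= score,
    # which is exactly the index of the matching label.
    return labels[bisect_right(lower_bounds, score)]


print(classify(9, GAD7_BANDS))    # mild
print(classify(64, RS14_BANDS))   # high
```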

Finally, as an external criterion, questions about current depression and anxiety were included in the tool: suffering from depression; suffering from anxiety; taking medication in the last 30 days for depression, sleep, or anxiety; or being in psychiatric or psychological treatment in the last 30 days.

Informed consent was obtained from all participants.

Procedure

A self-administered survey hosted on an online survey platform (LimeSurvey) was used and accessed through a link. The survey was answered by adults from the general population of Costa Rica between March 22 and September 22, 2021, during the second year of the pandemic in the country. Respondents needed only internet access and a device to answer the survey, which could be completed on a mobile device, tablet, or computer. Participation in this study was promoted by the Ministry of Health of Costa Rica, the Distance Education University of Costa Rica (UNED), the National University (UNA), and the Costa Rican Social Security Fund (CCSS). The institutions requested participation through their institutional websites and their social networks, Facebook and Twitter.

Four attention-check items were distributed throughout the questionnaire: “If you are reading this, please select the following option: totally agree/agree/never/always”.

Participation was anonymous, confidential, and voluntary, and participants could withdraw from the survey at any time. Additionally, information on national psychological care and resources was available at the end of the survey.

All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of the Ministry of Health through the National Health Research Council (CONIS) under Agreement No. 22 of the ordinary session No. 38 of August 26, 2020, with the number: CONIS-242-2020 confirming that all research was performed in accordance with relevant guidelines or regulations. This study was not preregistered.

Data analysis

First, descriptive statistics (frequencies, means, standard deviations, skewness, and kurtosis) were obtained using the statistical program SPSS (v.25)73. To estimate concurrent validity, Spearman’s rho and Pearson’s correlations were used.
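To make the difference between the two coefficients concrete, the sketch below computes both from first principles: Pearson's r on the raw scores, and Spearman's rho as Pearson's r on average ranks. The study itself used SPSS; this plain-Python version and its toy data are illustrative assumptions only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: a monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(spearman(x, y), 3))  # 1.0
print(round(pearson(x, y), 3))   # 0.981
```

Reporting both coefficients, as in Table 2, is a common choice for ordinal questionnaire totals, since Spearman's rho does not assume a linear relationship.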

The FACTOR v.12.01.02 program74 was used for internal validity analyses, descriptives, frequencies, and item properties. Factor analysis was performed following recommended standards75. The analysis was based on the polychoric correlation matrix with the Minimum Rank Factor Analysis (MRFA) factor model76. To determine the Kaiser–Meyer–Olkin (KMO) index, the entire sample was split into two random subsamples with the Solomon method77. In the analysis presented below, explained variance is based on eigenvalues. Additionally, the Measure of Sampling Adequacy (MSA) statistic is reported; items with MSA values below 0.50 are candidates for elimination74. For internal consistency, McDonald’s Omega (ω) was calculated with the Omega macro for SPSS (beta 0.2)78.
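For readers unfamiliar with McDonald's Omega, the sketch below shows the textbook formula for a single-factor model with standardized loadings, where each item's unique variance is taken as one minus its squared loading. It is not the SPSS macro used in the study, and the loadings are hypothetical values chosen only for illustration.

```python
def mcdonald_omega(loadings):
    """Omega for a one-factor model with standardized loadings.

    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of unique variances),
    with unique variance assumed to be 1 - loading^2 for each item.
    """
    if not loadings:
        raise ValueError("at least one loading is required")
    common = sum(loadings) ** 2
    unique = sum(1 - loading ** 2 for loading in loadings)
    return common / (common + unique)


# Hypothetical loadings for a nine-item scale (not taken from the study).
example_loadings = [0.8] * 9
print(round(mcdonald_omega(example_loadings), 3))  # 0.941
```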

To assess sex invariance in the model, Multi-Group Confirmatory Factor Analysis (MGCFA) was performed in AMOS79. Changes in χ2 were reported with their degrees of freedom and p-values. The AIC, GFI, and RMSEA indices were reported to demonstrate goodness of fit to a single-factor model structure for the PHQ-9 test.

To calculate optimal cut-off points for depression, an additional question asked whether the respondent was in psychiatric or psychological treatment for depression; this robustly separates two groups, one with a depressive trait and one without. Similarly, to establish cut-off points for the PHQ-4 test, the criterion was suffering from depression while also suffering from anxiety. Youden’s J index and the closest-top-left criterion on the ROC curve were used; when the two did not coincide, Youden’s J statistic was chosen80. Sensitivity, specificity, and the area under the ROC curve are also presented. Cut-off points for the PHQ-8 were established based on its original version while preserving the width of each severity range from the optimal cut-off point65.
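The two cut-off criteria named above can be illustrated as follows. For each candidate cut-off c, a respondent is classified as positive when the score is at least c; Youden's J is sensitivity + specificity − 1, and the closest-top-left criterion minimizes the distance from the ROC point to the corner where sensitivity and specificity both equal 1. The data and function names are hypothetical, and the sketch is not the procedure or software used in the study.

```python
from math import hypot


def roc_table(scores, labels):
    """Sensitivity, specificity, Youden's J and top-left distance per cut-off."""
    positives = sum(labels)
    negatives = len(labels) - positives
    rows = []
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= cutoff and l == 1)
        tn = sum(1 for s, l in zip(scores, labels) if s < cutoff and l == 0)
        sensitivity = tp / positives
        specificity = tn / negatives
        rows.append({
            "cutoff": cutoff,
            "sensitivity": sensitivity,
            "specificity": specificity,
            "youden_j": sensitivity + specificity - 1,
            "distance": hypot(1 - sensitivity, 1 - specificity),
        })
    return rows


def optimal_cutoffs(rows):
    """Return (cut-off maximizing J, cut-off minimizing top-left distance)."""
    by_youden = max(rows, key=lambda r: r["youden_j"])
    by_distance = min(rows, key=lambda r: r["distance"])
    return by_youden["cutoff"], by_distance["cutoff"]


# Toy data: label 1 = meets the external criterion, 0 = does not.
scores = [1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

youden_cutoff, closest_cutoff = optimal_cutoffs(roc_table(scores, labels))
# Following the rule stated in the text, Youden's J decides if the two disagree.
chosen = youden_cutoff
print(youden_cutoff, closest_cutoff, chosen)  # 9 9 9
```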

Finally, to calculate new cut-off points indicating the severity of depression for each test, the widths of the ranges that make up each severity level were added to the cut-off points obtained with Youden’s J index and the closest-top-left criterion on the ROC curve.
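One possible reading of this step is that the original severity thresholds are shifted by the difference between the new optimal cut-off and the original screening cut-off, so that every severity band keeps its original width. The sketch below implements only that reading, as an assumption; the function name is hypothetical, and the resulting values are illustrative and may not match those reported in Table 6.

```python
# Original PHQ-9 thresholds (lower bounds of mild, moderate,
# moderate to severe, severe) and original screening cut-off, from the text.
ORIGINAL_PHQ9_THRESHOLDS = [5, 10, 15, 20]
ORIGINAL_SCREENING_CUTOFF = 10


def shifted_thresholds(optimal_cutoff,
                       thresholds=ORIGINAL_PHQ9_THRESHOLDS,
                       reference=ORIGINAL_SCREENING_CUTOFF,
                       max_score=27):
    """Shift every threshold by (optimal_cutoff - reference).

    Assumption: band widths are preserved and the whole set of
    thresholds moves together with the optimal cut-off.
    """
    offset = optimal_cutoff - reference
    shifted = [t + offset for t in thresholds]
    if shifted[0] < 1 or shifted[-1] > max_score:
        raise ValueError("shifted thresholds fall outside the score range")
    return shifted


# Optimal PHQ-9 cut-offs by sex as reported in the Results (11 and 10).
print(shifted_thresholds(11))  # [6, 11, 16, 21]
print(shifted_thresholds(10))  # [5, 10, 15, 20]
```

Under this reading, a higher optimal cut-off narrows the top band, because the maximum score does not move.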

A significance level of 0.05 was used for all analyses, and confidence intervals (CI) were set at 95%.

Results

Participants

A total of 1162 people participated in the study, with a mean age of 35.52 years (SD = 12.18) and ages ranging from 15 to 78 years (Table 1).

Psychometric properties

The means (standard deviations) of the PHQ-2, PHQ-4, PHQ-8, and PHQ-9 scales were 2.38 (1.94), 4.97 (3.65), 9.33 (6.90), and 9.80 (7.43), respectively, with scores reaching the theoretical range of the tests.

Internal consistency of items was high, with lower scores in item 9 of depression (Table 2). Among different scales, high correlations were observed, positive for depression and fear of COVID-19 and negative for resilience (Table 2).

Table 2 Correlation table. Spearman's rho bivariate correlations below the diagonal, Pearson’s above.

Eliminating any item did not increase the internal consistency of the questionnaire (Table 3). The suicide item represents an extreme condition of depression and has low scores. It forms part of the general dimension of depression, and it is recommended that it remain on the scale to assess depression. Item 4 (loss of energy) is considered the easiest item to answer, even though items 1 and 2 make up the core of major depressive syndrome.

Table 3 Descriptives, item properties and frequencies. PHQ-9.

The PHQ-9 test had a determinant close to zero (0.002) and obtained a similar KMO measure for both subsamples (0.937 and 0.933) and for the complete sample (0.996). A single factor was extracted that explained 73.33% of the variance. The eigenvalue of its next component did not reach unity (0.542).

The respective indices were obtained for different versions of PHQ-9 to calculate their internal consistency and factorial structure (Table 4). McDonald’s Omega could not be calculated for PHQ-2 because it only has two items; thus, a reduction in dimensions was not possible. However, results are presented for all cases where a single factor was extracted.

Table 4 Psychometric properties of the different versions of the PHQ-9.

Cut-off points

As shown in Fig. 1, the different versions of the PHQ-9 perform similarly against the criterion of a prior depression diagnosis. The Receiver Operating Characteristic (ROC) curves can be ordered by the number of items included in each questionnaire version.

Figure 1 ROC curve for the different versions of PHQ regarding the diagnosis of depression.

When analyzing the segment in the upper left corner (Closest-top-left) on the ROC curve, tests with more items have better specificity and sensitivity indices; however, these differences are slight (Table 5).

Table 5 Selected sensitivity and specificity based on perceived depression.

Table 5 shows the cut-off points of the PHQ versions for the general population and by sex. For the PHQ-2, the optimal thresholds by sex coincide with those of the general population, but they diverge for the other versions. The optimal cut-off points for men are 1 or 2 points lower than for women, depending on the version administered. For women, the optimal cut-off score is 11 points on both the PHQ-8 and the PHQ-9; for men, it is 8 and 10 points, respectively. Table 6 summarizes the thresholds found, as well as the severity of depression according to the different segments.

Table 6 Cut-off points for the severity of depressive symptoms of the PHQ-9 versions in the Costa Rican general population, its original measure and new cut-off points split by sex.

Discussion

The PHQ scale demonstrates adequate internal consistency (α ranging from 0.853 to 0.929) across the PHQ versions analyzed. These values are similar to those found in other studies26,60,81. All versions studied exhibit concurrent validity, with high positive correlations with the GAD-7 (ranging from 0.767 to 0.931), slight positive correlations with the FCV-19S (ranging from 0.246 to 0.338), and moderate negative correlations with the RS-14 (ranging from − 0.620 to − 0.647), all of which are consistent with expectations.

The PHQ-9 exhibits high inter-item correlations (ranging from 0.416 to 0.756), similar to those found in other studies. As expected, the suicide item has lower but still significant correlations (ranging from 0.416 to 0.552)82.

As an indicator of internal validity, a single factor was obtained in all PHQ-9 variants63,64,65 including the PHQ-4, which also yielded a single factor despite comprising two different theoretical dimensions66.

The thresholds for the general population coincide with those proposed in its original version. To assess depression in men, cut-off points have been lowered for both PHQ-4 and PHQ-8. In contrast, to assess depression in women, original thresholds for PHQ-2 and PHQ-4 have been maintained; however, cut-off points have been increased for both PHQ-8 and PHQ-9. If cut-off points from this study were not applied, men would be underdiagnosed, and women would be overdiagnosed.

Since scores from PHQ-8 are assimilated to those of PHQ-9 for convenience in its original article, this alters thresholds of PHQ-8 in practice leaving less scope for extreme scores. Although new cut-off points have been established to maintain the nature of PHQ-8, it continues to have less range in the severe part of depression.

The diagnostic validity of the PHQ-9, with a sensitivity of 78.60% and a specificity of 72.05%, is lower than but still adequate compared with the English version (88% sensitivity and 88% specificity)26, the Spanish version (87% sensitivity and 88% specificity)35,49, and the Mexican version of the PHQ-2 (80.0 and 86.9, respectively)83.

Despite having only two items, the PHQ-2 can detect depression with only slightly lower sensitivity and specificity values. Considering the reduction in items, it maintains excellent psychometric properties compared to its longer version. Since the PHQ-4 also assesses anxiety, it was compared with the anxiety-depression construct, obtaining a ROC curve similar to that obtained by comparing it only with depression.

One limitation is that, although both the gender and the sex of each participant were assessed, the number of people whose gender identity differed from their sex at birth was too small to form a representative group. Another limitation is that the race or ethnicity of participants was not assessed. The large number of discarded responses need not be considered a limitation, because the remaining sample size is sufficient and the data cleaning met valid criteria. One explanation for the missing data is that this research was publicized on TV, radio, newspapers, and institutional web pages, and the health minister and other relevant public figures asked the population to fill out the questionnaires. As a result, many people entered the platform just to satisfy their curiosity. The failure to retain most participants through the whole survey could itself be a limitation.

Although our sample is representative and heterogeneous in terms of age, sex, academic level, and marital status, we acknowledge that no formal approach was followed to ensure this. Because this is a cross-sectional study, a future longitudinal follow-up would be valuable. Convenience sampling limits the generalization of the findings of the present study. Another sampling limitation is that devices for accessing the internet are not evenly distributed in the country52, so people without internet access could not be reached. Despite these limitations, this study provides valuable information on the usefulness of the PHQ-9 within the Costa Rican population.

Conclusions

Because of their ability to detect depression, their brevity, and their self-administered format, the PHQ-2, PHQ-4, PHQ-8, and PHQ-9 tests are considered adequate tools for detecting depression and its severity in the general Costa Rican population. Due to the high consistency of this test, its application is considered valid in the entire Spanish-speaking community.

Proposed thresholds improve the detection of depression based on gender.