Introduction

Anxiety and depression are the most prevalent emotional disturbances (Wittchen et al., 2011). The high comorbidity between them, close to 50% of cases (Kessler et al., 2015), as well as the impact of both mood disorders on the functioning and quality of life of affected individuals, are priority problems for public health systems (Kroenke et al., 2009). The prevalence of anxiety disorders increased to 11% from 1990 to 2010, growing from 200 to 272 million reported cases worldwide (Baxter et al., 2014). A meta-analysis that examined 68 studies conducted in 30 countries between 1994 and 2014 reported a prevalence of depression of around 13% in the general population (Lim et al., 2018). World Health Organization (2011) population reports estimate that by 2030 emotional disorders will be the most disabling mental health conditions worldwide. The COVID-19 pandemic has increased the prevalence of mood disorders in clinical and non-clinical populations (Luo et al., 2020; Xiong et al., 2020), pointing to the relevance of reliable screening tools for the assessment and monitoring of symptom severity.

In recent years, the importance of brief, rapid, and reliable screening tools to facilitate the diagnosis of mood disorders in healthcare settings has been highlighted (Olariu et al., 2015). Different authors have proposed the use of brief screening tools to reduce misdiagnosis (Castro-Rodríguez et al., 2015), to optimize health system resources (Cano-Vindel et al., 2018), and to improve clinical outcomes (Goldberg et al., 2017). Specifically, the Patient Health Questionnaire-4 (PHQ-4) is one of the most widely used ultra-brief screening instruments to measure depressive and anxiety symptoms (Kroenke et al., 2009). This ultra-brief self-report instrument combines two items of the PHQ-9 (Kroenke et al., 2001) and two items of the Generalized Anxiety Disorder scale (GAD-7; Spitzer et al., 2006). The psychometric properties of the PHQ-4 have been explored in clinical (Cano-Vindel et al., 2018; Ghaheri et al., 2020; Kroenke et al., 2009; Renovanz et al., 2019; Weihs et al., 2018) and non-clinical samples (Fong et al., 2023; Kazlauskas et al., 2023; Khubchandani et al., 2016; Kocalevent et al., 2014; Larionow & Mudło, 2023; Löwe et al., 2010; Meidl et al., 2023; Mendoza et al., 2022; Mills et al., 2015) in several countries (e.g., Colombia, Denmark, Ecuador, Germany, Greece, Iran, Korea, Philippines, Poland, Spain, and United States, among others), but mainly in paper-and-pencil format.

Online screening tools are reliable for the detection of mood disorders (Muñoz-Navarro et al., 2017a, b). These questionnaires facilitate data collection and help to avoid the data loss associated with the classic paper-and-pencil format and the response bias associated with face-to-face interviews. The evaluation of the psychometric properties of an online version is necessary, even if the paper version has been explored (Campbell et al., 2015; Coons et al., 2009; Mendoza et al., 2022). Recently, Cano-Vindel et al. (2018) tested the dimensionality, reliability, and validity of a computerized version of the PHQ-4 in a Spanish sample of 1052 patients from 28 primary care centers. Results indicated adequate internal consistency for depressive (α = .86) and anxiety (α = .76) symptoms. Even though the PHQ-4 has been standardized on a representative sample of 1500 people from the general Colombian population through face-to-face interviews (Kocalevent et al., 2014), also with adequate properties, there is no evidence of the psychometric properties of an online version of this instrument in Spanish-speaking countries other than Spain. This is the first study that provides information on the online version of the PHQ-4 in a large sample of the Colombian population during the COVID-19 pandemic.

As far as is known, the goodness-of-fit of a bifactor structure for the PHQ-4 has not been tested. Recently, Tibubos et al. (2021) evaluated the internal structure of the PHQ-9 using confirmatory factor analysis (CFA). The bifactor model yielded an excellent fit to the data, superior to that obtained with the one- and two-factor models. Two types of latent factors are defined in bifactor models. The first is a general factor on which all items are allowed to load (i.e., PHQ-4) and the second comprises specific factors across which the items are distributed according to their content (i.e., PHQ-2 and GAD-2). In the case of the PHQ-4, and following Clark and Watson's (1991) tripartite model, the general factor reflects the shared component of depression and anxiety (i.e., psychological distress; Drapeau et al., 2012), whereas the specific factors (depression and anxiety after controlling for the general negative affect factor) represent low positive affect (for depression; Kroenke et al., 2009) and hyperarousal (for anxiety; Kroenke et al., 2009).

For all these reasons, it seems relevant to study the psychometric properties of the PHQ-4. In particular, this study evaluated the dimensionality and reliability of the online version of the PHQ-4 in a large sample of the general population in Colombia collected during the first phase of the lockdown measures prompted by the COVID-19 pandemic (i.e., May 20 to June 20, 2020). The four objectives and hypotheses explored in this research are presented below:

  1. Firstly, to examine the goodness-of-fit of the one-factor, two-factor, and bifactor models of the PHQ-4. In line with the evidence reported in previous studies (Cano-Vindel et al., 2018; Fong et al., 2023; Kocalevent et al., 2014; Kroenke et al., 2009; Löwe et al., 2010; Meidl et al., 2023; Mendoza et al., 2022), it was hypothesized that the two-factor correlated model would have a significantly better fit to the data than the other models (hypothesis 1).

  2. Secondly, to test the invariance (configural, metric, and scalar) of the best-fitting model across socio-demographic characteristics. As in previous validations (Kocalevent et al., 2014; Larionow & Mudło, 2023; Löwe et al., 2010; Mendoza et al., 2022), it was expected that the dimensions would be invariant across gender, age, income level, education level, and region (hypothesis 2).

  3. Thirdly, to explore the reliability of the PHQ-4 and its subscales through different reliability indexes (i.e., Cronbach's α, McDonald's ω, and Guttman's λ2). The PHQ-2, the GAD-2, and the PHQ-4 were expected to measure depression, anxiety, and psychological distress reliably, regardless of the reliability index examined (hypothesis 3).

  4. Fourthly, to explore the relationship between PHQ-4 scores and the socio-demographic characteristics of this sample. Based on the results from previous psychometric studies (Cano-Vindel et al., 2018; Fong et al., 2023; Kocalevent et al., 2014; Löwe et al., 2010; Meidl et al., 2023; Mendoza et al., 2022), it was expected that females, older individuals, employed individuals, and those with lower incomes or lower levels of education would exhibit higher depressive, anxiety, and psychological distress symptoms (hypothesis 4).

Method

Study design

Data analyses of the online version of the PHQ-4 were conducted using the database of the PSY-COVID study in Colombia (Sanabria-Mazo & Sanz, 2021). PSY-COVID is a cross-sectional study that aimed to assess the psychosocial impact of the COVID-19 pandemic in 30 countries. Specifically, this article explored data from the general population residing in Colombia during the first phase of the lockdown measures. Two previous studies using the PSY-COVID database have been published on the impact of COVID-19 lockdown measures on mental health in the Colombian population (see Sanabria-Mazo et al., 2021a, b).

Participants

In total, 18,833 people completed the online questionnaire in Colombia, of whom 772 were excluded from this analysis because they resided in other countries during the first wave of the COVID-19 pandemic. The final sample consisted of 18,061 participants from all regions of the country. As shown in Table 1, the majority of the participants were female (75%), aged 25–34 years (30%), with medium income levels (62%), with university education level (90%), and resided in the Andean region (52%). Inclusion criteria were being an adult (≥ 18 years old) and residing in Colombia during the period in which the data were collected (see Table 1). No participants who met the eligibility criteria were excluded from the analyses.

Table 1 Socio-demographic characteristics of the sample

Procedure

An anonymous online questionnaire generated with Google Forms® was administered using non-probabilistic sampling (snowball method) from May 20 to June 20, 2020. The survey was distributed through social networks, media, and institutional contacts. A panel of 30 international experts in clinical and health psychology validated the online questionnaire. The instruments used in this online survey were piloted prior to administration. No economic incentives were offered to participants for responding to this anonymous survey. The time to complete the socio-demographic and PHQ-4 items was approximately 3 min. More information on the procedure of the PSY-COVID study in Colombia is available in Sanabria-Mazo et al. (2021a, b). This research was approved by the Ethical Committee on Animal and Human Experimentation of the Autonomous University of Barcelona (CEEAH-5197).

Measures

The Patient Health Questionnaire-4 (PHQ-4) was used to measure depressive and anxiety symptoms (Löwe et al., 2010). The two items of the PHQ-2 correspond to symptoms in the Diagnostic and Statistical Manual of Mental Disorders-IV (DSM-IV) diagnostic criteria for major depressive disorder (i.e., loss of interest and depressive mood) and the two items of the GAD-2 to symptoms of generalized anxiety disorder (i.e., nervousness and worries). This version contains 4 items with a 4-point Likert response format, where 0 corresponds to "not at all" and 3 to "nearly every day", and questions refer to the last two weeks. The total score of the PHQ-4 (psychological distress) ranges from 0 to 12 and the score of each of its two subscales (PHQ-2 and GAD-2) ranges from 0 to 6. The cut-off point for detecting probable cases of depression (PHQ-2) or anxiety (GAD-2) is 3 or more on each subscale, and the cut-off point for probable cases of psychological distress (PHQ-4) is 6 or more on the total scale (Cano-Vindel et al., 2018; Kocalevent et al., 2014; Löwe et al., 2010; Mendoza et al., 2022).
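The scoring rules described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (subscale sums, total score, and the cut-offs of 3 and 6); the names `PHQ4Result` and `score_phq4` are hypothetical and are not part of the study's materials. The two depression items and the two anxiety items are passed separately so that no assumption is made about item order in the questionnaire.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PHQ4Result:
    phq2: int                  # depression subscale, 0-6
    gad2: int                  # anxiety subscale, 0-6
    total: int                 # psychological distress, 0-12
    probable_depression: bool  # PHQ-2 >= 3
    probable_anxiety: bool     # GAD-2 >= 3
    probable_distress: bool    # PHQ-4 >= 6


def _subscale_sum(items: Sequence[int], name: str) -> int:
    """Validate two Likert responses (0-3) and return their sum."""
    if len(items) != 2:
        raise ValueError(f"{name} needs exactly two item responses")
    if any(not isinstance(x, int) or x < 0 or x > 3 for x in items):
        raise ValueError(f"{name} responses must be integers from 0 to 3")
    return sum(items)


def score_phq4(phq2_items: Sequence[int], gad2_items: Sequence[int]) -> PHQ4Result:
    """Score one respondent using the cut-offs stated in the text."""
    phq2 = _subscale_sum(phq2_items, "PHQ-2")
    gad2 = _subscale_sum(gad2_items, "GAD-2")
    total = phq2 + gad2
    return PHQ4Result(
        phq2=phq2,
        gad2=gad2,
        total=total,
        probable_depression=phq2 >= 3,
        probable_anxiety=gad2 >= 3,
        probable_distress=total >= 6,
    )
```

For example, a respondent who answers 1 and 2 on the depression items and 0 and 1 on the anxiety items obtains PHQ-2 = 3 (at the cut-off for probable depression), GAD-2 = 1, and a total of 4 (below the cut-off for psychological distress).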

In addition, a socio-demographic information questionnaire was included to collect data about gender (female and male), age, income level (low, medium, and high), work status (employed and unemployed), educational level (primary, secondary, and university), and region of residence (Amazon, Andean, Caribbean, Orinoco, and Pacific).

Data analyses

Initially, descriptive statistics were calculated for socio-demographic variables as frequencies (n) and percentages (%). Characteristics of the PHQ-4 were explored, including item means and standard deviations (SD), skewness and kurtosis, corrected item-total correlations, correlations among items within each subscale, and correlations between items of different subscales. Given the brevity of the scales, these correlations were analyzed using the Spearman-Brown correction. No outliers were identified in the analyses and none of the participants were excluded due to missing data. The dimensionality of the PHQ-4 was examined through CFA, using maximum likelihood as the estimation method. Three models were tested: (1) a one-factor model with the four items loading on a single latent factor; (2) a two-factor model including two correlated dimensions; and (3) a bifactor model with the four items loading on a global latent factor of psychological distress plus two uncorrelated specific factors of anxiety and depression.
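As a brief illustration of the Spearman-Brown correction mentioned above, the following sketch applies the standard prophecy formula, r_SB = k·r / (1 + (k − 1)·r), which for a two-item scale reduces to 2r / (1 + r). The function name `spearman_brown` is hypothetical, and how exactly the correction was applied in the study's software is not specified in the text, so this is a sketch of the formula only.

```python
def spearman_brown(r: float, k: float = 2.0) -> float:
    """Spearman-Brown prophecy formula.

    r: correlation between the parts (e.g., the two items of a subscale).
    k: factor by which the test is lengthened (2 for a two-item scale
       built from one inter-item correlation).
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be a correlation between -1 and 1")
    if k <= 0:
        raise ValueError("k must be positive")
    denominator = 1.0 + (k - 1.0) * r
    if denominator <= 0:
        raise ValueError("formula is undefined for this combination of r and k")
    return k * r / denominator
```

For instance, an inter-item correlation of .64 on a two-item scale corresponds to a corrected coefficient of 2(.64) / 1.64, or about .78.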

The Tucker–Lewis Index (TLI), Normed Fit Index (NFI), and Comparative Fit Index (CFI) were used to evaluate goodness-of-fit, with values > .90 and a Root Mean Square Error of Approximation (RMSEA) < .08 taken as acceptable, according to Schermelleh-Engel et al. (2003). The invariance (configural, metric, and scalar) of the models was tested by gender, age, income level, education level, and region in comparable subsamples with random assignment. Configural invariance provides evidence on the consistency of the factor structure of the model across groups, metric invariance on the equality of the factor loadings of the items across groups, and scalar invariance on the equality of the item intercepts across groups, which is what allows mean scores to be compared between groups (Van de Schoot et al., 2012). In addition, to determine measurement invariance, multigroup CFA was conducted, with a change in CFI (ΔCFI) less than or equal to .01 taken as evidence of invariance, according to Chen (2007).
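The ΔCFI decision rule described above can be expressed as a small check over the sequence of nested models. The sketch below assumes that the CFI of each model (configural, metric, scalar) has already been obtained from the multigroup CFA software; the function name `check_invariance` and the CFI values in the usage note are hypothetical and are not results from this study.

```python
from typing import Dict


def check_invariance(cfi: Dict[str, float],
                     threshold: float = 0.01,
                     tol: float = 1e-9) -> Dict[str, bool]:
    """Apply the rule 'drop in CFI <= threshold' to each nested step.

    cfi must contain the keys 'configural', 'metric', and 'scalar'.
    Each more constrained model is compared with the preceding one.
    tol absorbs floating-point rounding (0.99 - 0.98 is not exactly 0.01).
    """
    levels = ["configural", "metric", "scalar"]
    results = {}
    for less_constrained, more_constrained in zip(levels, levels[1:]):
        drop = cfi[less_constrained] - cfi[more_constrained]
        results[more_constrained] = drop <= threshold + tol
    return results
```

With hypothetical values such as `{"configural": 0.995, "metric": 0.993, "scalar": 0.990}`, both the metric and the scalar step pass the check, because neither constraint lowers the CFI by more than .01.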

To estimate reliability, the internal consistency of both the total scale (PHQ-4) and the subscales (PHQ-2 and GAD-2) was assessed through Cronbach's α, McDonald's ω, and Guttman's λ2. Furthermore, a known-groups validity approach was used to estimate associations between PHQ-4 scores and socio-demographic characteristics that have been reported in the literature as risk factors for depression and anxiety. For this purpose, univariate group comparisons were performed with the PHQ-2, the GAD-2, and the PHQ-4 scores as dependent variables through t-tests and analysis of variance (ANOVA), applying the Bonferroni adjustment for multiple testing. Statistical analyses were performed with SPSS-26®, AMOS-5, RStudio, and JASP®.
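Two of the reliability coefficients named above, Cronbach's α and Guttman's λ2, can be computed directly from the item covariance matrix. The following sketch uses only the Python standard library and the textbook formulas; it is not the routine used by SPSS, R, or JASP in the study, and the function names are hypothetical. McDonald's ω is omitted because it requires the loadings of a fitted factor model.

```python
from math import sqrt
from typing import List, Sequence


def covariance_matrix(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Sample covariance matrix (n - 1 denominator); rows are respondents."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(k)]
        for i in range(k)
    ]


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """alpha = k / (k - 1) * (1 - sum of item variances / total variance)."""
    cov = covariance_matrix(rows)
    k = len(cov)
    total_var = sum(sum(row) for row in cov)  # variance of the sum score
    item_var = sum(cov[i][i] for i in range(k))
    return k / (k - 1) * (1 - item_var / total_var)


def guttman_lambda2(rows: Sequence[Sequence[float]]) -> float:
    """lambda2 = lambda1 + sqrt(k / (k - 1) * C2) / total variance,
    where lambda1 = 1 - sum of item variances / total variance and
    C2 is the sum of squared off-diagonal covariances."""
    cov = covariance_matrix(rows)
    k = len(cov)
    total_var = sum(sum(row) for row in cov)
    off_diag = [cov[i][j] for i in range(k) for j in range(k) if i != j]
    lambda1 = sum(off_diag) / total_var
    c2 = sum(c * c for c in off_diag)
    return lambda1 + sqrt(k / (k - 1) * c2) / total_var
```

A property worth noting is that λ2 is never smaller than α, and that for a two-item scale the two formulas give the same value.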

Results

Item and scale characteristics

Table 2 shows descriptive analyses of the items, subscales (PHQ-2 and GAD-2), and total scale (PHQ-4). The mean (SD) score was 2.28 (1.61) for the PHQ-2, 2.01 (1.67) for the GAD-2, and 4.29 (3.01) for the PHQ-4. Corrected item-total correlations ranged from r = .62 to r = .77. The correlation between the two items of the PHQ-2 was r = .64 and between the two items of the GAD-2 was r = .68, while the correlations of the items with the items of the other subscale ranged from r = .46 to r = .68. The PHQ-2 and GAD-2 had a correlation of r = .64, indicating high overlap between subscales. All the above correlations were statistically significant (p < .01).

Table 2 Characteristics of the items and subscales of the PHQ-4

Dimensionality

As shown in Table 3, the fit indices for the correlated two-factor model were significantly better (χ² = 938.24, p < .001) than those obtained for the one-factor model [CFI (0.99 vs. 0.94), TLI (0.99 vs. 0.83), NFI (0.99 vs. 0.94), and RMSEA (0.04 vs. 0.23)], which provides strong support for the adequacy of the original model proposed by Kroenke et al. (2009). The bifactor structure was also tested, but this model did not converge.

Table 3 Confirmatory factor analysis (CFA) comparing fit indices of the one-factor and two-factor model of PHQ-4

Factor loadings ranged between .71 and .92 in the two-factor model and between .68 and .83 in the one-factor model (see Fig. 1). The results differed slightly between the specific factors of anxiety and depression. In line with hypothesis 1, these results confirm that the two-factor correlated model has a significantly better fit to the data than the other models.

Fig. 1 Factor loadings for the one- and two-factor models of the PHQ-4

Comparable subsamples with random assignment were used to test the invariance (configural, metric, and scalar) of the two-factor correlated model by gender (female: n = 4,305; male: n = 4,295), age (≤ 32 years: n = 9,169; > 32 years: n = 8,892), income level (low: n = 1,173; medium: n = 1,186; high: n = 1,102), education level (primary: n = 340; secondary: n = 436; university: n = 485), and region (Amazon: n = 285; Andean: n = 447; Caribbean: n = 345; Orinoco: n = 328; Pacific: n = 420). Table 4 shows that no structural differences were identified in the best-fitting model according to gender, age, income level, education level, and region, with a ΔCFI lower than .01, which confirms hypothesis 2.

Table 4 Test for configural invariance across gender, age, income level, education level, and region using multi-group CFA

Reliability

Reliability of the PHQ-2 (α = .79, ω = .81, and λ2 = .80), the GAD-2 (α = .83, ω = .83, and λ2 = .82), and the PHQ-4 (α = .86, ω = .86, and λ2 = .86) was above .78 on all calculated indicators. The adequate reliability indices for depression, anxiety, and psychological distress confirm hypothesis 3.

Known groups validity

As shown in Table 5, statistically significant differences were found in the PHQ-2, the GAD-2, and the PHQ-4 scores according to gender, age, income level, work status, and educational level, but with small effect sizes (d < 0.2 and η2 < 0.12). Females, younger participants, unemployed participants, and those with lower incomes and educational levels reported higher depression (PHQ-2), anxiety (GAD-2), and psychological distress (PHQ-4) scores. The higher scores for females and for those with lower incomes and educational levels were consistent with hypothesis 4. However, the results were inconsistent with the higher scores expected for older and employed participants. More detailed information on the prevalence of depressive (35%) and anxiety (29%) symptoms can be found in Sanabria-Mazo et al. (2021a).

Table 5 Association between PHQ-4 scores and socio-demographic characteristics

Discussion

The findings of this study provide evidence that the online version of the PHQ-4 is a reliable ultra-brief self-administered instrument for measuring depressive and anxiety symptoms in the general population in Colombia. Previous studies have demonstrated the validity of the classic paper-and-pencil format of the PHQ-4 in clinical (Ghaheri et al., 2020; Kroenke et al., 2009; Renovanz et al., 2019; Weihs et al., 2018) and non-clinical samples (Khubchandani et al., 2016; Kocalevent et al., 2014; Löwe et al., 2010; Mills et al., 2015), as well as of the online version in non-clinical samples during the first few months of the COVID-19 outbreak in the Philippines (Mendoza et al., 2022). However, as far as is known, this is the first study to evaluate the dimensionality and reliability of an online version of the PHQ-4 in a large sample of the general population in Colombia during the COVID-19 pandemic.

Consistent with previous research carried out using the classic paper-and-pencil format and face-to-face interviews (Kocalevent et al., 2014; Kroenke et al., 2009; Löwe et al., 2010), CFA indicates that the two-factor structure (i.e., depression and anxiety) of the online PHQ-4 performs significantly better than the one-factor structure (i.e., psychological distress), with excellent fit indices on all parameters (CFI = .99, TLI = .99, NFI = .99, RMSEA = .04). Furthermore, these findings were consistent with those reported in another recent study of the PHQ-4 administered online (Mendoza et al., 2022). The bifactor model did not converge, probably due to the small number of indicators per latent variable. Therefore, these results demonstrate that the two-factor correlated model is the best fit to the data, confirming hypothesis 1.

The high correlation between the depression (PHQ-2) and anxiety (GAD-2) subscales was similar to those reported in previous studies (Kroenke et al., 2009; Löwe et al., 2010; Mendoza et al., 2022). Comorbidity between these mood disorders, close to 50% of cases (Kessler et al., 2015), theoretically explains the high correlation identified between the two subscales of the PHQ-4 (Kocalevent et al., 2014; Kroenke et al., 2009; Löwe et al., 2010; Mendoza et al., 2022). The structure of two factors (depression and anxiety) that share a higher order factor (psychological distress) is consistent with the conception of two nosological entities clearly differentiated by their causal cognitive processes and clinical manifestations (Clark & Watson, 1991) and with the extensive empirical evidence of the high comorbidity of both disorders (Kessler et al., 2015).

As in previous validations (Fong et al., 2023; Kocalevent et al., 2014; Löwe et al., 2010; Meidl et al., 2023; Mendoza et al., 2022), the two-factor structure of the PHQ-4 was invariant (configural, metric, and scalar) across gender, age, income level, education level, and region in this study, supporting hypothesis 2. Given the geographically based cultural heterogeneity of Colombia, the regional invariance of the PHQ-4 seems to be a particularly relevant finding. Despite criticisms about the real relevance of the invariance of psychological assessment instruments (Welzel et al., 2021), there is broad consensus that in cross-cultural studies it is necessary to ensure that instruments behave homogeneously. Recently, the results of a cross-cultural study conducted in 7 European countries (Austria, Croatia, Georgia, Germany, Lithuania, Portugal, and Sweden) have shown that the PHQ-4 could be generalized to other countries and cultures (Kazlauskas et al., 2023).

With regard to reliability, internal consistency values were slightly higher than those reported in other psychometric studies (Cano-Vindel et al., 2018; Ghaheri et al., 2020; Khubchandani et al., 2016; Kocalevent et al., 2014; Löwe et al., 2010; Mendoza et al., 2022), with values close to α = .79 for depression (PHQ-2), α = .83 for anxiety (GAD-2), and α = .86 for psychological distress (PHQ-4), which supports hypothesis 3. Although small effect sizes were obtained, the findings of this study provide further evidence of the role of gender, age, income level, work status, and educational level as risk factors for depression and anxiety. In line with other studies and hypothesis 4, females and people with low income and low education levels reported higher scores on depression, anxiety, and psychological distress (Khubchandani et al., 2016; Kocalevent et al., 2014; Löwe et al., 2010; Mendoza et al., 2022; Mills et al., 2015). In contrast, younger and unemployed people reported higher scores than those who were older (Löwe et al., 2010) and employed (Kocalevent et al., 2014), results in line with a vast amount of empirical evidence reflecting the negative effect of the COVID-19 pandemic on mental health (Hossain et al., 2020), a phenomenon also observed in other studies on the Colombian population (Caballero-Domínguez et al., 2022; Cénat et al., 2022; Gómez-Restrepo et al., 2022).

Regarding prevalence, 35% of the participants in this sample showed depressive symptoms and 29% anxiety symptoms (see Sanabria-Mazo et al., 2021a, b). Consistent with other research in the general population during the COVID-19 pandemic (Luo et al., 2020; Xiong et al., 2020), it was found that about one third of this sample showed depressive and anxiety symptoms. Specifically, the population groups most affected by the COVID-19 pandemic in Colombia were low-income people, students, and young adults, with depressive symptoms between 46 and 56% and anxiety symptoms between 36 and 40% (Sanabria-Mazo et al., 2021a). Compared with the results of a national survey, this study provided evidence that there were 2.5 to 2.8 times more people at risk of anxiety and 1.5 to 1.9 times more at risk of depression in the first wave of the COVID-19 outbreak (Sanabria-Mazo et al., 2021a). Although the above comparison is based on a non-representative sample of the Colombian population, the reported differences highlight the need to prioritize prevention, intervention, and monitoring of symptoms related to emotional disorders.

Limitations

These findings should be interpreted considering the following limitations. First, the analyses were conducted based on a non-representative sample, which impedes the generalizability of the results to the general population of Colombia or other Spanish-speaking countries. Second, due to the cross-sectional design, it was not possible to calculate the test–retest reliability of the instrument. Third, although the convergent and divergent validity of the PHQ-4 has been demonstrated in previous studies (Kocalevent et al., 2014; Kroenke et al., 2009; Löwe et al., 2010), no other instruments were used to provide further evidence of construct validity. Fourth, diagnostic interviews were not considered as a procedure to verify criterion validity, making it impossible to provide further evidence on specificity and sensitivity for the optimal cut-off point (Mitchell et al., 2016; Plummer et al., 2016). Fifth, it was not possible to examine the responsiveness, the smallest detectable change, or the minimal clinically important difference for scoring the PHQ-4. Sixth, online data collection can have a negative impact on the representation of population groups with internet connection difficulties, lack of knowledge in the use of new technologies, and low literacy.

Conclusions

In summary, this study provides further evidence on the dimensionality and reliability of an ultra-brief online screening instrument for the detection of depressive and anxiety symptoms. It also shows that presentation in its online format does not alter its psychometric properties and that it is invariant across gender, age, income level, education level, and region. The existing results from the PHQ-2 and GAD-2 denote similar psychometric behavior to the full versions of the PHQ-9 and GAD-7. Although the PHQ-2 and GAD-2 are reliable subscales for rapid screening of depression and anxiety, the use of their full versions is recommended when all DSM-IV diagnostic criteria need to be assessed. In line with the proposal by Löwe et al. (2010), it is suggested that the total scale (PHQ-4) be used as a global screening tool for psychological distress, and the depression (PHQ-2) and anxiety (GAD-2) subscales for the separate detection of each condition. Finally, the PHQ-4 is an ultra-brief screening tool that can help to optimize the time resources of health systems, especially during health crises.