Cognitive models of social anxiety (SA) and depression propose a hierarchical model whereby maladaptive cognitive schemas guide attentional processes and facilitate biased interpretations congruent with these schemas (Amir et al., 1998; Beck, 1976; Coles et al., 2008; Heimberg et al., 2014; Schultz & Heimberg, 2008). Specifically, in SA, models have hypothesized that interpretation biases are not only associated with SA, but also are a relevant maintenance mechanism (Clark & Wells, 1995; Heimberg et al., 2014). Similarly, it has been hypothesized that individuals with depression tend to create more negative interpretations of ambiguous information (Clark et al., 2000). Some authors consider that this negative interpretation of ambiguous information can be the cause of depression (Beck & Haigh, 2014).

Interpretation biases in ambiguous social scenarios have been examined mainly through two different lines of research. The first line includes studies that evaluate interpretation biases in imagined ambiguous social scenarios, which are verbally described to the participants (Dineen & Hadwin, 2004; Everaert et al., 2020; Sanchez et al., 2015). Commonly, these studies show a higher negative interpretation bias in participants with SA or depression. In fact, a meta-analysis has confirmed the association between the negative interpretation of ambiguous scenarios and verbal stimuli with SA (g = 0.97; Chen et al., 2020). Another meta-analysis that included mainly studies employing this type of stimuli also found an overall association between negative interpretation and symptomatology of depression (g = 0.72; Everaert et al., 2020).

The second line of research includes studies based on the interpretation of facial expressions. Faces are a crucial source of information for recognizing the emotional state of others (Ekman & Friesen, 2003). Consequently, the processing of ambiguous facial expressions is a fundamental aspect of social interactions. When these situations are interpreted negatively, hostile and critical intentions are attributed to others, which may contribute to triggering anxious and/or depressive feelings (Maoz et al., 2016). The method employed to assess interpretation bias using faces typically involves stimuli generated as intermediate (i.e., morphed) images between two facial expressions, each showing a different emotion, resulting in a battery of expressions of varying intensity. For example, Jusyte and Schönenberg (2014) combined fearful, happy, and angry faces and generated three sets of stimuli (happy–fearful, happy–angry, fearful–angry).

In terms of anxiety, some studies concluded that people with SA tend to classify ambiguous faces as less trustworthy (Gutiérrez-García & Calvo, 2016; Gutiérrez-García et al., 2019) and to interpret ambiguous faces (Maoz et al., 2016), or even neutral or ambiguous faces (Lira Yoon & Zinbarg, 2007; Prieto-Fidalgo et al., 2022), as negative to a greater extent. However, other studies did not find this tendency (Jusyte & Schönenberg, 2014; Schofield et al., 2007). Moreover, the relationship between the interpretation of ambiguous faces and SA is not entirely clear. In one study, individuals with SA were sensitive to the identification of the expression of “fear” when comparing fear–anger or fear–happy faces, but no differences emerged when comparing happy–angry images (Garner et al., 2009). Another study with a similar methodology and stimulus set failed to replicate differences between the SA group and healthy participants (Jusyte & Schönenberg, 2014). Similarly, no differences were found in the identification of facial morphs with happy and disgusted expressions as anchor points (Schofield et al., 2007). Despite the diversity in findings, which may be explained by the diversity of methodologies employed, a recent meta-analysis found that people with higher levels of SA tend to interpret images of faces more negatively, with a medium effect size (g = 0.60; Chen et al., 2020). Some studies also found differences in the reaction time (RT) needed to make a decision. For example, one study demonstrated that the social anxiety group was faster to interpret images as angry than as happy and was slower when making happy judgments relative to the control group (Maoz et al., 2016).

Regarding depression, a meta-analysis found that people with high depressive symptomatology made more negative interpretations as well as fewer positive interpretations in studies that included the evaluation of ambiguous verbal or visual stimuli (Everaert et al., 2017). However, only three of the 87 included studies used a face interpretation task (Beevers et al., 2009; Lee et al., 2016). One of the studies found a higher number of negative interpretations in ambiguous faces in a combination of happiness–sadness and happiness–fear in participants with depressive symptoms (Beevers et al., 2009). In addition, patients with major depression had greater general difficulty in detecting happy emotions when compared with neutral emotions (Soto et al., 2021). Additionally, a study has also found that people with depression require less time to correctly recognize faces of sadness, fear, and anger (Lee et al., 2016). Another study reported that the depressed group, compared with the control group, required more time to interpret happy faces (Leppänen et al., 2004). Shiroma and colleagues (2016) found that participants without depression were faster at identifying happy faces when compared with people with depression, but the differences were not statistically significant. In summary, studies suggest that people with depression require more time to process faces associated with positive emotions (happy faces) and require less time to process faces associated with negative emotions (sadness, fear, anger).

Indeed, cognitive mechanisms are assumed to be hierarchically organized, so interpretation biases would be guided by cognitive styles (Beck & Dozois, 2011; Clark & McManus, 2002). One example is the looming maladaptive style (LMS; Riskind et al., 2000). LMS is proposed as a maladaptive cognitive style, mainly associated with anxiety. It has been described as a danger schema that “produces schematic biases in the selection, interpretation, and recall of potential threat” (Riskind et al., 2000, p. 838). People with LMS would tend to perceive that a threat is going to get progressively worse. Typically, LMS has been associated with anxiety disorders (Adler & Strunk, 2010), including SA (Brown & Stopa, 2008), but not with depression (González-Díez et al., 2014; Riskind & Williams, 2005), and it has been shown to predict maladaptive cognitions such as automatic thoughts (Calvete et al., 2016). In fact, it has been hypothesized that LMS could influence interpretation biases (Riskind et al., 2000). For example, Riskind et al. (2000) asked participants to listen to homophone words (e.g., “die” vs. “dye”) and found that LMS predicted the tendency to hear a greater number of threatening words (e.g., “die”). They also found a similar effect after the presentation of images, i.e., LMS was associated with the recall of a greater number of threatening pictures (explicit memory). Finally, LMS in that study was associated with writing down a greater number of threatening words (implicit memory) when participants were asked to write down the first word that came to mind. In the same way, since it has been proposed that LMS would lead to a more biased interpretation and that faces are a relevant social cue (Ekman & Friesen, 2003), it can be expected that LMS related to social events would also be associated with a negative interpretation of ambiguous faces.

The disparity in the results within SA and between SA and depression may be due to differences in methodology. For example, some studies measure responses dimensionally, considering the intensity of emotion (Garner et al., 2009; Jusyte & Schönenberg, 2014) or the degree of confidence that it generates (Gutiérrez-García & Calvo, 2016), which could require greater reflection on the part of the participants than more direct methodologies, such as a forced choice between two emotions (Maoz et al., 2016). Moreover, studies that force discrimination of ambiguous faces between two emotions (e.g., happy–angry) presuppose a negative valence for certain emotions (e.g., anger, sadness, fear, disgust) and a positive one for others (e.g., happy, being content). However, a happy face can be considered false or unreliable (Gutiérrez-García & Calvo, 2016). A final example of methodological differences lies within the stimuli themselves. For instance, using stimuli that include facial features highly associated with a specific emotion (e.g., faces with the mouth opened and visible teeth identified as smiling faces and faces with the mouth closed as angry faces; Jusyte & Schönenberg, 2014) can probably lead to automatic responses that could prevent the activation of higher-order cognitive mechanisms associated with interpretation biases.

As previously discussed, theoretical models of social anxiety and depression consider interpretation biases as a relevant maintenance factor (Beck & Haigh, 2014; Clark & Wells, 1995; Clark et al., 2000; Heimberg et al., 2014). In addition, it is believed that tasks employing visual stimuli of faces may be more ecologically valid than other types of psychometric tools for assessing interpretation biases (Heuer et al., 2010). Despite all this, to the best of our knowledge, no previous study on interpretation bias that uses images of facial expressions has examined the test–retest reliability of the instrument. Given that the reliability of a test is a significant component of its validity (Kappenman et al., 2014) and is necessary to establish the consistency of individual patterns (MacLeod et al., 2019), analyzing the reliability of this type of instrument is essential. For example, having an instrument that is consistent across time makes it possible to carry out longitudinal studies or to analyze whether interventions can modify interpretation bias. It also provides a test of the stability of the construct across time. The scarcity of interest in test–retest reliability in interpretation biases contrasts with the in-depth analyses of stability that have been performed on attentional bias tasks (e.g., the dot probe), which tap into low levels of processing and do not require participants’ interpretation or identification (Bantin et al., 2016). The results have indicated low levels of test–retest reliability (MacLeod et al., 2019), reducing enthusiasm for using this task. However, the test–retest reliability of interpretation bias tasks, which require higher-level cognitive processes (identification or interpretation), remains to be determined.

Therefore, this study developed a facial-emotion task to assess interpretation bias based on some methodological improvements relative to prior work. First, as mentioned above, expressions of happiness may not necessarily be interpreted as positive. Therefore, methods that ask the participant to recognize faces as a concrete emotion can fail to identify interpretation bias when a participant correctly recognizes a happy face as happy but interprets the face in a negative way (e.g., unreliable or untrustworthy). Thus, employing a method other than the classification of the emotion could better capture the interpretation bias (Gutiérrez-García et al., 2016). For example, a forced choice between “positive” and “negative” options could be an adequate option. Second, to avoid an automatic response to signals that are strongly associated with a specific emotion (e.g., an open mouth with happiness), following the recommendation of Jusyte and Schönenberg (2014), stimuli could be selected without this type of obvious signal (facial expressions with a closed mouth).

The primary purpose of the study was to explore the functioning of the task by analyzing the test–retest reliability of its different indices and sources of its validity (relationship with SA, depression, and LMS). The secondary purpose of the study was to analyze the relationship between interpretation bias, as measured by the new interpretation bias task with ambiguous faces, and symptoms of SA and depression, and LMS. Concerning the primary purpose, we hypothesized that (1) the angrier the displayed faces, the higher the rate of negative responses would be; (2) test–retest correlation coefficients would be similar to those reported by other studies that use the dot probe, the conceptually and methodologically most similar task for which test–retest reliability has previously been assessed; and (3) the indexes of the task would be associated with SA, depression, and LMS, and would discriminate between groups with low and high SA and depression symptoms. In relation to the secondary purpose, we hypothesized that people with higher symptoms of SA and depression would label more ambiguous faces as negative, would need more time to label happy faces as positive, and would need less time to label angry faces as negative. The same tendency toward negative evaluation, slower response to label happy faces as positive, and faster response to label angry faces as negative would also be associated with the social LMS but, according to the literature (Riskind & Williams, 2005), not with the physical one.

Method

Design and Participants

We carried out a two-wave longitudinal study, including 864 first-year college or vocational training students from 10 different centers of Bizkaia (Spain). We calculated a posteriori power with G*Power (Faul et al., 2009), which indicated that, with the number of participants and a power of 80%, the study would be able to detect correlations of r = 0.10 for cross-sectional analyses and of r = 0.30 for test–retest correlations. The age of the participants ranged from 15 to 29 years (mean age = 19.54, SD = 2.51, 44.4% women). In addition, 84 participants completed the measure again 1–2 months later in order to evaluate the test–retest reliability (mean age = 19.51, SD = 0.13, 81.9% women).
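As a rough cross-check of the sensitivity figures above, the smallest detectable correlation can be approximated with the Fisher z transformation. This is only a sketch: G*Power may use a different (exact) computation, and the two-tailed alpha of .05 and the function name `min_detectable_r` are assumptions that are not stated in the text.

```python
from math import sqrt, tanh
from statistics import NormalDist


def min_detectable_r(n, alpha=0.05, power=0.80):
    """Approximate smallest correlation detectable with n participants.

    Uses the Fisher z approximation for a two-tailed test (assumed here).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    # The standard error of Fisher's z is 1 / sqrt(n - 3).
    return tanh((z_alpha + z_power) / sqrt(n - 3))


r_cross_sectional = min_detectable_r(864)  # full sample
r_test_retest = min_detectable_r(84)       # retest subsample
```

Under these assumptions the approximation gives values close to the reported r = 0.10 for the full sample and r = 0.30 for the retest subsample.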

Instruments

The Spanish version (Olivares et al., 2005) of the Social Anxiety Scale for Adolescents (SAS-A, La Greca & Lopez, 1998) was used to assess SA. The SAS-A is composed of 18 items (e.g., “I am ashamed to be surrounded by people I do not know”) with three subscales: Fear of Negative Evaluation (FNE), Social Avoidance and Distress in New Situations or with Strangers (SAD-New), and Social Avoidance and Distress in General Situations or with People you Know (SAD-G). Each statement is rated on a five-point frequency-type scale ranging from 1 (never) to 5 (all the time). Studies have confirmed the internal consistency of the items and the three-factor structure of the SAS-A in a Spanish sample (Olivares et al., 2005). The overall internal consistency of the SAS-A was 0.93 (Cronbach’s alpha), with respective subscale coefficients of 0.89, 0.81, and 0.83 for the FNE, SAD-New, and SAD-G.

To measure depressive symptoms, the Center for Epidemiologic Studies-Depression (CES-D; Radloff, 1977, Spanish version: Calvete & Cardeñoso, 1999) was used. This is a 20-item questionnaire (e.g., “My appetite was poor”) rated on a four-point frequency scale from 0 (rarely) to 3 (all the time). Studies that have analyzed the psychometric properties of the CES-D in a Spanish sample have reported excellent internal consistency, good sensitivity (77.1%), and good specificity (79.4%) (Ruiz-Grosso et al., 2012). In this study, Cronbach’s alpha coefficient was 0.92.

The Spanish version of the Looming Maladaptive Style Questionnaire (LMSQ-R; González-Díez et al., 2014; Riskind et al., 2000) was used to assess LMS. The questionnaire describes six potentially stressful scenarios (three physical and three social) and measures the tendency to estimate whether the risk of a potential threat will increase as well as expectations regarding their deterioration over time. In order to create a shorter battery of questionnaires, in this study, we used two physical scenarios (e.g., heart attack) and two social ones (e.g., social meeting) to assess both physical and social looming styles. The participant had to imagine each situation and rate it on a five-point Likert scale, with three questions related to the expectation of the threat. Studies have confirmed two second-order factors (social looming and physical looming) and measurement invariance of the test across gender in a Spanish sample (González-Díez et al., 2014). In this study, the internal consistency for the LMSQ-R measured by Cronbach’s alpha coefficient was 0.86 for the total score, 0.82 for Social Looming, and 0.80 for Physical Looming.

Ambiguous Faces Interpretation Task

To develop a task for the evaluation of cognitive biases in the interpretation of ambiguous faces, a total of eight models (4 women and 4 men) were chosen from the Chicago Face Database (Ma et al., 2015). For each model, we selected three images—one classified as happy, one as neutral, and one as angry—following three criteria: (1) the model was not wearing any object such as glasses or a hair clip, (2) the model’s mouth was closed, and (3) the specifications of each image (framing, body postures of the models) allowed a clean morphing process. We used Morpheus Photo Morpher® v3.17 software to generate morphed faces. A set of ambiguous faces was generated by combining a happy face of each model with the respective neutral face of the same model and the other set by combining an angry face with the respective neutral face of the same model. As a result, a set of nine levels of morphed faces—including the actual happy, angry, and neutral faces—was obtained, as shown in Fig. 1. Ultimately, a 72-face stimulus battery was generated (nine per model).

Fig. 1

Graphic representation of the created stimuli: a continuum of nine levels of transformed faces. Note. We grouped the image levels into three groups: Happy block, Levels 1 to 3; Ambiguous block, Levels 4 to 6; and Angry block, Levels 7 to 9

The task was developed in JavaScript, following recommendations for stimulus presentation (Garaizar & Reips, 2019). The presentation of each of the 72 faces was preceded by an orange fixation cross shown for 500 ms on a white background. Subsequently, each face was presented once, in random order, in the center of the screen (640 px wide and 448 px high) on a white background until the participant responded. Afterward, a white screen was presented for 500 ms.
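The trial structure described above can be sketched as follows. The original task was written in JavaScript; this Python version is only an illustration, and the names `Trial`, `block_of`, and `build_trials` are hypothetical. It assumes eight models with nine morph levels each, grouped into blocks as in Fig. 1.

```python
import random
from dataclasses import dataclass

N_MODELS = 8       # four women and four men
N_LEVELS = 9       # level 1 = happy ... level 9 = angry
FIXATION_MS = 500  # orange fixation cross before each face
BLANK_MS = 500     # white screen after each response


@dataclass(frozen=True)
class Trial:
    model: int  # 1..8
    level: int  # 1..9
    fixation_ms: int = FIXATION_MS
    blank_ms: int = BLANK_MS


def block_of(level):
    """Map a morph level to its block (grouping taken from Fig. 1)."""
    if level <= 3:
        return "happy"
    if level <= 6:
        return "ambiguous"
    return "angry"


def build_trials(seed=None):
    """Return the 72 faces, each shown exactly once, in random order."""
    rng = random.Random(seed)
    trials = [Trial(m, lv)
              for m in range(1, N_MODELS + 1)
              for lv in range(1, N_LEVELS + 1)]
    rng.shuffle(trials)
    return trials
```

Each face remains on screen until a response is given, so only the fixation and blank durations are fixed in this sketch.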

While each face was on the screen, the participants were asked to indicate whether it showed a negative emotion, by pressing the N key on the keyboard, or a positive one, by pressing the P key. The participants were instructed to respond as quickly as possible. The task recorded the response and RT for each trial (for an illustration, see Fig. 2). Prior to the task, all participants completed four training trials with images not used in the task.

Fig. 2

Flow diagram of the interpretation bias task. Note. First, the “ + ” symbol is shown to catch the participant’s attention. Second, one of the 72 face images appears until the participant responds by pressing the “N” or “P” key on the keyboard: the “N” key for a face with a negative expression or the “P” key for a face with a positive one

Procedure

After providing informed consent, the participants completed the measures on a computer in a fixed order: sociodemographics, SAS-A, LMSQ-R, CES-D, and, finally, the ambiguous faces interpretation task. The battery of questionnaires and the task were presented on the Qualtrics® platform. After 1–2 months, 120 participants were contacted again to perform the interpretation bias task in person in the laboratory, and 84 people did so. There were no significant mean differences between the participants who chose to answer in the laboratory and those who did not in SA, t(118) = 0.23, p = 0.82, symptoms of depression, t(118) = 0.34, p = 0.73, social LMS, t(118) = 1.01, p = 0.32, or total score of LMS, t(118) = 1.98, p = 0.051. However, participants who declined to repeat the task in the laboratory (M = 46.48, SD = 4.63) had higher scores in physical LMS than those who attended (M = 43.79, SD = 7.02). This difference was significant, t(118) = 2.13, p = 0.035. The study was approved by the Ethics Committee of the [Masked].

Indices and Data Analysis

The main index of the interpretation bias task was the frequency of all responses marked as negative (F −). Three further frequency indexes were calculated according to image type: F(H −) was the number of happy faces (face levels 1–3) marked as negative, F(N −) was the number of neutral faces (face levels 4–6) marked as negative, and F(A −) was the number of angry faces (face levels 7–9) marked as negative. The rest of the indices were based on RT; when calculating them, data from cases more than 3 SD above or below the mean were discarded. This cutoff was applied to each stimulus group (Fig. 1). RT(H) was the mean RT for happy faces (face levels 1–3). RT(N) was the mean RT for neutral faces (face levels 4–6). For this index, two sub-indices were created: RT(N +), with only items marked as positive, and RT(N −), with only items marked as negative. RT(A) was the mean RT for angry faces (face levels 7–9). Finally, RT(Tot) was the mean RT across all trials. Given the low number of negative responses for happy faces (levels 1 to 3) and positive responses for angry faces (levels 7 to 9), it was not possible to calculate the specific RT based on a negative or positive response for those instances.
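The index definitions above can be summarized in a short sketch for a single participant. The function name `task_indices`, the tuple layout `(level, response, rt_ms)`, and the dictionary keys are illustrative assumptions; the ±3 SD trimming step is deliberately left out because its exact implementation is not specified here.

```python
from statistics import mean


def block_of(level):
    """Levels 1-3 = happy, 4-6 = neutral/ambiguous, 7-9 = angry."""
    return "H" if level <= 3 else ("N" if level <= 6 else "A")


def task_indices(trials):
    """Compute frequency and RT indices for one participant.

    trials: iterable of (level, response, rt_ms), with response "N"
    (negative) or "P" (positive). No outlier trimming is applied.
    """
    trials = list(trials)

    def n_negative(block=None):
        return sum(1 for lv, resp, _ in trials
                   if resp == "N" and (block is None or block_of(lv) == block))

    def mean_rt(block=None, response=None):
        rts = [rt for lv, resp, rt in trials
               if (block is None or block_of(lv) == block)
               and (response is None or resp == response)]
        return mean(rts) if rts else None  # None when no trial qualifies

    return {
        "F-": n_negative(),
        "F(H-)": n_negative("H"),
        "F(N-)": n_negative("N"),
        "F(A-)": n_negative("A"),
        "RT(H)": mean_rt("H"),
        "RT(N)": mean_rt("N"),
        "RT(N+)": mean_rt("N", "P"),
        "RT(N-)": mean_rt("N", "N"),
        "RT(A)": mean_rt("A"),
        "RT(Tot)": mean_rt(),
    }
```

Returning `None` for an empty cell mirrors the point made above that some response-specific RTs cannot be computed when too few trials fall into a category.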

The main analyses were carried out with SPSS® 24: Spearman’s rho correlations (rs) and Student’s t-tests. To correct for multiple comparisons in the t-tests, the Benjamini–Hochberg correction (false discovery rate, FDR) of p values was applied (Benjamini & Hochberg, 1995). Cohen’s d was calculated to measure effect sizes. Cronbach’s alpha was used to measure internal reliability and was calculated from the responses to each image (coded as “0” when positive and as “1” when negative). For the comparison between groups (e.g., low and high SA), two groups were formed: the third of participants with the lowest scores and the third with the highest scores. Cutoffs were x ≤ 14 for depression and x ≤ 35 for SA in the low symptom group and x ≥ 25 and x ≥ 47 in the high symptom group, respectively. To detect outliers in general task performance, we established a criterion of ± 3 SD in one of the indices of the task (1 outlier).
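Two of the computations named above, Cronbach’s alpha on the 0/1 response matrix and Benjamini–Hochberg adjusted p values, can be sketched in a few lines. The analyses in this study were run in SPSS; the code below is only an illustration, and the function names and the use of sample variances are assumptions.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a participants x items matrix.

    rows: one list per participant, one entry per image
    (0 = marked positive, 1 = marked negative).
    """
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def benjamini_hochberg(p_values):
    """Return BH (FDR) adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An adjusted p value below the chosen FDR level (e.g., .05) would then be treated as significant after correction.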

Results

Interpretation Bias Task: Descriptive Analysis and Test–Retest Reliability

As anticipated, only 10.38% of the positive images were interpreted as negative, whereas only 11.46% of the negative faces were interpreted as positive (Fig. 3A). Importantly, 53.21% of the block of ambiguous faces were evaluated negatively. The participants needed less time to interpret happy faces as positive (M = 984.61, SD = 345.73) than to interpret the neutral block as positive (M = 1297.23, SD = 522.17), and the difference was significant, t(854) = 23.73, p < 0.001, d = 0.71 (Fig. 3B). Similarly, negative faces were allocated faster to the negative condition (M = 1042.85, SD = 427.94) than neutral faces to the negative condition (M = 1237.28, SD = 563.59). This difference was also significant, t(853) = 18.05, p < 0.001, d = 0.39.

Fig. 3

The distribution of responses to the interpretation bias task. A Percentage of negative responses by image level. B Reaction time by image level and response type

In terms of reliability, the Cronbach alpha was 0.91. Regarding test–retest reliability, the main index, the number of face images marked as negative (F–), had a moderate test–retest correlation, rs(82) = 0.59, p < 0.001 (Fig. 4). The number of face images marked as negative in angry faces, F(A–), had the lowest test–retest coefficient, rs(82) = 0.29, p < 0.001, but the test–retest correlations of the number of images marked as negative with neutral faces and with happy faces were moderate, respectively 0.57, p < 0.001 and 0.41, p < 0.001. The test–retest correlation for the indices based on RT was also moderate and between 0.55 and 0.69. The rest of the test–retest correlations followed a similar trend (Table 1). Regarding the correlations measured at the same time, high correlations were found between the indices based on RT in both waves. Specifically, in wave 1, the correlation coefficients between indices ranged between 0.78 and 0.96.
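The test–retest coefficients reported above are Spearman rank correlations between a participant’s index at wave 1 and the same index at wave 2. A minimal standard-library sketch, using average ranks for ties, is shown below; the function names are illustrative, and the study itself used SPSS.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ssx * ssy)


def spearman_rho(wave1, wave2):
    """Spearman's rho = Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(wave1), average_ranks(wave2))
```

For example, `spearman_rho(f_neg_wave1, f_neg_wave2)` would give the stability of F− if the two hypothetical lists hold each retest participant’s score in the same order.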

Fig. 4

Scatter plot of the number of faces interpreted as negative at wave 1 against wave 2. Note. rs(82) = .59, p < .001, r² = .34. The sample that answered in wave 2 is shown. The elimination of the outliers was done in conjunction with wave 2. Therefore, the cases that might appear to be outliers are not in relation to the total sample

Table 1 Cross-sectional and test–retest Spearman’s rho correlations between the task indices

Association of Interpretation Bias with Depression, SA, and LMS

Table 2 shows the correlation matrix between the task indices and the rest of the variables. A positive and significant relationship was observed between the main index, F–, and depression and LMS. This trend was also observed for the number of neutral images interpreted as negative; that is, F(N–) also showed a positive and significant correlation with depression, SA, and LMS. F(A–) followed the same trend, except for depression, where the correlation was not significant. In the case of depressive symptoms, more symptoms were significantly associated with a more negative interpretation of happy faces. Regarding RT-based indexes, symptoms of depression were associated with a longer RT when interpreting happy faces, and social LMS was associated with RT(N +) and RT(H). However, the correlation coefficients in all cases were very low (r ≤ 0.22).

Table 2 Spearman’s rho correlation matrix between task indices, SAS-A, CES-D, and LMSQ-R

Mean Differences in Interpretation Bias in SA and Depression

When creating subgroups (upper vs. lower tercile) for SA, there were statistically significant differences in F(N −), t(577) = 3.07, p = 0.002, and d = 0.26; F(A −), t(536) = 3.82, p < 0.001, and d = 0.32; and F − , t(561) = 2.74, p = 0.006, and d = 0.23, in which the group high in SA marked significantly more images as negative (Table 3A). For depression, we found that the high depression group scored higher on F(H −), t(544) = 2.84, p = 0.005, and d = 0.24; F(N −), t(565) = 4.41, p < 0.001, and d = 0.37; F − , t(565) = 3.87, p < 0.001, and d = 0.32; RT(H), t(565) = 2.17, p = 0.03, and d = 0.18; and RT(N +), t(523) = 2.60, p = 0.01, and d = 0.22 than the low depression group (Table 3B). That is, the high depression group marked more happy and neutral images as negative. They also needed more time to interpret happy faces and to mark neutral faces as positive.
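The group comparison above rests on two steps: splitting participants by symptom cutoffs and expressing the mean difference as Cohen’s d. The sketch below illustrates both with the pooled-SD form of d; which variant of d was actually computed is not stated, so that choice, like the function names, is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def split_by_cutoffs(symptom_scores, index_values, low_max, high_min):
    """Return (low_group, high_group) values of a task index.

    Participants scoring between the two cutoffs are left out,
    as in an upper vs. lower tercile comparison.
    """
    pairs = list(zip(symptom_scores, index_values))
    low = [v for s, v in pairs if s <= low_max]
    high = [v for s, v in pairs if s >= high_min]
    return low, high


def cohens_d(high, low):
    """Cohen's d (high minus low) using the pooled standard deviation."""
    n1, n2 = len(high), len(low)
    pooled_sd = sqrt(((n1 - 1) * variance(high) + (n2 - 1) * variance(low))
                     / (n1 + n2 - 2))
    return (mean(high) - mean(low)) / pooled_sd
```

With the depression cutoffs given in the Method, a call might look like `split_by_cutoffs(cesd, f_neutral_neg, low_max=14, high_min=25)`, where `cesd` and `f_neutral_neg` are hypothetical per-participant lists.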

Table 3 Mean comparison in task indices between subgroups according to symptoms

Discussion

The objectives of this study were, first, to determine test–retest reliability and sources of validity, and second, to analyze the relationship between the indexes—those based on the negative interpretation of faces and those based on time response—of the interpretation bias task using ambiguous faces and symptoms of SA, depression, and LMS. The main findings were as follows: (1) the test–retest reliability of the main index of the task (i.e., the number of faces interpreted as negative) was moderate and the data supported the validity of the task through its association with SA, depression, and LMS; (2) SA, depressive symptoms, and LMS were associated with the number of ambiguous faces marked as negative.

The Task: Performance, Test–Retest Reliability, and Validity

The first objective explored the basic functioning of the task, its reliability (especially test–retest), and its validity. The basic functioning of the task was as expected: as the images approached anger, the participants marked a higher number of responses as negative. This trend is congruent with the findings of other studies (Jusyte & Schönenberg, 2014; Richards et al., 2002; Schofield et al., 2007). Also, the participants required less time when their answers were congruent with the image. That is, compared with the time needed to interpret neutral faces, the average RT was shorter when happy faces were marked as positive and angry faces as negative. In other words, there seems to be a higher cost when interpreting ambiguous faces.

With regard to reliability, the overall internal reliability was optimal, which demonstrates the homogeneity between items (the negative responses to the images). Concerning the test–retest reliability of the task, the overall frequency index, F–, which measures the tendency to evaluate faces as negative, obtained a test–retest correlation of medium effect size. As discussed in the introduction, none of the studies that used similar tasks had previously evaluated the stability of this type of measure (e.g., Garner et al., 2009; Maoz et al., 2016). However, compared to the dot probe, which is conceptually and methodologically the most similar task, the present study found higher stability in its main index than studies that used the traditional dot-probe methodology (Chapman et al., 2019) and remained at similar levels to those in a recent study that improved the stability indexes of the dot probe (Aday & Carlson, 2019). Nonetheless, because the dot probe is a different task, this comparison has to be interpreted with caution and may be unwarranted. Thus, this study provides a first indicator of test–retest reliability for an interpretation bias task based on ambiguous faces. The rest of the indexes based on the frequency of the images marked as negative showed low to moderate reliability. Although the indexes that assess the frequency with which happy and angry faces were interpreted as negative had the lowest test–retest correlations, the index based on a negative interpretation of neutral faces showed a moderate test–retest correlation and a very high correlation with the overall frequency index (F −). These data suggest that the reliability of the task is based mainly on the number of responses marked as negative for neutral faces.

The analysis of the test–retest reliability of the RT indices revealed coefficients between moderate and high. In addition, the correlation coefficients from wave 1 to wave 2 were very similar across the different RT indices. For example, correlations from wave 1 to wave 2 of the same RT index (e.g., RT(H) in wave 1 and RT(H) in wave 2) were no higher than the correlations between different indices across waves (e.g., RT(H) in wave 1 and RT(N +) in wave 2). Given these results, it might be assumed that these indices are representations of the very same construct, such as “response time,” and that they might not differ from one another. Therefore, the methodology that was used may not be relevant for explaining individual differences in response time.

Regarding data acquisition, it is important to note that the task was performed online in the first wave, whereas in the second wave it was performed under controlled conditions in the laboratory. These data suggest that the reliability is good even when the task is performed in an online format. In fact, the test–retest reliability would be expected to be higher if both waves had been evaluated under controlled conditions in a laboratory. Future studies could compare reliability by taking into account the method of application (online vs. laboratory). For example, participants who used larger screens may have been able to capture more detail of the images compared to participants with smaller screens. In addition, although it was not an objective of the study, it is important to remark that the test–retest coefficient is also an indicator of the stability of the construct. Thus, although interpretation biases can be affected when experiencing anxiety, we can conclude that this construct has, at least, some degree of stability across time.

The present work also had the general aim of developing a task that captures interpretation biases relevant to symptoms of depression and social anxiety. The main task indexes (those based on the number of negative responses given to images) were consistently associated with SA, depression, and LMS. This consistent pattern of associations is a relevant source of evidence for the validity of the task. However, the effect sizes were low (r ≤ 0.20; d ≤ 0.37). This could be due to several reasons. First, the interpretation of ambiguous faces could be a characteristic that explains only a small part of this type of symptom. Second, different methods were used to assess interpretation biases (computerized tasks) and the other variables (self-report instruments). Low correlations between related variables measured with different methods are common in the assessment of some psychological constructs (Morea & Calvete, 2020; Reinholdt-Dunne et al., 2013). Finally, for various reasons, such as the lack of ecological validity, computerized tasks may not fully capture the construct they are intended to measure; that is, the faces are not presented in a natural, ecologically valid context. Future studies could continue this line of research with the goal of understanding the best way to capture this type of bias, for example, by analyzing whether surprised faces (Mueller et al., 2020) capture this bias better. In addition, future studies could analyze which methodology best captures biases, for example, by comparing a variety of methods in experimental research.

Relationship Across Interpretation Bias Task, SA, Depression, and LMS

Regarding the second objective, the results showed a positive association between the number of images marked as negative and symptoms of SA and depression. In addition, the group with high symptoms of depression and SA marked more images as negative than the low-symptoms group. The findings are in line with other studies that found such a bias toward negative interpretations of ambiguous faces in SA (Coles et al., 2008; Garner et al., 2009; Maoz et al., 2016; Schofield et al., 2007) and depression (Beevers et al., 2009; Joormann & Gotlib, 2006), but the effect sizes were low. This suggests that the interpretation bias of ambiguous faces explains a small part of SA and depression symptoms (approximately 2–4%). More specifically, in SA, other studies have shown that people with SA interpret ambiguous faces more negatively, reporting that faces indicate more rejection (Schofield et al., 2007), interpreting faces as less trustworthy (Gutiérrez-García & Calvo, 2016), or classifying more faces as angry (Joormann & Gotlib, 2006; Maoz et al., 2016). This trend has also been observed in depression. In general, people with depressive symptoms have been found to interpret faces more negatively (Beevers et al., 2009; Joormann & Gotlib, 2006; Surguladze et al., 2004).
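As a rough check, the upper bounds on the effect sizes reported in this discussion (r ≤ 0.20; d ≤ 0.37) can be converted to a proportion of shared variance. This assumes that the percentage above refers to the squared correlation, and the conversion from d to r additionally assumes two groups of equal size; neither assumption is stated in the text.

```latex
r^2 = (0.20)^2 = 0.04 \;(\approx 4\%), \qquad
r = \frac{d}{\sqrt{d^2 + 4}} = \frac{0.37}{\sqrt{0.37^2 + 4}} \approx 0.18, \qquad
r^2 \approx 0.03 \;(\approx 3\%)
```

Under these assumptions, both upper bounds fall within the reported range of approximately 2–4%.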

With regard to the association between the temporal cost (RT) of performing the interpretation and psychological symptoms, the data indicate that depressive symptoms are associated with a higher cost in interpreting ambiguous faces as positive and happy faces as positive or negative. These results are congruent with those of another study in which a depressed group needed more time to interpret happy faces compared with the control group (Leppänen et al., 2004). This makes sense considering that people with depressive symptoms have feelings of sadness and slower reasoning and are biased toward negative interpretations in different contexts (Everaert et al., 2017). Although a similar pattern was observed with the symptoms of SA (the more symptoms, the greater the temporal cost), the absence of statistical significance did not permit the inference of a relationship between SA and a higher cost in interpreting ambiguous faces as positive or in interpreting positive faces.

Another goal of the study was to examine the relationship between interpretation biases and LMS. The findings indicate that those who interpreted more ambiguous images as negative had a higher social LMS score. These results are in line with the study of Riskind and colleagues (2000) that associated the LMS with three different tasks of interpretation and memory bias. Contrary to what had been hypothesized, this relationship is present in both social and physical (e.g., heart attack sensations) scenarios. However, the time required to interpret neutral images as positive was only associated with the social LMS. These data support studies that relate LMS to more SA symptoms (Brown & Stopa, 2008; González-Díez et al., 2014, 2017). Conceptually, both LMS and interpretation bias are cognitive styles that drive negative interpretations of the context (Riskind et al., 2000). Although the data must be treated with caution due to the small size of the relationships between this vulnerability and the scores in the task, this is the first study that associated LMS with the interpretation bias assessed with ambiguous faces. The data from this study add more evidence to the relevance of the LMS in the relationship of variables associated with SA (Brown & Stopa, 2008; Calvete et al., 2016) and, more specifically, to cognitive biases (Riskind et al., 2000). This study opens the possibility for future studies to explore whether LMS, as hypothesized, influences or guides interpretation biases.

Limitations and Future Studies

Some limitations require discussion. The first was the high dropout rate for the laboratory task; of 130 invited students, only 84 completed the laboratory assessment (test–retest analysis). The sample that accepted the laboratory assessment could have particular characteristics; for example, people who agreed to participate may have been more interested in the study and, therefore, may have followed the instructions more enthusiastically in both the online and laboratory contexts. Any differences in characteristics between the people who agreed to participate and those who did not could limit the generalization of the results. Second, the first wave was completed entirely online, so it is unknown whether the participants followed the guidelines for carrying out the task (e.g., not performing any other action that could be distracting). Likewise, one problem with online assessments in general is that it is hard to control the diversity of the devices that are used. For example, participants who used larger screens may have been able to capture more detail of the images compared to participants with smaller screens. Finally, although this was not an objective of the study, the variability between studies in task methodologies makes it impossible to conclude whether the changes made in the present study improved the capture of the interpretation bias. To do that, an experimental study comparing different methodologies is needed. Thus, considering these limitations, future studies could analyze the reliability in both online and laboratory contexts (both waves online or both in the laboratory). To answer other questions related to the use of tasks that measure interpretation biases with ambiguous faces, future studies could also compare different types of tasks with different stimuli or question types (e.g., emotion identification vs. face valence).

Conclusion

People with greater SA symptoms and greater depressive symptoms interpreted more morphed faces as negative. Moreover, the interpretation biases of ambiguous faces were associated with LMS. Finally, one of the main contributions of this study is the examination of the test–retest reliability of the emotion-recognition task. The results provide evidence of the validity and reliability of the task. Thus, the task could be used over time, for example, in longitudinal studies or to assess the effects of an intervention on cognitive biases. Although more studies are needed to examine the reliability of this type of task, the reliability and validity data are satisfactory and support its longitudinal use.