Introduction

Imagine that you are expecting to meet your parents at the airport’s arrival gate. Will you be faster to notice your parents’ arrival when both of their faces (mom’s and dad’s) show up at the gate, compared with a case in which only one of their faces (dad’s) appears? Should your ability to detect your parents’ arrival improve due to the redundancy of faces? This mundane example captures important practical and theoretical issues in the study of face perception, which the current study aims to address. Faces convey a great deal of information regarding a host of physical and social dimensions (Bruce & Young, 1986; Fitousi, 2020b). Efficient detection and recognition of faces is critical for our survival. It therefore comes as no surprise that many researchers ascribe faces a special status (Diamond & Carey, 1986; Farah et al., 1998; Palermo & Rhodes, 2007; Yin, 1969; Young et al., 1987). They argue that faces (a) are processed holistically (Tanaka & Farah, 1993), (b) capture attention automatically (Langton et al., 2008), (c) draw on minimal mental resources (Lavie et al., 2003), and (d) induce parallel rather than serial processing (Hansen & Hansen, 1988; Hershler & Hochstein, 2005). However, these hypotheses are not universally accepted. Several researchers cast doubt on some, or all, of these claims (Fitousi, 2015; VanRullen, 2006). After all, faces are relatively complex stimuli that pose stringent demands on our perceptual and computational systems (VanRullen, 2006). As such, they might be expected to exhaust mental resources and to be processed in a serial fashion. The current study focuses on Hypotheses (b)–(d) and sets out to test them in a rigorous fashion.

Here, I have examined the attentional characteristics of faces in a visual detection task known as the redundant target task (Colonius & Diederich, 2004, 2020; Diederich & Colonius, 1991; Fitousi & Wenger, 2013; Miller, 1982; Townsend & Nozawa, 1995, 1997). Participants were asked to detect a face target appearing on a neutral background. In a portion of the trials, two faces (of different personal identities), rather than one, showed up (i.e., a double target). The question of interest was whether, and how, performance is affected by such redundancy. This experimental design is simple, but it can uncover fundamental characteristics of processing, especially when analyzed with a state-of-the-art methodology known as Systems Factorial Technology (SFT; Townsend & Nozawa, 1995). SFT is an allied set of theory and methodology developed to answer foundational questions about processing. It addresses four independent processing characteristics: (a) architecture (serial vs. parallel), (b) stopping rule (exhaustive vs. self-terminating), (c) capacity (limited, unlimited, or supercapacity), and (d) independence (correlated vs. independent channels). SFT can therefore provide a strong test of the hypotheses presented at the outset.

Previous SFT investigations of face perception (Cheng et al., 2018; Donnelly et al., 2012; Fific & Townsend, 2010; Fitousi, 2015; Ingvalson & Wenger, 2005; Wenger & Townsend, 2006; Yang et al., 2018) were particularly interested in the question of facial holism. Those studies examined the dimensional integration of face parts or features within a single face. The current study, in contrast, is concerned with the processing of multiple faces, examining how information is integrated from several faces presented at once. To anticipate the outcomes, I find that redundant faces are processed in a serial, self-terminating fashion and are subject to limited capacity. These results challenge the idea that faces are processed automatically and without attention.

Do faces require attention?

Vision scientists distinguish between low-level and high-level features (Julesz, 1984; Treisman & Gelade, 1980). The former refer to simple features such as color and shape, while the latter to complex sets of features, such as faces and objects. It is assumed that the two classes differ with respect to (a) the amount of attention they require, and (b) their processing architecture (parallel vs. serial). Faces naturally belong in the high-level category of stimuli, and it is expected that they will exact a toll on mental resources and be processed according to a serial architecture (VanRullen, 2006). On the other hand, faces are crucial for survival, and as such, may be processed automatically, in parallel and with minimal mental resources (New et al., 2007). Researchers have not reached a consensus on this issue, although the balance of the evidence tends toward the view that face processing is automatic—namely, mandatory, inattentive, resource-free, and parallel (Palermo & Rhodes, 2007).

One line of research that speaks to the attentional requirements of faces comes from studies that measure the influence of a distractor face on responses to nonface targets (Langton et al., 2008; Ro et al., 2007). In these studies, participants look for a target (nonface) object that is embedded among other nonface objects. Occasionally, a face is presented among these distractors. The mere presence of a face in the display is sufficient to cause an increase in mean response times (RTs), compared with displays in which a face does not appear. These studies suggest that face perception is mandatory and automatic. According to this view, faces attract attention even when they are not part of the task at hand, just like low-level color or shape singletons (Theeuwes, 1992).

Lavie et al. (2003) employed a conflict task with object or face names as targets and pictures as distractors, while also manipulating the degree of perceptual load in the display. A typical trial presented a word (e.g., trumpet) flanked by a picture of either a congruent (e.g., trumpet) or incongruent (e.g., violin) object. In a low-load trial, the target word was presented alone, whereas in a high-load trial the target word was embedded among several other words. Comparable face displays presented the name of a familiar person (e.g., Tom Cruise) as the target and a face (e.g., George Clooney) as the distractor. The results showed that high perceptual load eliminated the flanker interference for target objects, but not for target faces. Lavie et al. argued that for object names, perceptual load exhausted mental resources and therefore increased selective attention to the target word. As for faces, Lavie et al. claimed that this perceptual load mechanism (Lavie, 1995) does not operate on faces, because faces do not consume much of the available mental resources, making them immune to the influence of perceptual load (but see Fitousi & Wenger, 2011).

Another source of evidence for parallel, mandatory (exhaustive), and unlimited-capacity processing of faces comes from studies on statistical processing of ensembles of faces. When confronted with a set of faces, people can rapidly and accurately estimate the average identity, emotion, or gender of the set (Haberman & Whitney, 2007; Leib et al., 2014), a remarkable ability comparable to that found with other, low-level sets of features (Ariely, 2001; Chong & Treisman, 2003). The existence of a rich and detailed representation of multiple faces in a scene lends some support to the view that faces are processed in a parallel-coactive, exhaustive, and capacity-unlimited (or even supercapacity) fashion. In a coactive system (Miller, 1982), the evidence from several faces feeds into a single summation accumulator, and a response is emitted when this accumulator reaches a threshold. Note that for a rapid and accurate estimation of the mean of a set of objects, the observer must process all of the items in the set. Moreover, the evidence from all items must somehow be integrated. The latter requirement entails coactive or interactive architectures.

Won and Jiang (2013) tested how redundancy influences ensemble coding with faces. They presented observers with either a single face or four faces. Participants were faster to judge the facial expression and gender of displays with multiple faces than of displays with a single face. To test whether this redundancy gain was due to increased strength of the perceptual representation, the authors measured the magnitude of the facial expression aftereffect. They found it to be of equal magnitude for single-face and four-face displays, ruling out an increase in perceptual strength due to coactivation, while supporting parallel processing due to statistical summation (Miller, 1982). These results are not fully diagnostic, however, because the authors did not perform the routine distributional analyses, which include Miller’s inequality (Miller, 1982) and the capacity coefficient (Townsend & Nozawa, 1995; Townsend & Wenger, 2004b), needed to decide between the two models. These analyses and related formal approaches are applied in the current study.

Visual search with faces: Serial or parallel?

An important line of research that addresses the attentional requirements of faces comes from studies that harnessed visual search slopes (Brown et al., 1997; Hansen & Hansen, 1988; Hershler & Hochstein, 2005; Kuehn & Jolicoeur, 1994; Nothdurft, 1993; Purcell et al., 1996; Suzuki & Cavanagh, 1995). These studies asked how efficiently people can search for a face target among other distractors. At least since feature integration theory (Treisman & Gelade, 1980), researchers have conjectured that search for conjunctions of features and for high-level features, such as faces, requires attention and is accomplished via a serial architecture, whereas search for low-level features (e.g., dots, lines) may not require attention and proceeds in parallel (Pashler, 1987; Wolfe, 1994). It should be noted, though, that this dichotomy has recently been questioned. Several studies have shown that even a color target can require attention and be processed serially (Fitousi, 2019a; Wolfe & Horowitz, 2017).

To distinguish serial from parallel processing, researchers have used visual search slopes. This methodology examines how mean RTs are affected by the number of items in the display (Duncan & Humphreys, 1989; Wolfe, 1994). Parallel search is marked by flat search slopes, in which search time is independent of set size. In such cases, it is assumed that the target “pops out.” Serial search is identified by increasing search slopes, which indicate that a constant amount of time is required to process each additional item in the display. This logic has been severely criticized as inadequate (Algom et al., 2015; Townsend, 1971, 1990; Townsend & Wenger, 2004a). Nevertheless, a large body of research on faces relies on this approach.

Typically, in the visual-search methodology, researchers have presented participants with a face target embedded among nonface distractors, while manipulating the number of distractors (i.e., display set size), and recording search response times. Some of these studies have recorded flat visual search slopes, interpreting them as evidence for preattentive parallel processing (Hansen & Hansen, 1988; Hershler & Hochstein, 2005; Suzuki & Cavanagh, 1995), while others have documented linearly increasing search slopes, interpreting them as evidence for attentive serial processing (Brown et al., 1997; Kuehn & Jolicoeur, 1994; Nothdurft, 1993; Purcell et al., 1996).

The glaring disagreement among researchers concerning the architecture of visual search for faces is best exemplified by the work of Hershler and Hochstein (2005) and the follow-up study by VanRullen (2006). The former asked their participants to search for a face (the target) among objects (the distractors), or for an object among faces. They found that for face search, mean RTs remained constant across levels of set size (zero-slope search). This pop-out effect was not observed for objects, suggesting that faces are processed preattentively, holistically, and in parallel. However, VanRullen (2006) showed that in a blocked design, object search can also produce the pop-out effect. Moreover, when low-level visual aspects were controlled for, through manipulations of inversion or Fourier transformation, the pop-out effect was eliminated. VanRullen concluded that “in practice . . . there can never be a truly high-level parallel search, and . . . any report of high-level pop out effect . . . must conceal one or more low-level confounds” (p. 3025).

As noted earlier, the logic sustaining the search-slopes methodology has been proven to be flawed (Algom et al., 2015; Townsend, 1971, 1990; Townsend & Wenger, 2004a). The major issue that thwarts this methodology is often called the mimicry problem (Algom et al., 2015; Townsend, 1990). Mathematical derivations (Townsend, 1971; Townsend & Wenger, 2004a) clearly show that a given empirical search slope can be generated (or mimicked) not by one, but by several competing theoretical architectures. Many researchers have ignored the proof that a serial mean RT function can be mimicked by a parallel one, and vice versa. Take, for example, one of the central tenets of this logic: that an increasing visual search slope is a marker of serial processing. Townsend has demonstrated, at a general level and without distributional assumptions, that such a pattern can be mimicked by a parallel system with limited capacity (Townsend, 1971). The upshot is that arguments in favor of a given architecture (serial vs. parallel) that are based on the shape of the empirical search slope are unwarranted.

Beyond this mimicry problem, it is now clear that architecture (serial vs. parallel) and capacity (limited, unlimited, super) are two independent and testable aspects of processing (Fitousi & Algom, 2018, 2020; Townsend & Wenger, 2004a). The work of Townsend and colleagues (Algom et al., 2015; Townsend, 1971, 1990; Townsend & Wenger, 2004a) has shown that a correct and complete account of search slopes requires consideration of at least three independent aspects of processing: architecture (serial, parallel), stopping rule (self-terminating, exhaustive), and capacity (limited, unlimited, super). Therefore, the conclusions drawn from visual search tasks with faces might not be informative. What is needed is a theory-based approach that can provide strong and independent tests of architecture and processing capacity within a unified conceptual framework. The SFT is such an approach. The current study deployed SFT (Townsend & Nozawa, 1995) to uncover the basic processing characteristics of displays with multiple faces.

A synopsis of SFT applied to face detection

SFT addresses three fundamental processing characteristics: (a) architecture (parallel, serial, coactive), (b) stopping rule (self-terminating, exhaustive), and (c) the effects of variations in workload (capacity: limited, unlimited, super). The SFT has been applied to a variety of simple visual-detection and memory-search tasks (Algom et al., 2017; Eidels et al., 2010). In the current study, the task was a detection task with faces.

I have used a version of the SFT known as the double factorial paradigm (DFP; Townsend & Nozawa, 1995). I illustrate the paradigm as it applies to the face stimuli in Fig. 3. The first factor in this double factorial design is the location of the face. There are four types of trials: single-target trials in which one face appears to the left of fixation, single-target trials in which one face appears to the right, double-target trials in which two faces are presented (one to the left and one to the right of fixation), and no-target trials in which the display contains no target. Participants decided, while timed, whether a face target was present (YES) or absent (NO). This is an OR version of the task, because a YES response requires that a face be detected on the left, on the right, or on both sides.

In this design, the location of the face (left, right) served as one factor. An important aspect of the DFP is that a second factor, salience, is manipulated independently of the first, by increasing or decreasing independently the visibility of the faces on the right and left sides. In the current study, this was accomplished by degrading the face (in the low-salience condition) with a masking pattern. The salience manipulation is assumed to selectively influence the respective processing of the right and left faces (Dzhafarov, 1999; Sternberg, 1969; Townsend & Nozawa, 1995). Selective influence amounts to faster and more efficient processing of salient targets than of less-salient targets at each location. Moreover, rendering the left face more salient does not influence the processing of the right face, and presenting a salient right face does not affect the processing of the left face. Selective influence was tested via the ordering of the survivor functions (Townsend & Nozawa, 1995).

Architecture and stopping rule

Figure 1a depicts five possible combinations of architectures (serial, parallel) and stopping rules (self-terminating, exhaustive) that can be identified through SFT: (1) in a serial self-terminating (minimum-time) system, the two faces are processed one after the other, and processing halts when the processing of the first face target is completed; (2) in a serial exhaustive system, the faces are processed one after the other, and processing of the entire set is mandatory; (3) in a parallel self-terminating system, all faces are processed simultaneously, and processing halts when a target is detected; (4) in a parallel exhaustive system, the faces are processed simultaneously, and a response is emitted only when processing of both targets is completed; and, finally, (5) in a coactive system, which is a special case of a parallel system, the evidence from the two faces is pooled, and a response is emitted only when the summed evidence crosses a predefined bound.
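To make these five candidate systems concrete, the following minimal simulation sketch (not part of the original study; the channel rates and residual time are hypothetical choices of mine) generates double-target RTs under each system, assuming exponentially distributed channel completion times:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000      # simulated double-target trials
BASE = 300.0    # hypothetical residual (non-decision) time, ms
MEAN = 150.0    # hypothetical mean completion time of each face channel, ms

left = rng.exponential(MEAN, N)    # completion time of the left-face channel
right = rng.exponential(MEAN, N)   # completion time of the right-face channel

rt = {
    # Serial self-terminating: the first face inspected is a target,
    # so processing halts after a single channel finishes.
    "serial self-terminating": BASE + left,
    # Serial exhaustive: both faces are processed one after the other.
    "serial exhaustive": BASE + left + right,
    # Parallel self-terminating: both channels race; the first detection wins.
    "parallel self-terminating": BASE + np.minimum(left, right),
    # Parallel exhaustive: the response waits for the slower channel.
    "parallel exhaustive": BASE + np.maximum(left, right),
    # Coactive: pooled evidence, idealized here as a single accumulator
    # whose rate is the sum of the two channel rates.
    "coactive": BASE + rng.exponential(1.0 / (2.0 / MEAN), N),
}
for name, x in rt.items():
    print(f"{name:27s} mean RT = {x.mean():6.1f} ms")
```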

Fig. 1 Five possible architectures and their stopping rules as diagnosed in SFT. a Graphic illustration of each architecture. b Their signature mean interaction contrast (MIC) patterns. c Their signature survivor interaction contrast (SIC) patterns

Inference regarding architecture and stopping rule is framed at the level of the mean and at the level of the distribution. Let the subscripts H and L indicate the salience level at each location (left, right), with H indicating that the face at that location was of high salience and L indicating that it was of low salience. Let the order of the subscripts be the state of the left face first and the state of the right face second. Thus, \( {\overline{RT}}_{\mathrm{LH}} \) will refer to the mean RT when the left face is of low salience and the right face is of high salience. The first RT measure is an interaction contrast computed on the means of the 2 × 2 factorial design formed by the salience manipulation at the two locations. This interaction contrast is comparable to the one developed by Sternberg, which played a central role in his additive factors logic (Sternberg, 1969). The interaction contrast at the level of mean RT (MIC) is defined as:

$$ MIC={\overline{RT}}_{\mathrm{LL}}-{\overline{RT}}_{\mathrm{LH}}-{\overline{RT}}_{\mathrm{HL}}+{\overline{RT}}_{\mathrm{HH}}. $$
(1)

The MIC can be either zero, positive, or negative. When MIC = 0, clearly there is no interaction. When MIC > 0, the interaction is overadditive, and when MIC < 0, it is underadditive.

The factorial plots depicted in Fig. 1b provide clues concerning the processing architecture and stopping rule. If processing is serial, the MIC should equal zero, regardless of the actual stopping rule (self-terminating or exhaustive). If processing is parallel self-terminating or coactive, the interaction should be overadditive (MIC > 0). This is due to the single combination LL, which is expected to produce significantly slower RTs than the other three combinations (HL, LH, HH). Lastly, parallel exhaustive processing should lead to an underadditive pattern. This is because RT in this stopping rule is determined by the slower of the two locations, which entails fast performance only for the HH combination. Formal proofs of these intuitions can be found in Townsend and Nozawa (1995).
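As a concrete illustration, here is a minimal sketch of the MIC computation in Eq. 1; the cell means are hypothetical numbers of mine, not data from the experiment:

```python
def mic(rt_ll, rt_lh, rt_hl, rt_hh):
    """Mean interaction contrast over the 2 x 2 double-target salience design."""
    return rt_ll - rt_lh - rt_hl + rt_hh

# Additive pattern (MIC = 0), the signature of serial processing:
print(mic(rt_ll=620, rt_lh=560, rt_hl=555, rt_hh=495))   # -> 0
# Overadditive pattern (MIC > 0), expected under parallel
# self-terminating or coactive processing:
print(mic(rt_ll=680, rt_lh=560, rt_hl=555, rt_hh=495))   # -> 60
```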

The second RT measure in SFT is the corresponding distributional contrast. This contrast is framed at the level of the survivor function of the RT distribution:

$$ S(t)=P\left(T>t\right) $$
(2)
$$ =1-P\left(T\le t\right) $$
(3)
$$ =1-F(t), $$
(4)

where F(t) is the cumulative distribution function. This survivor function interaction contrast (SIC) is defined as:

$$ SIC(t)={S}_{LL}(t)-{S}_{LH}(t)-{S}_{HL}(t)+{S}_{HH}(t). $$
(5)

In contrast to the MIC, the SIC(t) is a continuous function that depicts, for each t, the interaction contrast computed on the survivor functions. For serial self-terminating processing, the SIC(t) equals zero at all times t. For serial exhaustive processing, the SIC(t) is negative at early values of t and positive at later values, with the negative and positive regions having equal areas. The SIC(t) produces distinctive predictions for parallel self-terminating and coactive processing: for the former, it is positive at all times, whereas for the latter, it starts negative and ends positive. The SIC(t) for the parallel exhaustive system is negative at all times. Table 1 summarizes the predictions for architecture and stopping rules based on the combined MIC and SIC(t) patterns.
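A minimal sketch of the SIC(t) computation (Eq. 5) from raw RTs is given below; the survivor functions are estimated empirically, and the RT samples are hypothetical stand-ins for the four double-target cells:

```python
import numpy as np

def survivor(rts, grid):
    """Empirical survivor function S(t) = P(T > t), evaluated on a time grid."""
    rts = np.asarray(rts)
    return np.array([(rts > t).mean() for t in grid])

def sic(rt_ll, rt_lh, rt_hl, rt_hh, grid):
    """Survivor interaction contrast, Eq. 5, evaluated on a time grid."""
    return (survivor(rt_ll, grid) - survivor(rt_lh, grid)
            - survivor(rt_hl, grid) + survivor(rt_hh, grid))

grid = np.arange(200, 1600, 10)   # time grid in ms
rng = np.random.default_rng(1)
rts = {cell: 300 + rng.gamma(shape=k, scale=60, size=2000)   # hypothetical RTs
       for cell, k in (("LL", 6), ("LH", 5), ("HL", 5), ("HH", 4))}
curve = sic(rts["LL"], rts["LH"], rts["HL"], rts["HH"], grid)
```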

Table 1 Summary of mean (MIC) and survivor function interaction contrasts (SIC) for alternative architectures and stopping rules

Capacity: The capacity coefficient

The basic factorial design (see Fig. 3) can be used to measure capacity. The question of capacity is cast in SFT in terms of the influence of workload on processing efficiency. How does processing efficiency change as a function of the number of target faces in the display? Do participants double their effort when detecting two faces compared with one? The capacity coefficient for the OR task can answer such questions. It is defined as:

$$ {C}_{OR}(t)=\frac{H_{LR}(t)}{H_L(t)+{H}_R(t)} $$
(6)

where:

$$ H(t)={\int}_0^t h(s)\, ds $$
(7)
$$ =-\ln \left[S(t)\right], $$
(8)

is the integrated hazard function, a measure of the cumulative amount of work accomplished by time t, which is equal to the negative log of the survivor function (Townsend & Ashby, 1978). When C(t) = 1, the system is of unlimited capacity, whereas C(t) < 1 indicates that the system is of limited capacity, and C(t) > 1 indicates supercapacity (Townsend & Eidels, 2011).
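A minimal sketch of estimating C_OR(t) from data follows (my own illustration, not the original analysis code); the integrated hazard is obtained as −ln S(t) from the empirical survivor function, clipping keeps the logarithm finite, and the RT arrays are hypothetical:

```python
import numpy as np

def cum_hazard(rts, grid):
    """Integrated hazard H(t) = -ln S(t), from the empirical survivor function."""
    rts = np.asarray(rts)
    s = np.array([(rts > t).mean() for t in grid])
    return -np.log(np.clip(s, 1e-6, 1.0))   # clip avoids log(0)

def c_or(rt_double, rt_left, rt_right, grid):
    """Capacity coefficient of Eq. 6, evaluated on a time grid."""
    num = cum_hazard(rt_double, grid)
    denom = cum_hazard(rt_left, grid) + cum_hazard(rt_right, grid)
    # C(t) is undefined at times before any work has accrued in the
    # single-target conditions; return NaN there.
    return np.where(denom > 0, num / np.where(denom > 0, denom, 1.0), np.nan)
```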

Capacity: The Miller and Grice inequalities

Is performance better with double-target faces than with a single-target face? Do observers benefit from the redundant target condition? The race-model inequality suggested by Miller (1982) can answer these questions:

$$ {F}_{LR}(t)\le \kern0.5em {F}_L(t)+{F}_R(t) $$
(9)

in which \( F_{LR}(t) \) is the cumulative distribution function (CDF) for the double-target condition, while \( F_L(t) \) and \( F_R(t) \) are the CDFs for the single-target conditions at the left and right locations, respectively. The inequality entails that at any point in time t, performance in the double-target condition cannot exceed the combined performance in the two single-target conditions, as indexed by the sum of their CDFs, provided that detection of a face on the right is independent of detection of a face on the left, and vice versa. Violations of the inequality have been interpreted as evidence against a parallel race and in favor of coactivation (but see Townsend & Nozawa, 1997). The Miller inequality sets an upper bound on performance with redundant-target displays. A lower bound on performance is provided by the Grice inequality (Grice et al., 1984):

$$ {F}_{LR}(t)\ge \kern0.5em \operatorname{MAX}\left[{F}_L(t),{F}_R(t)\right] $$
(10)

According to the Grice bound, performance in the double-target condition should be equal to or better than the better of the performances in the two single-target conditions. Thus, at any point in time t, the faster of the two single-target CDFs cannot exceed that of the double target. Violation of the Grice inequality entails extremely limited capacity of the system (Townsend & Wenger, 2004b).
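The two bounds can be checked directly on the empirical CDFs, as in the following sketch (the RT arrays are hypothetical placeholders; statistical tests of the violations are omitted):

```python
import numpy as np

def ecdf(rts, grid):
    """Empirical cumulative distribution function F(t) on a time grid."""
    rts = np.asarray(rts)
    return np.array([(rts <= t).mean() for t in grid])

def check_bounds(rt_double, rt_left, rt_right, grid):
    f_lr = ecdf(rt_double, grid)
    f_l, f_r = ecdf(rt_left, grid), ecdf(rt_right, grid)
    miller_violated = np.any(f_lr > f_l + f_r)            # Eq. 9 broken: coactivation suspected
    grice_violated = np.any(f_lr < np.maximum(f_l, f_r))  # Eq. 10 broken: very limited capacity
    return miller_violated, grice_violated
```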

Capacity: Unified space for capacity measures

Townsend and Eidels (2011) have shown that the Miller and Grice inequalities can be expressed in terms of the capacity coefficient, and thus be plotted on a unified space. The Miller and the Grice inequalities take the following respective forms:

$$ C(t)\le \frac{\log \left[{S}_L(t)+{S}_R(t)-1\right]}{\log \left[{S}_L(t)\times {S}_R(t)\right]} $$
(11)
$$ C(t)\ge \frac{\log \left(\operatorname{MIN}\left[{S}_L(t),{S}_R(t)\right]\right)}{\log \left[{S}_L(t)\times {S}_R(t)\right]} $$
(12)

In this unified space, a violation of the Miller inequality is indicated when the capacity coefficient function hovers above the Miller bound. A violation of the Grice inequality occurs if the Grice bound is located above the capacity coefficient function.
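A sketch of the two bounds in capacity space (Eqs. 11–12) is given below; this is my own illustration under the stated formulas, the survivor values are clipped away from 0 and 1 so the logarithms stay finite, and the Miller bound is left undefined at early times where S_L(t) + S_R(t) − 1 ≤ 0:

```python
import numpy as np

def capacity_space_bounds(s_left, s_right):
    """Miller (upper) and Grice (lower) bounds on C(t), per Eqs. 11-12."""
    s_l = np.clip(np.asarray(s_left), 1e-6, 1 - 1e-6)
    s_r = np.clip(np.asarray(s_right), 1e-6, 1 - 1e-6)
    denom = np.log(s_l * s_r)
    pooled = s_l + s_r - 1.0
    miller = np.where(pooled > 0,
                      np.log(np.where(pooled > 0, pooled, 1.0)) / denom,
                      np.nan)   # undefined where S_L + S_R <= 1
    grice = np.log(np.minimum(s_l, s_r)) / denom
    return miller, grice
```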

SFT interrogation of face detection

The goal of the present study is to test a set of contrasting hypotheses regarding the architecture, stopping rule, and capacity characteristics underlying the perception of multiple faces. In particular, the aim here is to examine how multiple (more than one) faces are processed. According to one set of hypotheses, multiple-face processing is automatic, mandatory (cannot be stopped once it has started), inattentive, parallel, and draws on minimal resources (Hansen & Hansen, 1988; Hershler & Hochstein, 2005). Translated into the language of SFT, this view predicts that processing of redundant target faces should proceed according to a parallel exhaustive or coactive architecture, with unlimited capacity or supercapacity. An alternative set of hypotheses postulates that faces are complex stimuli that pose high demands on our perceptual systems, and as such should be processed in a controlled, attentive, serial, and capacity-limited fashion (VanRullen, 2006). In the terminology of SFT, these hypotheses imply a serial, self-terminating system with limited capacity. The SFT analyses will allow us to decide between these two contrasting theoretical stances.

As noted earlier, the current study is not concerned with the dimensional integration of face parts per se, but rather with the detection of multiple faces presented at once. The imperative stimulus in earlier SFT studies (Algom & Fitousi, 2016; Donnelly et al., 2012; Fific & Townsend, 2010; Fitousi, 2015; Ingvalson & Wenger, 2005; Wenger & Townsend, 2006; Yang et al., 2018) consisted of a single face, for which the number and salience of constituent individual features or parts were manipulated, whereas here the number and salience of entire faces are manipulated. Nonetheless, it will still be possible to address the question of holism. If the perception of a single face is indeed holistic, in the sense that it is fast and automatic (Farah et al., 1998), then processing of two faces should be holistic as well—namely, held according to a parallel or coactive architecture, with unlimited capacity or even supercapacity. If, on the other hand, perception of a single face is analytic (Cheng et al., 2018; Fitousi, 2015, 2016, 2019b, 2020a), then processing of two faces should be serial and capacity limited.

Method

Participants

Six observers from Ariel University participated (all females, mean age = 22.66 years). They were compensated with a course credit. All of them reported normal or corrected-to-normal vision. The sample size was determined based on previous SFT studies (Fitousi & Algom, 2020; Townsend & Nozawa, 1995), where individual-by-individual analyses are performed on a large set of data from a relatively small number of observers (usually four observers). The approach is well justified here because any one of our observers can potentially employ one of several possible combinations of architecture, stopping rule, and capacity level. Moreover, averaging across observers can conceal the true model (Ashby et al., 1994; Siegler, 1987; Smith & Little, 2018).

Stimuli and apparatus

The stimuli consisted of two male faces with neutral expressions (see Fig. 2). These images were retrieved with permission from the Karolinska Directed Emotional Faces (KDEF) database (Lundqvist et al., 1998). The images were altered with the free GIMP software. Viewed at a fixed distance of 56 cm, the images subtended 2.0° of visual angle horizontally and 2.86° vertically. The faces were grayscale images presented over a black background. The fixation point was a white dot. Each face could serve as a target on either the left or the right side of fixation with equal probability. The distance of each face from fixation subtended 0.51° of visual angle.

Fig. 2 The two identities presented in the experiment

Design and procedure

The design of the experiment followed the classic DFP structure (Townsend & Nozawa, 1995). A schematic illustration of the design can be seen in Fig. 3. It consisted of two nested factorial designs. The outer design manipulated target (present, absent), face location (left, right), and salience (high, low) as factors. This design created the cells necessary for conducting the redundant target experiment and computing the Miller, Grice, and capacity measures. Salience was manipulated for single- and double-target cells, resulting in the following five types of trials: (1) no target, (2) single left target with high salience, (3) single left target with low salience, (4) single right target with high salience, and (5) single right target with low salience. The second (nested) inner design was applied to the double-target cell. This inner design, which applies only to double targets, is necessary for making inferences regarding architecture and stopping rule. Here, the salience of the left and right faces (targets) was increased or decreased independently. This yielded four types of trials: (a) both face targets were highly salient (HH); (b) both were of low salience (LL); (c) the left face was of low salience, while the right face was of high salience (LH); and (d) the left face was of high salience, while the right face was of low salience (HL). Note that the salience manipulation was applied to both double-target and single-target trials (Townsend & Nozawa, 1995) to achieve better experimental control.

Fig. 3 An application of the SFT to face detection. The design consisted of two nested factorial designs. The outer design manipulates the number of face targets (0, 1) at the left and right positions. The inner design manipulates the salience of the left and right face targets in the double-target cell. Manipulations of salience were also administered in the single-target conditions. The no-target cell consists of four types of trials, manipulating the appearance of masking pattern(s), but with no face target(s) whatsoever (see text). The double-target cells always consisted of two different identities. The appearance of the masking pattern here is not exactly as in the experiment due to differences in resolution between the two media. The brightness levels of masking patterns in the target and no-target displays were comparable

Each one of the identities in Fig. 2 could serve as a target. In the double-target condition, both faces (identities) were presented. On half of the trials, Person A appeared on the left side and Person B on the right side. On the other half, the locations of the identities were reversed, such that Person A appeared on the right side and Person B on the left side. It is important to note that the double-target trials always presented faces of two different identities (e.g., Person A, Person B). None of the double-target trials consisted of a redundant identity. The goal was to exclude redundancy due to sheer perceptual (low-level) sameness. In the single-target condition, one of the identities showed up either on the left or the right side. Each identity could appear equally often on each side, while the contralateral location remained empty. Participants were asked to search the display for a face, not for a particular identity. Redundancy in the current design thus relates to the basic level (face) rather than the subordinate level (e.g., Identity A). Another important point to highlight is that no distractors whatsoever were presented in the double-target or single-target displays.

The salience manipulation was administered using a masking pattern that reduced the visibility of the face in the low-salience condition. Extensive prior testing was carried out to find a pattern that would slow down performance and induce selective influence on the left and right faces, while keeping accuracy relatively high. The participants were tested individually in a dark and quiet room. They were instructed to give a positive response (YES) if they detected a target face on the left, the right, or both sides of fixation, and a negative response (NO) if they perceived no face. Participants pressed the “m” key for a YES response and the “z” key for a NO response. Participants were instructed to respond as fast and as accurately as they could.

In the original design of the SFT (Townsend & Nozawa, 1995), the task of the observer was to respond YES upon detecting a bright target or targets on the screen. However, in the context of the current study, there was a risk that participants would respond YES merely upon detecting changes in brightness on the screen, without any attempt to search for faces. The solution to this problem is to equate target-present and target-absent trials on brightness, such that observers cannot discriminate between the two types of trials based on brightness but must search for a face. Therefore, I presented the same masking pattern that was used in the salience manipulation also in the no-target trials. The pattern could appear equally often on the left side of fixation (0.25), on the right side (0.25), on both sides (0.25), or on neither side (0.25). This manipulation impelled participants to actively search for faces. Half of the trials in the experiment were no-target trials, with three-fourths of them including a masking pattern or patterns. Thus, participants could not know in advance, based on changes in brightness, whether a given trial included a target face or not. Although this manipulation slightly deviates from the original SFT design, it does not alter the logic sustaining the method nor its overall validity. All SFT computations are based on the double-target and single-target trials; the no-target condition plays no role in these calculations.

Testing took place on two consecutive days. A typical block consisted of 32 trials. Participants completed 40 blocks of trials on each day. In total, each participant completed 2,560 trials. To prevent a bias in responding to either of the stimuli, the frequencies of target and no-target trials were equated, as were the frequencies of single- and double-target trials. The probability of a no-target trial was 0.5, and the overall probability of a target-present trial was also 0.5. The various types of target-present trials were distributed as follows: The probability of a double-target trial was 0.25 (with equal probabilities of 0.0625 for HH, HL, LH, and LL trials). The probability of a left single-target trial was 0.125 (with equal probabilities of 0.0625 for the high-salience and low-salience targets). The probability of a right single-target trial was 0.125 (with equal probabilities of 0.0625 for the high-salience and low-salience targets).
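For concreteness, the following sketch builds one 32-trial block under the stated probabilities (16 no-target trials, 8 double-target trials, and 8 single-target trials); the cell labels are mine, not from the original materials:

```python
import random

block = (["no-target"] * 16                                       # p = .5
         + [f"double-{c}" for c in ("HH", "HL", "LH", "LL")] * 2  # p = .0625 each
         + ["left-H", "left-L", "right-H", "right-L"] * 2)        # p = .0625 each
assert len(block) == 32
random.shuffle(block)   # randomize trial order within the block
```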

Each trial began with a fixation point presented for 200 ms, followed by a screen that presented the stimuli (target or no target). In target trials, a grayscale image or images of a face or faces, with or without masking patterns (depending on trial type), were presented to the left and/or right of the fixation point. In no-target trials, either a black screen (with a fixation point at the center) or one of the three possible configurations of masking patterns described earlier appeared to the left and/or right of fixation (see Fig. 3). The stimuli remained on the screen until the participant responded with one of the two specified response keys (“m”, “z”). Latency and accuracy were recorded. No feedback was provided. The intertrial interval was 500 ms. A 1-min break was given after each block of trials.

Results

Response times longer than 2,800 ms or shorter than 150 ms were removed. This exclusion criterion led to the removal of less than 1% of the trials for each of the six observers. The RT analyses were confined to correct responses only. Error responses amounted to 7.5%, 4.6%, 5.7%, 4.4%, 5.1%, and 7.1% of the data for the six observers, respectively. Tables 2 and 3 give the mean RT and mean error rate in the experimental conditions. The four types of no-target trials were collapsed. The redundant target effect (RTE) was computed as the difference in performance between the double-target condition and the faster of the two single-target conditions (Miller, 1982). As can be seen in Table 2, none of the participants exhibited a significant RTE (all ps > .05). Participants did not reap a gain from the redundant presentation of two faces compared with a single face.
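A minimal sketch of the trimming rule and the RTE computation as described above (the condition arrays are hypothetical placeholders; the significance test is omitted):

```python
import numpy as np

def trim(rts, lo=150, hi=2800):
    """Remove RTs outside the 150-2,800 ms window."""
    rts = np.asarray(rts)
    return rts[(rts >= lo) & (rts <= hi)]

def rte(rt_double, rt_left, rt_right):
    """Redundant target effect: faster single-target mean minus double-target mean."""
    return min(trim(rt_left).mean(), trim(rt_right).mean()) - trim(rt_double).mean()
```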

Table 2 Mean RT (in ms) for double-target, single-target and no-target displays (H = high salience, L = low salience, with left-hand letter standing for left face target and right-hand letter standing for right face target)
Table 3 Error rates (in percentage) for double-target, single-target and no-target displays (H = high salience, L = low salience, with left-hand letter standing for left face target and right-hand letter standing for right face target)

Table 4 presents the analyses of the primary data of interest: double targets with varying levels of salience at each location. The results of the analysis of variance (ANOVA) are consistent across all participants. The main effects of salience at the left location and at the right location were highly significant. However, the most important feature of the data is the absence of a Left Location × Right Location interaction. These results are most consistent with a serial architecture.

Table 4 Selective influence tested with ANOVA on mean RT, for each of the observers (O)

Mean-based analyses

The MICs for the subset of redundant-target trials for each of the individual participants are presented in Fig. 4. A glance at Fig. 4 reveals that the functions are highly consistent across all observers. All exhibit an additive pattern, which was further supported by the ANOVA analyses. This pattern provides strong evidence for serial processing.

Fig. 4 The mean interaction contrast (MIC) patterns for each of the six participants

Distribution-based analyses

To allow further specification of the underlying architecture and stopping rule, I calculated the survivor interaction contrast (SIC) function for each participant. Figure 5 presents these SIC(t) functions. As can be noted, most of the SIC functions are at zero, a pattern that is consistent with a serial self-terminating architecture; however, for some observers, the SICs exhibit slight excursions above or below zero. These patterns might or might not reflect other architectures. To estimate the degree to which the SIC functions deviated from zero, I used a statistical test developed by Houpt and Townsend (2010). The test relies on two separate null-hypothesis tests: one examines whether the largest positive value of the SIC is significantly different from zero, and the second examines whether the largest negative value (in absolute magnitude) is significantly different from zero. The test uses the supremum of these differences to make the inference (see Houpt & Burns, 2017, p. 57). The following pairs of D^+ and D^− values were observed for the six participants, respectively: (0.10, 0.14), (0.12, 0.07), (0.16, 0.16), (0.16, 0.15), (0.11, 0.24), and (0.09, 0.15). All these values, except one (the SIC for Observer 5 was below 0 at some time, p = .007), were not significant (p > .05). These results point to SIC functions that are essentially zero and are most consistent with serial self-terminating processing.
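For illustration only, the two raw deviation statistics at the heart of this test can be read off an SIC(t) array such as the one sketched earlier; note that the full test additionally scales these deviations and compares them against the null distributions derived by Houpt and Townsend (2010), which are omitted here:

```python
# 'curve' is an SIC(t) array such as the one computed in the earlier sketch.
d_plus = max(curve.max(), 0.0)    # largest positive deviation of SIC(t) from zero
d_minus = max(-curve.min(), 0.0)  # largest negative deviation, in absolute magnitude
```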

Fig. 5 The survivor interaction contrast (SIC) patterns for each of the six participants

Selective influence: Ordering of survivor functions

A necessary (but not sufficient) condition for the validity of the SFT analyses is the presence of selective influence (Dzhafarov, 1999, 2003). I calculated the survivor functions for the four redundant-target conditions and tested whether they satisfied their predicted order. Figure 6 presents the four survivor functions for each participant. Note that the predicted ordering of the survivor functions, \( S_{\mathrm{LL}}(t) > S_{\mathrm{LH}}(t), S_{\mathrm{HL}}(t) > S_{\mathrm{HH}}(t) \), holds for each of the individual participants. This result supports the presence of selective influence in the data. Table 5 gives the outcome of a set of Kolmogorov–Smirnov tests performed on these survivor functions (Fitousi & Algom, 2018; Townsend & Nozawa, 1995). The results of these analyses further support selective influence in the data.
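The one-sided statistic can be computed directly from the empirical survivor functions, as in this sketch (my illustration; a significant positive D̂ for a pair such as LL vs. LH supports the predicted ordering, and critical values are omitted):

```python
import numpy as np

def d_hat(rt_slower, rt_faster, grid):
    """One-sided KS statistic: sup over t of S_slower(t) - S_faster(t)."""
    s_slow = np.array([(np.asarray(rt_slower) > t).mean() for t in grid])
    s_fast = np.array([(np.asarray(rt_faster) > t).mean() for t in grid])
    return (s_slow - s_fast).max()
```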

Fig. 6 Estimated survivor functions in the redundant-target salience conditions for each of the six observers

Table 5 One-sided Kolmogorov–Smirnov D̂ values computed on the survivor functions

Measuring capacity

I computed the capacity coefficient, C_OR(t), the Miller inequality, and the Grice inequality, separately for each participant. The three measures are presented in a unified capacity space (see Fig. 7), using the routine developed by Townsend and Eidels (2011). A glance at Fig. 7 reveals a consistent pattern for all participants. The capacity coefficient for all observers is smaller than 1 (C(t) < 1) at most values of t. Moreover, the C(t) values are often very close to 0.5, a pattern that indicates extremely limited capacity. This conclusion is further corroborated by the observation that the Grice inequality was violated by all participants, while the Miller inequality was generally not. Another source of evidence for limited-capacity processing is afforded by a test developed by Houpt and Townsend (2012). This test compares the observed RTs with those predicted by an unlimited-capacity independent parallel (UCIP) model. The z scores were all negative (−9.45, −8.81, −10.7, −8.80, −10.31, and −8.30) and significant (all ps < .001). These results provide strong evidence for C(t) smaller than 1 and support a system of serial, limited-capacity processing.

Fig. 7 Capacity coefficient space for each of the six participants

General discussion

Many researchers argue that face detection is mandatory, inattentive, parallel, and resource free (Hansen & Hansen, 1988; Hershler & Hochstein, 2005; Lavie et al., 2003). The current study casts doubt on these claims. The application of a rigorous, theory-driven approach—the SFT—uncovered four major results. First, participants did not benefit from the redundancy of faces in the display (no redundant target effect). Second, participants detected faces according to a serial architecture. Third, the operative stopping rule was self-termination. Fourth, detection was of limited capacity. These findings are inconsistent with the claims that multiple-face detection is automatic, mandatory, inattentive, parallel coactive, and unlimited or super in capacity (Hershler & Hochstein, 2005; Nothdurft, 1993; Suzuki & Cavanagh, 1995). The findings accord well with the alternative hypothesis—namely, that face detection is a serial and controlled process that demands considerable attention and resources (Bindemann et al., 2005; Fitousi, 2020a; Nothdurft, 1993; VanRullen, 2006).

The important message of the current study is that faces are subject to limitations on resources, even when the task at hand is relatively simple and accuracy levels are high. This conclusion also bears on the hotly debated issue of whether faces are processed holistically or analytically (Cheng et al., 2018; Fitousi, 2015, 2016; Richler et al., 2008; Rossion, 2013; Von Der Heide et al., 2018). According to the holistic account (Farah et al., 1998; Young et al., 1987), facial features or parts are perceived as global wholes rather than as independent entities. Faces are considered the epitome of Gestalt stimuli, stimuli for which the whole is greater than the sum of its parts (Rossion, 2013). It is assumed that the role of facial holism is to afford highly efficient processing (Maurer et al., 2002). If this were correct, we should have found parallel supercapacity processing with redundant-target faces. The current findings of serial, limited-capacity processing do not support those claims. If anything, they strengthen the opposite, analytic approach (Cheng et al., 2018; Fitousi, 2015, 2016, 2019b, 2020a).

The current findings have interesting implications for ensemble coding with sets of faces (Whitney & Yamanashi Leib, 2018). People can efficiently and correctly estimate the average emotion or gender of a set of faces within 500 ms. While they do not maintain a recollection of the individual faces, they can still create a rich and sophisticated representation of up to 16 faces at once. A candidate model for such a remarkable ability might be a parallel exhaustive supercapacity system. But the current results seem to be at odds with such a model. If a simple face detection task is so capacity limited, how can highly sophisticated ensemble coding take place? Future studies designed to address these issues might be able to answer this question.

Another issue that deserves a comment concerns the distinction often made between low-level and high-level features (Julesz, 1984; Treisman & Gelade, 1980). Given the current findings, it might be tempting to conclude that complex objects, such as faces, are processed in a serial limited-capacity mode, whereas simple features, such as lines and dots, are processed in a parallel and unlimited or supercapacity fashion. This maxim is one of the basic tenets of feature integration theory (Treisman & Gelade, 1980). However, this dichotomy is not well supported by SFT studies. In their seminal study, Townsend and Nozawa (1995) used simple luminance dots as stimuli. These low-level stimuli exhibited limited capacity in one experiment, but supercapacity in another (for review, see Blaha, 2017). Similarly, applying the SFT to low-level color patches, Fitousi (2019a) has documented a serial exhaustive system under one response regime (OR task), but a parallel-coactive system under another (AND task). The upshot is that we cannot conclusively argue for a distinction between low-level and high-level features in terms of architecture or capacity.

A potential criticism of the current conclusions might be that they are specific to the type of tasks and displays employed here. In a comprehensive literature review, Palermo and Rhodes (2007) surmised that whether faces exhibit limited or unlimited capacity depends on the type of task and displays used by researchers. They noted that studies employing high-perceptual-load displays often exhibited limited capacity, whereas studies deploying low-perceptual-load displays often found unlimited capacity. However, although the tasks and displays deployed in the current study exert low perceptual load, capacity was found to be limited. Moreover, unlike the paradigms reviewed by Palermo and Rhodes (2007), the methodologies administered here were specifically designed to measure capacity.

One can still argue that using a conjunctive (AND) task, in which participants respond to all the items in the display, could have resulted in supercapacity. Indeed, Blaha (2017) noticed that supercapacity often emerges in AND designs (Fitousi & Wenger, 2013; Houpt et al., 2014; Wenger & Townsend, 2006). However, Blaha’s conclusion might be premature. A recent study by Howard et al. (2021) raised concerns about the interpretation of the routine AND task and its associated capacity computations (Townsend & Wenger, 2004b). These authors showed that the no-target trials can influence how target trials are processed. To remedy this problem, they proposed a modification of the AND task and its associated capacity measures. A future goal, therefore, is to apply the modified AND task to faces and compare the results with the current findings.