Perception of so-called Mooney images is all or none: The black and white blobs either combine into a complete face in a single percept—often within a few hundred milliseconds—or remain fully independent and abstract (Mooney, 1957). The Mooney test has seen frequent use in both neuropsychological studies and studies of face processing, and it is referred to as the “Mooney closure test,” “Mooney face test,” or simply the “Mooney test” (Bruce & Young, 2012; Busigny, Joubert, Felician, Ceccaldi, & Rossion, 2010; Kanwisher, Tong, & Nakayama, 1998; Lansdell, 1968; Milner, Corkin, & Teuber, 1968; Verhallen et al., 2014; Wasserstein, Barr, Zappulla, & Rock, 2004). Within a normal population, there are substantial individual differences in the ability to perceive Mooney faces (Foreman, 1991; Verhallen et al., 2014).

However, Mooney’s original version of the test is short (40 items), is designed to be administered by personal interview, and is not suited for test–retest estimates of reliability. Moreover, the image set is heterogeneous and shows its age (the images were created from 1950s magazine clippings). To overcome these limitations we set out to construct—from scratch—a new, online, and extended version of the Mooney test. This new test measures the ability to detect a Mooney face from among two distractors, by asking participants to click on one of the eyes of the face.

There is abundant evidence that Mooney images engage the mechanisms of face perception. The N170 component of the event-related potential, which occurs specifically in response to the image of a face (Bentin, Allison, Puce, Perez, & McCarthy, 1996), is also observed in response to the presentation of Mooney faces; and when the Mooney face is consciously perceived, the amplitude of this component is increased (George, Jemel, Fiori, Chaby, & Renault, 2005; Jeffreys, 1989; Jemel, Pisani, Calabria, Crommelinck, & Bruyer, 2003; Latinus & Taylor, 2005). Furthermore, the highly face-selective fusiform face area shows increased activity upon the conscious perception of a Mooney face, as compared to the failure to perceive it (Andrews & Schluppeck, 2004; Kanwisher et al., 1998; Rossion, Dricot, Goebel, & Busigny, 2011).

However, do the large individual differences on the Mooney test arise from differences in the specific processes of face perception, or do they rather reflect differences in “closure”—a process of perceptual organization that precedes perception of the face? It is curious, for example, that males outperform females on the Mooney test (Foreman, 1991; Verhallen et al., 2014), whereas, if a sex difference is observed in other tests of face processing, it is in favor of females (Megreya, Bindemann, & Havard, 2011). Moreover, we have found (Verhallen et al., in preparation) that performance on a three-alternative forced-choice (3AFC) version of the original Mooney test does not correlate very strongly (Spearman’s ρ = .21) with performance on a test of face discrimination (the Glasgow Face Matching Test; Burton, White, & McNeill, 2010), and correlates only modestly (Spearman’s ρ = .31) with performance on a test of face recognition (the Cambridge Face Memory Test; Duchaine & Nakayama, 2006).

Nevertheless, it is theoretically difficult to separate the processes of perceptual organization from those of face processing in the case of Mooney images. When more conventional stimuli are presented, it is possible to envisage the two processes as sequential: Perceptual organization may be driven by low-level features such as similarity, proximity, and the presence of T-junctions, and then object recognition may follow. But in the two-tone Mooney images, most of the low-level, Gestalt-prompting features are absent. There are no T-junctions, for example (Moore & Cavanagh, 1998), and the individual features of a face—eyes, nose, and mouth—are seldom independently apparent in a Mooney image. To detect the face, the observer must construct a specific, three-dimensional model of both the face and the lighting. The perception is of concave and convex regions, with cast and attached shadows. The underlying processes are likely to be top-down, and they must surely draw upon the observer’s stored knowledge of faces, acquired over a lifetime.

With the advent and improvement of software for the manipulation of digital images, the conversion of photographs into two-tone “Mooney” images has become rather easy. Several authors have created their own Mooney stimuli for specific experimental purposes (e.g., Jemel et al., 2003; McKeeff & Tong, 2007; Rossion et al., 2011). Our own purpose was to create a standardized, online, quick, and reliable new version of the Mooney test, for use by the wider academic community. However, as with any psychological test, it is not trivial to develop a reliable and internally balanced Mooney test. We describe our method of creating the stimulus set and report the results of two testing phases: An initial selection phase (316 trials, N = 374) to gather data to use subsequently in selecting the final stimulus set, and a testing phase using the final stimulus set (144 trials, N = 505) in order to establish test–retest reliability and to gather population statistics.

Performance on tests of face processing is known to depend on the race and the sex of the faces used as the stimuli, as well as on the race and the sex of the observer looking at the faces. For example, the “other-race effect” describes the impaired recognition of faces of people belonging to a different race relative to the recognition of faces of the participant’s own race (Meissner & Brigham, 2001). It remains unclear to what extent the Mooney test taps face recognition ability (see above); nevertheless, we limited ourselves to Caucasian faces in the development of our test, although our participant sample was not confined to Caucasians. Since the female advantage in face-processing studies tends to be restricted to female faces (Megreya et al., 2011; Rehnman & Herlitz, 2007; Sommer, Hildebrandt, Kunina-Habenicht, Schacht, & Wilhelm, 2013), we included equal numbers of female and male face stimuli in our test.

Development of the test materials

Volunteers for face photographs

A total of 158 Caucasian volunteers (50 % female; mean age 28 years, ranging from 20 to 80 years) were recruited from the Cambridge, UK, area via social media, online notice boards, and electronic mailing lists. In exchange for their help, volunteers were offered the possibility of having a professional studio portrait taken of themselves for their own, unrestricted use. Ethical permission for the study was given by the Cambridge University Psychology Research Ethics Committee.

Materials and procedure

We invited each volunteer to our photography studio, where we took their portrait from different angles while varying the direction and brightness of the studio lamps. The volunteers did not wear any items that would occlude the face or parts thereof, such as glasses, hats, and so forth. The background was either black or white, and these two options were alternated across volunteers. In Adobe Photoshop, a Gaussian blur of ten pixels was applied to all photographs to slightly reduce the amount of detail in the image. The photographs were subsequently converted to two-tone “Mooney images” using a threshold procedure: Any pixels with a luminance above the threshold value were converted to pure white, and any pixels with a luminance below were converted to pure black. A fixed threshold value of 110—from the range 0 (black) to 255 (white)—was used for all photographs, though variation remained in the ratio of black to white areas in the resulting Mooney images, owing to the variability in studio lighting, poses, and skin tone. This variation was desirable, since it gave rise to a diverse set of target images. For each volunteer, we selected two images and cropped them to limit the contextual information. We thus created 316 target images in total. Distractor images were created through six custom-made procedures in Adobe Photoshop, with various combinations of rotation, polarity inversion, and superposition of the target image (see Fig. 1A). Thus, for each target image, the six custom-made procedures yielded six distractor images, from which we selected two to accompany the target image. In this way, we created 316 three-alternative forced-choice test items (see Fig. 1B); Fig. 2 shows an additional three test items.
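As an illustration of the blur-and-threshold step described above, the following is a minimal sketch in Python using Pillow and NumPy. The actual stimuli were produced in Adobe Photoshop; reading the "ten pixels" of blur as a Gaussian radius of 10, working on an 8-bit luminance channel, and the file names are all assumptions made for this sketch.

```python
# Minimal sketch of the blur-and-threshold conversion (assumptions noted above).
from PIL import Image, ImageFilter
import numpy as np

BLUR_RADIUS = 10   # "Gaussian blur of ten pixels", read here as radius = 10
THRESHOLD = 110    # fixed threshold on the 8-bit scale, 0 = black, 255 = white

def to_mooney(path_in: str, path_out: str) -> None:
    """Convert a portrait photograph into a two-tone Mooney image."""
    img = Image.open(path_in).convert("L")                    # 8-bit luminance
    img = img.filter(ImageFilter.GaussianBlur(BLUR_RADIUS))   # slightly reduce detail
    pixels = np.asarray(img)
    mooney = np.where(pixels > THRESHOLD, 255, 0).astype(np.uint8)  # above -> white, below -> black
    Image.fromarray(mooney).save(path_out)

# Hypothetical usage:
# to_mooney("volunteer_042_pose1.jpg", "volunteer_042_pose1_mooney.png")
```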

Fig. 1
figure 1

(A1–6) The individual steps (from left to right) for each of our six custom-made procedures to create distractor images. All distractor images were created from the original, noncropped target image (farthest left image in each row) in order to have more information to work with. The six custom procedures yielded six distractor images for each target image: The green, solid rectangles indicate the final crop of the two distractor images selected for the 3AFC stimulus (rows 2 and 3; see panel B for the final stimulus triplet); red, dashed rectangles indicate the final crops of the distractor images not selected (rows 1, 4, 5, and 6). The sequence of manipulations for each procedure was as follows. Row 1: Overlay a copy of the original image (using only the black parts of the original image) and translate it rightward by 115 pixels and upward by 130 pixels (all original images were 4,032 pixels wide and 6,048 pixels high); overlay another copy of the original image (using only the white parts), and translate it rightward by 488 pixels and downward by 110 pixels; overlay another copy of the original image (using only the black parts), and translate it rightward by 488 pixels and upward by 410 pixels; overlay another copy of the original image (using only the black parts), and translate it rightward by 603 pixels and upward by 540 pixels. Row 2: Flip the original image horizontally; overlay a (horizontally flipped) copy of the original image using “subtraction” (any white areas overlapping a white area become black; any black areas overlapping a white area become white; any black areas overlapping a black area remain black), and translate it rightward by 213 pixels and upward by 316 pixels; overlay another copy of the original image (not flipped horizontally) using “subtraction”; and translate this last copy leftward by 331 pixels and downward by 263 pixels. Row 3: Overlay a copy of the original image and rotate it clockwise by 11°; overlay another copy of the original image (using only the black parts) and rotate it clockwise by 2°; overlay another copy of the original image and delete all white areas adjoining the white area at the top of the original image; flip this second copy vertically, and translate it rightward by 353 pixels and upward by 3,145 pixels. Row 4: Overlay a copy of the original image (using only the white parts) and translate it leftward by 253 pixels and downward by 726 pixels; overlay another copy of the original image (using only the white parts), flip it vertically, and translate it rightward by 149 pixels and upward by 270 pixels. Row 5: Invert the image polarity, and rotate the image 180°. Row 6: Invert the image polarity; overlay a copy of the (polarity-inverted) original image (using only the black parts), and flip it horizontally; flip the overlaid copy vertically; rotate the overlaid copy counterclockwise by 29°. We manually cropped both the final distractor images (farthest right image of each row), as well as the target image, as indicated by the overlaid rectangles. This served to limit contextual information and to isolate a suitable area of the image: The face area for the target images, and an area of ample variation in the black-and-white patches for the distractor images. The cropped images were—owing to being cropped by hand—of different dimensions. However, simply resizing the distractor images to the same dimensions as their corresponding target image could give rise to differences in scaling and amount of detail. 
In order to prevent these differences from being informative, the distractor images were cropped (instead of resized) once more, this time to the dimensions of their target image. (B) One of our three-alternative forced-choice test items, featuring the target image on the left, and its two accompanying distractors in the middle (the distractor image from panel A, row 3) and on the right (the distractor image from panel A, row 2)
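Two of the operations listed in the Fig. 1 caption can be expressed compactly in code. The sketch below covers the row 5 procedure (invert polarity, rotate by 180°) and the binary “subtraction” blend; reading that blend as an exclusive-or of the two two-tone images is an interpretation of the three cases given in the caption, and the translations, rotations, and crops of the other rows are omitted.

```python
# Sketch of two operations from Fig. 1, for two-tone images (0 = black, 255 = white).
# The "subtraction" blend is interpreted here as an exclusive-or, which reproduces
# the three cases listed in the caption; the stimuli themselves were made with
# custom Photoshop procedures, so this is only an approximation.
from PIL import Image, ImageChops
import numpy as np

def distractor_row5(mooney: Image.Image) -> Image.Image:
    """Row 5: invert the image polarity and rotate the image by 180 degrees."""
    return ImageChops.invert(mooney.convert("L")).rotate(180)

def subtract_overlay(base: Image.Image, overlay: Image.Image) -> Image.Image:
    """Binary 'subtraction': white on white -> black, black on white -> white,
    black on black -> black (an exclusive-or for two-tone images)."""
    a = np.asarray(base.convert("L")) > 127
    b = np.asarray(overlay.convert("L")) > 127
    return Image.fromarray((np.logical_xor(a, b) * 255).astype(np.uint8))
```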

Fig. 2
figure 2

Three additional examples of test items

In a previous study (Verhallen et al., 2014)—using the original 40 Mooney stimuli (Mooney, 1957) in a 3AFC paradigm—we observed a marked ceiling effect in performance, in that almost 10 % of the participants (total N = 397) reached the maximal score. We hypothesized that the participants were sometimes able to respond correctly using cues other than the actual percept of the face—for example, the extent to which the shapes in the images were potentially organic and therefore likely to be part of a face. Alternatively, the participant might try to rule out which two images were not organic or face-like. Thus, participants could respond “correctly” without perceiving the face. In order to avoid this in our new test, we asked participants to respond not just by clicking on the panel (out of three) that showed a face, but rather by clicking on either of the eyes of the face. We quantified the responses by dividing the target image into a 6 × 9 grid of squares: Horizontally, the eyes always fell within columns 2 to 5 inclusive, and vertically, the eyes fell only in one of the four rows C, D, E, or F (see Fig. 3). We used the original photograph to determine the correct eye region for each item. Participants were not informed about this method of quantifying the response, or about the fact that there were only four possible correct regions, but were merely instructed to find the face and to click on either of the eyes.

Fig. 3
figure 3

One of the target images with—overlaid as solid gray lines—the 6 × 9 square grid that was used to quantify responses. The thicker, dashed green lines mark the four rectangular regions in which the eyes could be located: Rows C, D, E, and F (in this case, the correct region is row E), always extending from columns 2 to 5 inclusive. Participants were not told that there were limitations on the possible location of the eyes

For each trial, we recorded the coordinates (in pixels) of the location where the participants clicked with the mouse, and thus we could determine whether the pixel that the participant had clicked was located both within the correct columns (2–5) and in the correct row (C, D, E, or F, depending on the item). If the participant’s click was indeed located in both the correct columns and the correct row for that particular item, the response was recorded as correct; if the participant had clicked on the target image but not within the correct “eye region,” or had clicked on one of the distractor images, the response was labeled as incorrect.
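A compact sketch of this scoring rule is given below. It assumes the 192 × 288 px target images mentioned later in the Results, giving 32 × 32 px grid squares, rows labelled A to I from the top, and click coordinates measured from the top-left corner of the clicked panel; those conventions are assumptions, and only the rule itself (columns 2 to 5, one correct row per item) comes from the text.

```python
# Hedged sketch of the click-scoring rule (grid conventions are assumptions).
CELL = 32           # 192 px / 6 columns = 288 px / 9 rows = 32 px per square
ROWS = "ABCDEFGHI"  # row labels, assumed to run A-I from the top of the image

def score_click(x: int, y: int, clicked_target: bool, correct_row: str) -> bool:
    """Return True if a click at (x, y) counts as correct for this item."""
    if not clicked_target:              # click landed on a distractor panel
        return False
    col = x // CELL + 1                 # 1-based column index
    row = ROWS[min(y // CELL, len(ROWS) - 1)]  # row letter, clamped to the grid
    return 2 <= col <= 5 and row == correct_row

# e.g. score_click(x=100, y=150, clicked_target=True, correct_row="E") -> True
#      (column 4, row E)
```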

We set out to balance our first stimulus set (316 items) on four variables: (1) equal numbers of stimuli for each of the three possible positions of the target image (the positions being the left, the middle, or the right panel); (2) for each of these three target image positions, equal numbers of stimuli for each of the four possible eye regions (rows C, D, E, or F); (3) for each of the aforementioned possibilities, equal numbers of stimuli for both sexes of the volunteers depicted (female or male); and (4) for each of the aforementioned possibilities, equal numbers of stimuli for the types of background used in the photograph (black or white). The above criteria were not perfectly fulfilled in our set of 316 items: An exact division of the number of photographed volunteers (158) by the number of options (3 × 4 × 2 × 2 = 48) was not possible. However, since the main goal of the first testing phase was stimulus selection, we preferred to use a set that was not perfectly balanced, as opposed to not using all possible images in this initial testing phase.

Selection of the final stimulus set

Using the results from the first testing phase (see “Results” section), we were able to narrow down our selection of 3AFC items from 316 to 144 (the final stimulus set), thereby reducing the overall testing time while retaining the most informative stimuli. We made our selection by fitting a two-component model from item response theory (Embretson & Reise, 2000; Nunnally & Bernstein, 1994) to the performance data, keeping the third component (the guessing parameter) fixed at zero, since all items were identical in arrangement (3AFC) and response (clicking on either of the eyes). The two components of the item response theory model that we calculated for each item were the so-called difficulty and discrimination indices. The former index \( \left(\frac{\Phi^{-1}(p)}{r}\right) \) was calculated by dividing the z score (obtained from the inverse cumulative normal distribution) of the proportion of participants who scored correctly [\( \Phi^{-1}(p) \)] by the correlation between the score—of all participants—on this particular item and the overall score (r). Item discrimination \( \left(\frac{r}{\sqrt{1-{r}^2}}\right) \) was calculated by dividing the correlation between the score on that particular item and the overall score (r; also referred to as the point-biserial correlation) by the square root of the variance not explained by this correlation \( \left(\sqrt{1-{r}^2}\right) \). The discrimination index is a measure of how well the item (and performance on that item) can discriminate between participants of differing ability. This index was the primary criterion in selecting our new stimulus set—for every item, the value of the discrimination index should be as close to 1 as possible. We sought a large range on the difficulty index and hence did not restrict it. The second criterion was the requirement to balance the stimulus set on the variables described before (see “Materials and procedure” section).
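The two indices can be computed directly from a binary response matrix, as in the sketch below. Whether the item-total correlation excluded the item itself (a corrected item-total correlation) is not stated; the plain point-biserial item-total correlation is used here.

```python
# Sketch of the difficulty and discrimination indices defined above.
import numpy as np
from scipy.stats import norm

def item_indices(responses: np.ndarray):
    """responses: 0/1 array with rows = participants and columns = items."""
    total = responses.sum(axis=1)                  # overall score per participant
    p = responses.mean(axis=0)                     # proportion correct per item
    r = np.array([np.corrcoef(responses[:, j], total)[0, 1]
                  for j in range(responses.shape[1])])  # item-total (point-biserial) correlation
    difficulty = norm.ppf(p) / r                   # z score of p, divided by r
    discrimination = r / np.sqrt(1 - r ** 2)
    return difficulty, discrimination
```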

Unfortunately, simply selecting the set of 144 items with discrimination index values closest to 1 did not satisfy our balancing criteria (described above in “Materials and procedure” section). We were thus forced to swap some items with discrimination indices closest to 1 for items whose value was less close to 1. In order to reduce the number of swaps, we tried to limit the number of variables that needed balancing during item selection: We decided to disregard for the moment the position of the target image among the three panels (whether the target image was the left, middle, or right panel). Instead, after item selection, we reshuffled the positioning of the target and distractor images within each 3AFC item—but not across items, since the difficulty of the stimuli was likely to depend on both the target image and its accompanying distractors—until again our set contained equal numbers of items per target image position (left, middle, or right panel). As for the other three variables, we left unchanged the target images themselves, and thus kept the eye region of the images fixed; the sex of the volunteer depicted in the target image was inherently fixed; and the type of background was disregarded completely, because it proved to have a nonsignificant influence (see “Results” section). With these conditions specified, we selected a set of 144 items for which the discrimination index value for all individual items lay as close as possible to 1, and the set as a whole was balanced on Variables 1 to 3 described at the end of the previous section.
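As an illustration of the balancing constraint (not the swapping procedure itself), one could select, within each combination of eye region and volunteer sex, the items whose discrimination index lies closest to 1; with 144 items and 4 × 2 = 8 cells, that is 18 items per cell. The column names below are hypothetical.

```python
# Illustration only: per-cell selection of items with discrimination closest to 1.
import pandas as pd

def select_balanced(items: pd.DataFrame, per_cell: int = 18) -> pd.DataFrame:
    """items: one row per 3AFC item, with (hypothetical) columns
    'eye_row' (C/D/E/F), 'sex' (F/M), and 'discrimination'."""
    ranked = items.assign(dist=(items["discrimination"] - 1).abs()).sort_values("dist")
    return ranked.groupby(["eye_row", "sex"], group_keys=False).head(per_cell).drop(columns="dist")
```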

Since a Mooney face, once perceived, seems to be easily found upon repeat presentation, a test–retest paradigm using identical image sets would not be informative. We thus split the new selection into two different parts of equal length (“A” and “B,” each comprising 72 items) to allow for test–retest. Test Parts A and B were each balanced on the same variables as the overall stimulus set of 144 items.

First phase of testing: 316 items

Participants

A total of 374 participants (57 % female) from varying ethnic groups (though predominantly white: 86 %), whose ages ranged from 18 to 68 (M = 26 years), were recruited via word of mouth, social media, online notice boards, and electronic mailing lists. Ethical permission for the study was given by the Cambridge University Psychology Research Ethics Committee.

Materials

The stimulus set of 316 items was used in the first phase of testing (see “Materials and procedure” section).

Procedure

Of our sample, 330 participants (59 % female) completed our test online, while the other 44 participants (45 % female) completed the test in the lab. The procedure for both was identical: We asked participants to give basic demographic information and subjectively to rate their face recognition ability in response to the question “On a scale of 1 to 10 (with 1 being really bad, and 10 being really good), where would you place yourself in terms of recognising faces?” The subsequent instruction screen informed participants that their goal would be to identify the face and asked them to respond by clicking on either of the eyes of the face. Trials did not have a time limit, in order to make sure that we would have response data for every trial, although participants were instructed to respond as quickly as possible. The stimulus remained on the screen until participants responded. An interstimulus interval consisting of a blank screen, but without a fixation cross, was presented for 500 ms between trials. A practice trial with feedback preceded the first of four blocks of 79 trials; no feedback was given on the test trials. Between blocks, participants could take a break of indefinite length. For the online sample, all stimuli were downloaded to the participant’s computer before the test trials began, to ensure that they did not experience a lag between or during trials.

Results

Before analyzing the data, we visually inspected the responses to each item. Since we recorded the exact x- and y-coordinates of participants’ mouse clicks, we could overlay the responses on to the stimuli (see Fig. 4), and thereby could verify the position of the eye region that we had defined during stimulus development. A lack of detail is inherent to Mooney images—the eyes might be embedded in a larger shadowed area—and our definition of the eye region (derived from the original photograph; see “Materials and procedure” section) did not always map well on to the responses of the majority of participants (see Fig. 4B). We thus translated—for 68 out of the 316 items—the eye region up or down by between one and 15 pixels (all target and distractor images were 192 pixels wide and 288 pixels high) and recalculated our performance measure for all participants using these new coordinates of the correct eye region (see Fig. 4C). We used this new performance measure throughout our analyses that follow. We also translated the actual target images for future use, to avoid the need for translating the responses. The final selection of 144 items included 38 of these translated items.

Fig. 4
figure 4

Visual inspection of target images. (A) One of our Mooney target images included in the first test phase (though not in the final set of 144 items). (B) The target image with the responses of 374 participants overlaid. Every symbol represents a participant: The orange, solid circles are responses within the eye region that was initially labeled as correct, and these responses were thus initially deemed correct; the blue, open triangles are responses outside this eye region, and thus were initially deemed incorrect. Also overlaid is the grid of 6 × 9 squares that was used to localize the eye region during stimulus development; in this image, row C is the correct row. (C) The grid of 6 × 9 squares has been translated by –7 pixels (a downward translation), since participants’ responses indicated a slight discrepancy between the labeled eye region and the perceived eye region. Participants’ responses have been recalculated to reflect the relocation of the correct eye region; the symbols reflect this recalculation, and the final performance measure is based on these recalculated responses

The mean performance of the subset of 44 participants who completed our test in the lab did not differ significantly from that of the sample of 330 participants who completed the test online (Mann–Whitney U = 7,646.5, p = .57). We therefore combined the two samples, using all 374 participants in all subsequent analyses.

In our first testing phase, for our stimulus set of 316 items, the distribution of performance was wide but negatively skewed: The mean score was 77.9 % correct (SD = 16.6 %), with a range from 3.5 % to 97.5 % (see Fig. 5A). Since the lower end of the performance range was surprising (3.5 % correct), we investigated an alternative performance measure: A score based on clicking the correct target image, regardless of whether the click fell within the correct eye region. The comparison of this measure with our eye-clicking performance measure showed that the majority of the participants in the extremely low range of the latter measure performed reasonably well on the former. Additionally, feedback from a number of participants suggested that some of the participants had forgotten the instructions to click on either of the eyes to respond, and instead had clicked merely on the image that contained the face. However, we could not reliably separate the participants who had forgotten the instructions from those whose performance genuinely lay at the lower end, and since these data were used for trial selection only, and since participants who had forgotten the instructions were probably consistent in their forgetting, we did not exclude any participants.

Fig. 5
figure 5

(A) Distribution of scores (in percentages) for the participants. The dashed vertical line indicates the mean. (B) Distribution of scores (in percentages of participants scoring correctly) for the items. The dashed vertical line indicates the mean

We observed a significant sex difference favoring males (Mann–Whitney U = 12,279, \( p = 2.87 \times 10^{-6} \); \( \overline{x}_{\text{fem}} \) = 75.1 %, \( \overline{x}_{\text{mal}} \) = 81.7 %—a difference of 0.40 SDs), confirming previous findings using the classical Mooney stimuli (Foreman, 1991; Verhallen et al., 2014). We did not observe a significant correlation between performance and age (Spearman’s ρ = .04, p = .47), even when sex was regressed out from both variables (Spearman’s ρ = .01, p = .84). Participants’ subjectively rated ability for “recognizing faces” (M = 6.8, SD = 1.6, range 1–10) did not correlate significantly with their Mooney performance (Spearman’s ρ = .07, p = .17). We did not observe a significant difference in performance between ethnic groups (Kruskal–Wallis \( \chi^2 \) = 1.81, p = .77), even when we grouped all non-White participants (since the target images depicted Caucasian volunteers only) and compared them to the group of White participants (Mann–Whitney U = 8,169.5, p = .78). However, the group sizes in our sample were very disparate (\( N_{\text{Asian}} \) = 24, \( N_{\text{Black}} \) = 4, \( N_{\text{mixed}} \) = 19, \( N_{\text{other}} \) = 5, \( N_{\text{White}} \) = 322), and if the other-race effect is an effect of training, the ethnicity of the observer would not be the best measure of exposure to faces of a specific ethnic group—instead, country of birth and country of residence should be used.

The distribution of performance per item (i.e., the percentages of participants scoring correctly) showed a large cluster around 80 %–95 % (see Fig. 5B), although no item was solved by every participant. We observed no significant difference in performance for stimuli featuring black backgrounds as compared to those featuring white backgrounds (Wilcoxon signed-rank W = 75,339, p = .068). We did observe a significant difference in participants’ performance for stimuli depicting a female volunteer relative to that for stimuli depicting a male volunteer (W = 60,476.5, \( p = 2.05 \times 10^{-48} \)): Participants scored on average 4.7 % (or 0.28 SDs) higher for images depicting females (see Table 1). However, there did not appear to be a strong interaction between sex of participant and sex of the target face: The advantage for female over male faces was only slightly greater for female than for male participants (5.2 % vs. 4.2 %, respectively).

Table 1 First test phase: Performance (in percentage) presented separately for the two sexes of the volunteers depicted in the target image (“Female Volunteer” vs. “Male Volunteer”), and broken down by sex of the participant

We observed a significant difference in performance across target image locations (Friedman’s \( \chi^2 \) = 343.06, \( p = 3.20 \times 10^{-75} \)), whereby performance was highest for target images in the middle panel (\( \overline{x} \) = 82.3 %), intermediate for those in the left panel (\( \overline{x} \) = 77.4 %), and lowest for those in the right panel (\( \overline{x} \) = 73.7 %). Another significant difference was found for participants’ performance across the four different eye regions (Friedman’s \( \chi^2 \) = 735.23, \( p = 4.82 \times 10^{-159} \)): Performance was highest for row E (\( \overline{x} \) = 84.8 %), lower for row D (\( \overline{x} \) = 82.3 %), lower still for row C (\( \overline{x} \) = 76.8 %), and lowest for row F (\( \overline{x} \) = 67.6 %)—all group differences were significant. However, in the development phase of the experiment, there was no independent means of quantifying the difficulty of a target image or a 3AFC item; hence, difficulty could not be controlled for. The observed differences in performance thus could be inherent to the stimuli, rather than a product of the sex of the volunteer or the location of the eyes.

Since we created two Mooney target images for each photographed volunteer, we investigated a potential priming effect. For each participant, and for every volunteer pair for which the participant had correctly responded to the first image, we computed the mean performance on the second image of the pair and—as a comparison—the mean performance on all other volunteer pairs. Across participants, their mean performance on a volunteer pair’s second image (\( \overline{x} \) = 80.9 %) was significantly higher than their mean performance for all other images (\( \overline{x} \) = 77.9 %; Mann–Whitney U = 80,548, p = .0006). To verify that this result was not an effect of image similarity within volunteer pairs—that is, if images of the same volunteer tended to be similar in difficulty, then selecting only those pairs for which the participant had scored correctly on the first image would yield a skewed measure—we correlated performance between the two images of volunteer pairs. We did not observe a significant correlation (Spearman’s ρ = .09, p = .26).
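For concreteness, the sketch below shows one way to compute, per participant, the two quantities compared here, assuming a long-format table with one row per trial; the column names, and the reading of "all other volunteer pairs" as every remaining trial for that participant, are assumptions.

```python
# Hedged sketch of the priming comparison (data layout and column names assumed).
import pandas as pd

def priming_means(trials: pd.DataFrame) -> pd.DataFrame:
    """trials: columns 'participant', 'volunteer', 'presentation' (1 or 2, the order
    of that volunteer's two images for this participant), and 'correct' (0/1)."""
    rows = []
    for pid, d in trials.groupby("participant"):
        first = d.loc[d["presentation"] == 1].set_index("volunteer")["correct"]
        primed = d[(d["presentation"] == 2) & (d["volunteer"].map(first) == 1)]
        others = d.drop(primed.index)          # all remaining trials for this participant
        rows.append({"participant": pid,
                     "second_after_correct_first": primed["correct"].mean(),
                     "all_other_images": others["correct"].mean()})
    return pd.DataFrame(rows)
```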

Second phase of testing: 144 items

Participants

A total of 505 participants (61 % female) from varying ethnic groups (again predominantly white: 84 %), whose ages ranged from 18 to 70 (M = 27 years), were recruited via word of mouth, social media, online notice boards, and electronic mailing lists. There was no overlap of participants between the samples of the two testing phases after we had excluded 43 participants from the second phase who had previously taken part in the first phase (their overall performance was significantly higher, by 6.4 %, or 0.46 SDs, than the performance of the 505 remaining participants who had not taken the test before: Mann–Whitney U = 13,681, p = .0046). Ethical permission for the study was given by the Cambridge University Psychology Research Ethics Committee.

Materials

The final stimulus set of 144 items was used in a test–retest paradigm (see “Selection of the final stimulus set” section for the selection procedure).

Procedure

The second testing phase was conducted entirely online. The procedure was very similar to that of the first phase: Participants were asked to supply basic demographic information and to rate subjectively their face recognition ability, after which they were shown the same instruction screen as in Phase 1. Again, the stimuli remained on the screen until participants had responded, and an interstimulus interval consisting of a blank screen (with no fixation cross) was presented for 500 ms between trials.

However, we now also asked participants to indicate the “country where [they] grew up (or where [they] spent most time until age 18)” using a drop-down menu that listed all countries in the world (ISO 3166 standard; retrieved 19 June, 2014—www.iso.org/obp/). Additionally, we asked participants for their handedness (“left,” “right,” or “both”) in response to the question “Which hand do you write with?” We also asked participants to indicate whether they had “taken (a version of) this test before,” and if the response was “yes,” how long ago they had taken it (Less than 1 week, Less than 1 month, Less than 3 months, Less than 6 months, Less than 1 year, or More than 1 year).

In contrast to the first test phase, the second phase of testing followed a test–retest paradigm: Participants completed Parts A and B in a randomly assigned order, with a minimum interval of three days. Both parts consisted of two blocks of 36 trials each; participants could take breaks of indefinite length between the blocks. The results from Phase 1 had suggested that—as the test progressed—some participants forgot the instructions to click on the eyes of the face, and instead merely clicked on the target image containing the face (see “Results” section of Phase 1). Hence, in Phase 2, a different practice trial with feedback preceded each block of 36 trials, instead of only one practice trial preceding the entire test. Again, no feedback was given on test trials.

Results

The distribution of performance was again wide and negatively skewed: The mean score was 77.6 % correct (SD = 14.1 %), with a range from 36.8 % to 98.6 % (see Fig. 6). No participant hit ceiling, and no single item was solved by all participants. We continued to observe a significant sex difference favoring males (Mann–Whitney U = 22,287.5, \( p = 5.64 \times 10^{-7} \); \( \overline{x}_{\text{fem}} \) = 75.3 %, \( \overline{x}_{\text{mal}} \) = 81.2 %—a difference of 0.42 SDs; see Fig. 6). Again, we observed no significant difference between ethnic groups (Kruskal–Wallis \( \chi^2 \) = 4.44, p = .49), though the group sizes remained disparate (\( N_{\text{E-Asian}} \) = 19, \( N_{\text{S-Asian}} \) = 16, \( N_{\text{Black}} \) = 2, \( N_{\text{mixed}} \) = 28, \( N_{\text{other}} \) = 14, \( N_{\text{White}} \) = 426). Participants originated from 52 countries, with the largest group (49 %) coming from the United Kingdom. As was the case for ethnicity, we did not observe a significant difference in performance when countries were pooled into five distinct groups: African, Arabic, Asian, Caucasian, and South American (Kruskal–Wallis \( \chi^2 \) = 3.02, p = .55). However, this time we did observe a significant, though modest, correlation of participants’ subjectively rated ability with their performance (Spearman’s ρ = .12, p = .006).

Fig. 6
figure 6

Density plot of scores (in percentages) for the entire population (black, solid line) and separately for each of the sexes: An orange, dashed line for females, and a blue, dot-dashed line for males. The three vertical dotted lines indicate—from left to right—the mean for female participants (75.3 %), the population mean (77.6 %), and the mean for male participants (81.2 %)

Our 505 participants completed all 144 trials (i.e., the sum of both sessions) in 9.85 min on average (SD = 4.25 min), with a range of 3.8 to 37 min; two outliers were removed from these statistics, because these two participants took 4 h and 16 h, respectively, to complete the test. We found a modest, but significant, negative correlation between performance and the average time taken per trial, with the influence of sex removed from both variables by means of linear regression (Spearman’s ρ = −.14, p = .002). On the trials in which participants responded correctly, they took on average 3,841 ms to respond (i.e., to reach closure), although this measure includes the time it took to move the mouse and click on the eyes. In contrast, participants took on average 6,258 ms to respond incorrectly.

As in the first test phase, we continued to observe a significant difference in performance across target image locations (Friedman’s \( \chi^2 \) = 333.89, \( p = 3.16 \times 10^{-73} \)), although no longer for all combinations: The difference between the right (\( \overline{x} \) = 71.8 %) and left (\( \overline{x} \) = 79.5 %) panels, as well as that between the right and middle (\( \overline{x} \) = 80.5 %) panels, was again significant, but the difference between the left and middle panels was not. We also continued to observe a significant difference in participants’ performance across the four different eye regions (Friedman’s \( \chi^2 \) = 910, \( p = 6.00 \times 10^{-197} \)), but now this was true only when we compared the outer two rows (C and F) to the middle two rows (D and E); no significant difference in performance was apparent between rows C (\( \overline{x} \) = 70.8 %) and F (\( \overline{x} \) = 70.9 %), nor between rows D (\( \overline{x} \) = 83.8 %) and E (\( \overline{x} \) = 84.9 %). This difference in performance between the outer two and the middle two rows might have been due to participants’ tendency, when guessing, to click in the center of the image. The effect of target image location, as well as the effect of eye region, held when analyzed for females and males independently. We again observed a significant difference in participants’ performance for target images depicting a female as compared to a male volunteer (Wilcoxon signed-rank W = 79,196, \( p = 5.01 \times 10^{-22} \); see Table 2). Here the advantage for female relative to male faces did seem to be somewhat larger for female than for male participants (4.0 % vs. 1.6 %, respectively).

Table 2 Second test phase: Performance (in percentage) presented separately for the two sexes of the volunteers depicted in the target image (“Female Volunteer” vs. “Male Volunteer”), and broken down by sex of the participant

In our test–retest paradigm, participants were randomly assigned an order in which they completed parts A and B: Either first A and then B (referred to as “AB”), or first B and then A (“BA”). A minimum of three days separated the two sessions; on average, participants took 6.6 days between sessions. Since a number of participants started but did not complete our test (for this reason, they are not part of the sample of 505 participants we report here), there was a slight discrepancy in sample size for the two different orders: 260 participants completed part A and then part B, whereas 245 participants completed part B and then part A. We investigated whether this discrepancy was due to differential difficulty of the two parts, and found an interaction between difficulty and test order: We observed a small but significant difference in performance between parts A and B for participants’ first session (Mann–Whitney U = 36,150, p = .009; Session 1: \( \overline{x}_{\text{part A}} \) = 75.3 %, \( \overline{x}_{\text{part B}} \) = 71.2 %), but not for participants’ second session (U = 31,039, p = .62; Session 2: \( \overline{x}_{\text{part A}} \) = 81.7 %, \( \overline{x}_{\text{part B}} \) = 82.0 %). However, when we regressed out the influence of test order, we did not observe a significant difference in overall performance between parts A and B (Wilcoxon signed-rank W = 127,508, p = 1; \( \overline{x}_{\text{part A}} \) = 78.4 %, SD = 14.8 %; \( \overline{x}_{\text{part B}} \) = 76.8 %, SD = 16.2 %). We did observe a marginally significant difference between parts in the average time taken to complete each part, again with the influence of test order removed from both measures (W = 71,183, p = .01; \( \overline{x}_{\text{part A}} \) = 4.74 min, SD = 2.31; \( \overline{x}_{\text{part B}} \) = 5.10 min, SD = 2.95).

To see whether there was a learning effect over sessions, we investigated the difference in performance between the two test sessions, combining the results from the two test orders: We observed that—regardless of the participants’ test order—performance was significantly higher for the second session (\( \overline{x} \) = 81.9 %) than for the first session (\( \overline{x} \) = 73.3 %; W = 85,932, \( p = 2.83 \times 10^{-19} \); see Fig. 7). We also observed a significant difference in the time taken between sessions: Participants took less time on average during their second session (\( \overline{x} \) = 4.07 min) than during their first (\( \overline{x} \) = 5.70 min; W = 118,135, \( p = 3.82 \times 10^{-65} \)). However, parts A and B yielded virtually identical results when running the analyses from previous paragraphs—sex differences, age differences, influence of target image location, and eye region—for the two parts independently.

Fig. 7
figure 7

Scatterplot showing test–retest reliability. Scores (in percentages) are plotted separately for participants who completed part A and then B (“AB”; orange dots) and for participants who completed part B and then A (“BA”; blue diamonds). The saturation reflects the number of participants with that particular score. The diagonal lines show the linear model fitted to the data of the two groups: An orange, solid line for “AB,” and a blue, dashed line for “BA.” The vertical and horizontal dotted lines indicate the mean scores of the two sessions (regardless of test order): 73.3 % for participants’ first session (vertical line), and 81.9 % for their second session (horizontal line). The Spearman–Brown-corrected test–retest correlation was .89 for the entire test of 144 trials

For any new test, there are three measures of reliability worth calculating: Internal reliability (the extent to which performance on individual items correlates with overall performance), parallel-forms reliability (to see whether our parts A and B are indeed equivalent), and test–retest reliability (to see whether a test yields similar results at different points in time). The final set of 144 items had an internal reliability estimate of Guttman’s \( \lambda_6 \) = .97. This estimate considers the amount of variance in each item that can be accounted for by the linear regression of all other items—in other words, how similar are the abilities that each item taps? When we calculated the internal reliability estimates for parts A and B independently, they were Guttman’s \( \lambda_6 \) = .94 and .95, respectively. These estimates are almost identical to the estimate for parts A and B combined (see above)—it is thus likely that the data from a single part would yield results very similar to those from the test as a whole. Indeed, if a very brief testing time is critical, it might prove sufficient to administer only one part—in that case, although our results do not point toward a superiority of one part over the other, we advise using part A, in the interest of consistency across studies.
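Guttman's \( \lambda_6 \) can be computed from the inter-item correlation matrix, with each item's squared multiple correlation obtained from the inverse of that matrix, as in the sketch below; the paper does not state which implementation was used (the formulation here matches the common one in the R psych package), so this is a sketch rather than the authors' procedure.

```python
# Sketch of Guttman's lambda-6 from a binary response matrix (see caveats above).
import numpy as np

def guttman_lambda6(responses: np.ndarray) -> float:
    """responses: 0/1 array with rows = participants and columns = items."""
    R = np.corrcoef(responses, rowvar=False)   # inter-item correlation matrix
    smc = 1 - 1 / np.diag(np.linalg.inv(R))    # squared multiple correlation of each item
    return 1 - np.sum(1 - smc) / R.sum()       # 1 - (unexplained item variance) / (total variance)
```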

To obtain the parallel-forms reliability, we correlated participants’ scores on part A with their scores on part B, after having removed—by means of linear regression—any variance due to participant sex and test order. The resulting correlation was Spearman’s ρ = .80 (\( p = 2.27 \times 10^{-111} \)). Our two parallel forms (part A and part B) are thus very similar. Finally, to obtain the test–retest reliability, we correlated participants’ scores on their first session with their scores on their second session, but only after we had used linear regression to remove the variance due to participant sex and test order. The resulting test–retest correlation was Spearman’s ρ = .80 (\( p = 6.62 \times 10^{-116} \); see Fig. 7); the result of a Spearman–Brown correction was a test–retest correlation of \( \frac{2\times .80}{1+\left(2-1\right)\times .80} \) = .89 for the final test as a whole (144 items). Interestingly, we observed a slightly elevated test–retest correlation for those who completed part A and then B (uncorrected ρ = .83), as compared to those who completed part B and then A (uncorrected ρ = .79). We thus suggest—if both parts A and B are indeed administered—that the test order be “AB.”
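The sequence described above, residualizing the two score sets on participant sex and test order, correlating the residuals, and stepping the half-test correlation up to the full test with the Spearman–Brown formula, is sketched below; the 0/1 dummy coding of sex and test order is an assumption.

```python
# Sketch of the reliability computations (dummy coding of covariates assumed).
import numpy as np
from scipy.stats import spearmanr

def residualize(y: np.ndarray, covariates: np.ndarray) -> np.ndarray:
    """Residuals of y after linear regression on the covariate columns."""
    X = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def spearman_brown(r: float, k: float = 2.0) -> float:
    """Predicted reliability when a test is lengthened by a factor k."""
    return k * r / (1 + (k - 1) * r)

# session1, session2: scores per participant; covs: columns coding sex and test order
# rho = spearmanr(residualize(session1, covs), residualize(session2, covs)).correlation
# spearman_brown(0.80)  # -> 0.889, i.e., the corrected full-test reliability of ~.89
```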

Figure 8 gives the cumulative distribution curve of our 505 participants for the final stimulus set, to allow these data to be used as a normative sample for comparison with future studies.

Fig. 8
figure 8

Cumulative distribution of performance for the final set of 144 items, from the data of 505 participants gathered in the second phase of testing

Discussion

We have developed a new, online version of the Mooney test that is suitable for test–retest paradigms. The Mooney–Verhallen Test has an internal reliability estimate of Guttman’s \( \lambda_6 \) = .97 and a Spearman–Brown-corrected test–retest correlation of Spearman’s ρ = .89. Participants take on average 9.85 min to complete all 144 trials, and their performance on our test shows marked individual differences in the perception of Mooney faces. The data from the initial 316-item version did not show a significant difference in mean performance between an online and a lab-based sample, a result consistent with other comparisons of online and lab-based administration of tests of face processing (Germine et al., 2012). A possible further avenue to explore would be to develop an adaptive version of the Mooney–Verhallen Test, in order to fully harness the benefits of item response theory. An adaptive version has the potential to further reduce the required length of testing and to increase discriminability across the entire gamut of Mooney face perception ability.

For both the initial stimulus set (316 items) and the final stimulus set (144 items), we observed a significant sex difference favoring males (6.6 % and 5.9 %, respectively, or 0.40 and 0.42 SDs), confirming earlier results using the original Mooney stimuli (Foreman, 1991; Verhallen et al., 2014). Although previous studies investigating sex differences in face processing have reported mixed results, they largely point to a female superiority (Megreya et al., 2011; Sommer et al., 2013). The sex difference observed on the Mooney test could therefore reflect other, non-face-perception processes at play. A comparison of performance on the Mooney test with performance on other tests of face processing and on tests of visual processing, as well as the manipulation of variables that could affect performance (viewing distance, degradation of images), could shed further light on the exact processes underlying the perception of Mooney images.

In our second test phase, we did not observe a significant association of performance with either age or handedness. The latter finding replicates that of Vigen and colleagues (1982), who also did not observe a significant difference in performance between handedness groups (total N = 100 college students). However, they did observe a significant correlation of performance with age, in that performance deteriorated with age, though for females only (Vigen, Goebel, & Embree, 1982). Although we did not observe a correlation of performance with age here, any analyses of age with our new Mooney test should be interpreted with caution, since the performance measure depends partially on the participant’s dexterity in using a mouse or track pad to click within the correct eye region, a skill at which older participants might be at a disadvantage.

We also observed, in our second test phase, a significant difference in performance between Sessions 1 and 2: An 8.6 % increase of overall performance, regardless of the order in which participants completed the two parts. The average interval between sessions was 6.6 days, and the improvement could be an example of “reminiscence,” which is observed both for motor skills (Buxton, 1943; Hovland, 1951) and for perceptual skills (Karni & Sagi, 1993) when an interval follows training. That performance is aided by reminiscence is suggested by our preliminary finding that mean performance in our second phase of testing was significantly higher for participants who had previously taken part in our first phase of testing (where the interval was more than six months). However, our repeat participants might have been a self-selecting group—those who performed well might have enjoyed the test more, motivating them to take it again during the second round of testing—and our sample of repeat participants was relatively small (N = 43).

Previous research has shown that priming of Mooney faces using nondegraded photographs has an influence on participants’ subsequent judgments of the familiarity of the person depicted in the Mooney image (Jemel et al., 2003). Since our original stimulus set of 316 items contained two images for each volunteer, we investigated a potential priming effect for the ability to perceive a Mooney face. The average performance for the second image of a volunteer—for those volunteer images for which the participant had previously correctly identified the first image—was significantly, though modestly, higher than the average overall performance. In combination with the absence of a significant correlation of performance between volunteer images (i.e., a participant’s performance on one volunteer image did not predict his or her performance on the other volunteer image), this seems to suggest that the perception of the volunteer’s first image somehow influenced, and perhaps aided, the subsequent perception of the volunteer’s second image. Although the Mooney images are two-dimensional, participants may construct an internal three-dimensional model (Moore & Cavanagh, 1998) of the perceived face, which could facilitate perception of a subsequently presented Mooney image of that same face.

In both rounds of testing, we observed a significant difference in performance across target image locations, in that performance for target images presented in the right panel was always significantly lower than for images in the middle and left panels. During the creation of the final stimulus set of 144 items, we reshuffled the 3AFC items in order to balance our set again, thereby changing the position of the target image; yet, the difference was present in both test phases. Although eye movements were unconstrained, the observed difference in performance could be due to preferential processing of one hemifield as opposed to the other. Indeed, a left-hemifield superiority for processing of faces has previously been reported (Bradshaw & Nettleton, 1983), including for Mooney faces (Parkin & Williamson, 1987). Furthermore, neuropsychological, electrophysiological, and fMRI studies suggest a right-hemisphere specialization for the processing of faces (Bentin et al., 1996; McCarthy, Puce, Gore, & Allison, 1997; Newcombe, De Haan, Ross, & Young, 1989), an effect that has also been found for Mooney faces specifically (George et al., 2005; Newcombe & Russell, 1969; Rossion et al., 2011). Future studies could conduct a more rigorous investigation of left-field superiority for Mooney images, either by means of restricting eye movements (having the stimulus disappear whenever the eyes stray too far from a central fixation point) or by briefly flashing the target image in only one hemifield (and subsequently comparing performance across the two hemifields).

The Mooney test remains of interest and continues to be used in visual perception research (especially on face processing), in clinical research, and in studies using brain imaging (Carbon, Grüter, & Grüter, 2013; Grützner et al., 2013; Rivolta et al., 2014; Rossion et al., 2011; Towler, Gosling, Duchaine, & Eimer, 2014). In combination with other measures, and across different populations—both clinical and nonclinical—the quick, reliable, and standardized new Mooney–Verhallen Test might give further insights into the still mysterious nature of closure and its relationship to different genotypes, phenotypes, and behaviors.