In human speech, words often refer to objects (e.g., “sun”), causing listeners to retrieve visual mental images of the target referents (Kreiman, Koch, & Fried, 2000; Pearson, Naselaris, Holmes, & Kosslyn, 2015). In contrast, animal communication signals have typically been considered motivational: Signals do not convey referential information but simply evoke a stereotyped response in receivers (Rendall, Owren, & Ryan, 2009). Although some mammals and birds have been shown to produce specific calls for specific food or predators, it remains open to debate whether these calls can be considered referential in the sense that signals evoke mental images of objects in receivers (Suzuki, 2016; Wheeler & Fischer, 2012). In this article, I introduce an experimental study providing the first evidence for call-evoked visual mental images in wild birds (Suzuki, 2018).

Japanese tits (Parus minor) produce acoustically distinct alarm calls (“jar”) when, and only when, encountering a predatory snake (Suzuki, 2014). These calls could be considered “functionally referential” because they elicit specific antisnake behaviors in receiver tits (Suzuki, 2011, 2012, 2015). For example, when incubating eggs in the nest, adults respond to jar calls by immediately escaping from the nest cavity, allowing them to evade attacks from snakes that can invade the cavity (Suzuki, 2015). When outside of the nest cavity, tits respond to jar calls by looking down at the ground near their nest tree, which would be an adaptive behavior for locating snakes approaching from the ground (Suzuki, 2012).

In addition to snake-specific jar calls, tits have evolved another call type (“chicka”) used for a wide range of predator types, such as avian and mammalian predators (Suzuki, 2014). Chicka calls generally comprise multiple types of notes and can vary in their note composition according to the eliciting context (Suzuki, 2014). In response to typical chicka calls (composed of A, B, C, and D notes), receivers approach the sound source and scan the surroundings by moving their heads horizontally, enhancing the detection of a variety of predators (Suzuki, Wheatcroft, & Griesser, 2016). However, tits hearing chicka calls do not show any antisnake behavior, such as looking down toward the ground or escaping out of the nest (Suzuki, 2012, 2015).

Based on these observations, I hypothesized that jar calls refer to snakes and evoke their visual search image in receiver tits. In cognitive and neural sciences, visual mental imagery can be defined as representations and the accompanying experience of sensory information without a direct external stimulus (Kreiman et al., 2000; Pearson et al., 2015; Richardson, 1969). Therefore, at the behavioral level, receivers who retrieve a visual mental image from referential words are expected to have enhanced visual perception of the target object even when they have not yet visually perceived the object (Kok, Mostert, & de Lange, 2017; Lupyan & Ward, 2013). According to this view, in the case of Japanese tits, receivers hearing jar calls are expected to be more visually perceptive to snakes than are those hearing other call types, even in the absence of real snakes.

I tested whether simply hearing jar calls causes tits to become more visually perceptive to snakes in the absence of real snakes (Suzuki, 2018). First, I attracted a tit by the playback of snake-specific jar calls. Second, I presented a wooden stick that was moved in a snakelike manner. If tits retrieve a visual mental image of a snake from jar calls, they may use this image to search out a snake and then show a specific response to the snakelike, moving stick. Because the stick somewhat resembles a snake but does not by itself evoke a specific behavior, this design can test whether jar calls evoke visual information about snakes without the tit seeing a real snake.

Tits approached a stick moving snakelike along a tree trunk during the playback of jar calls. However, tits did not respond to the same stick when hearing other call types (chicka calls or recruitment calls) or when the stick’s movement was dissimilar to that of a snake (i.e., swinging on a low shrub). Moreover, the same approach response to the stick was observed when the stick was moved snakelike on the ground in combination with jar calls, but not with chicka calls. These results indicate that before having detected a real snake, tits retrieve a visual search image from jar calls and use this to search out snakes.

Bond (2018) claimed that these results “do not take us far beyond our original account of chained action patterns. If the innate response to chicka calls is to stare at the sky, while that to jar calls is to scan the ground, it does not seem surprising that chicka birds would tend to overlook terrestrial moving sticks” (paragraph 5 of Bond 2018).

However, this claim is based on a misinterpretation of the experimental design. First, no study has shown that chicka calls cause tits to look up at the sky (Suzuki, 2012). Rather, playback experiments revealed that the chicka calls used for this experiment (composed of A, B, C, and D notes) elicit horizontal movements of the head and an approach to the speaker (Suzuki et al., 2016). Second, to control for the possibility that fixed response patterns (e.g., horizontal scanning) may influence the detection of a stick at a certain spatial location, I tested tits’ responses to the movement of sticks at two different locations, one on the ground (i.e., terrestrial) and the other along a tree trunk. A stick moving along a tree trunk would be more readily detected when the birds scanned the horizon (see Fig. 1a–b). In contrast, a stick moving on the ground would be more readily perceived when the birds looked downward than when they scanned the horizon (see Fig. 1c–d). However, regardless of the spatial location, birds approached a stick moving like a snake more often when hearing jar calls than when hearing chicka calls. This consistency in results indicates that tits could readily detect sticks from a distance of ca. 3 m, where they first approached the speaker, regardless of the spatial locations of the moving sticks and the type of alarm calls heard.

Fig. 1

Schematic representations of the experimental setup. Experiments included the following treatments: a stick moving along a tree trunk in combination with the playback of either jar (a) or chicka (b) calls, and a stick moving on the ground in combination with the playback of either jar (c) or chicka (d) calls. If tits’ responses depend on chained action patterns, then tits are expected to approach a stick when they have a greater chance of visually perceiving it (b and c). In contrast, if tits’ responses depend on the retrieval of a visual search image, but not on the difference in head movements, tits are expected to approach the stick only when hearing jar calls (a and c). The stick was moved by pulling a thin string attached to one tip of the stick. See details in Suzuki (2018).

An important result is that tits approach a stick when hearing jar calls, but not when hearing other call types (chicka calls or recruitment calls). This argues against the possibility that, when visually perceiving a stick, tits obligatorily approach it as part of a “chain of action,” no matter what call types they have heard. Another possible control is to present a stick to tits without any call playback. However, in the field setting, tits normally engage in other activities, such as foraging or singing, which makes it impractical to present a stick to a focal bird at a close distance. Similarly weak responses during the playbacks of chicka calls and recruitment calls argue against the possibility that increased arousal (or simply hearing alarm signals) by itself evokes a specific behavior toward the stick movement. In addition, the playback of recruitment calls mimics the situation where tits are exposed to a stick in nonpredatory contexts, since these calls are normally used by birds when facilitating social cohesion with their mated partners or flock members (Suzuki et al., 2016, 2017).

Another important comparison is that tits approach a stick only when it is moved like a snake; that is, they do not approach a stick moving in a nonsnakelike manner (swinging) even when hearing jar calls. This result indicates that jar calls do not merely evoke curiosity that leads tits to approach any novel object (e.g., moving sticks) but instead enhance their visual perception of snakelike objects. These findings suggest that jar calls cause tits to adopt a visual search image of a snake to search out snakes, although it is still unclear how finely tits retrieve morphological features of real snakes.

Bond (2018) proposed an experimental design to test visual search images using realistic snake models. His experimental design is composed of four treatments: (1) uncued = no call, just the realistic snake model; (2) correctly cued = jar call, accompanied by the correct model; (3) miscued = jar call, accompanied by something that violates the cued expectation, perhaps a model of a different snake species; and (4) false alarm = jar call, accompanied by nothing at all (paragraph 8 of Bond 2018). However, I disagree that his design allows us to validate “visual mental images.” As visual mental images are defined as “the retrieval of visual information without seeing a real object” (Kreiman et al., 2000; Pearson et al., 2015; Richardson, 1969), their presence could be tested under conditions where birds cannot perceive any snake. A similar paradigm has been used in a human study where referential words (nouns) boosted detection of target referents that were hardly recognized without cues (Lupyan & Ward, 2013). Models with different levels of resemblance to real snakes could be used to test how birds discriminate between snakes and nonsnakes; however, this does not provide a powerful way to validate the presence of visual mental images.

In summary, Suzuki (2018) provides a novel experimental paradigm to explore call-evoked visual search images in wild animals. If calls are shown to evoke visual search images in receivers, then they could be considered truly referential (Suzuki, 2016; Wheeler & Fischer, 2012), providing a clue to recent debates on the referentiality of animal signals (Rendall et al., 2009; Seyfarth et al., 2010). As Bond (2018) suggested, the exact template of a snake’s image remains uncertain. However, it is possible that Japanese tits form concepts of snakes that can be generalized to different shapes or movements of snakes, just as humans do. Future studies will be required to uncover the cognitive processes underlying this sophisticated referential communication in greater detail, which promises to shed new light on the ecological importance of visual mental imagery in wild animals.