Introduction

To explain evolutionary patterns in animal communication, it is critical to understand the mechanisms of signal production, the conditions under which signals are produced, and how signals are perceived by receivers (Bradbury and Vehrencamp 2011; Brumm and Slabbekoorn 2005; Miller and Bee 2012). During the last decade, it has become clear that communication signals in many taxa are more complex than previously thought (Hebets and Papaj 2005). Complex signals can consist of multiple components in a single modality (e.g., acoustic, visual, or chemical) or in multiple sensory modalities (multimodal communication), with components being presented either together or independently (Partan and Marler 1999, 2005). In so-called fixed-composite signals (Smith 1977; Partan and Marler 2005), signal components always occur together. Based on their assumed information content, multimodal signals have been classified as redundant (all signal components elicit an equivalent response in the receiver) or non-redundant (signal components elicit different responses in the receiver). Hebets and Papaj (2005) suggested that multiple signal components may evolve when they increase the signal content (content-based hypothesis), facilitate the perception of each other (inter-signal interaction hypothesis), or enhance signal transmission, for instance in noisy environments (efficacy-based hypothesis).

Acoustic signal detection and discrimination can be constrained by abiotic and/or biotic noise sources such as waterfalls or vocalizing hetero- or conspecifics, thus favoring the evolution of complex signaling strategies (Gerhardt and Klump 1988; Schwartz and Gerhardt 1989; Brumm and Slabbekoorn 2005; Gordon and Uetz 2012), which could facilitate faster and more accurate detection by receivers (Rowe 1999; Otovic and Partan 2009). However, the selection pressures influencing signaling strategies may differ when environmental noise originates primarily from conspecifics compared with other types of noise because conspecific noise contains a high degree of frequency and temporal overlap between the signals and noise (Gerhardt and Huber 2002).

Anuran amphibians are excellent model systems to study multimodal communication in noisy environments. In many anurans, males produce loud advertisement calls that mediate both female choice and male spacing (Ryan 2001; Gerhardt and Huber 2002). Frog communication may take place in dense breeding choruses (Bee and Micheyl 2008) and/or noisy settings like streams that produce broadband ambient background noise (Boeckle et al. 2009). While advertising, males inflate and deflate the vocal sac which has a primary evolutionary function of recycling air during vocalization, thereby increasing the call rate and distributing sound waves omnidirectionally (Rand and Dudley 1993; Pauly et al. 2006). Since the vocal sac is inevitably moving while a male is calling, it can send a fixed-composite signal (see Hirschmann and Hödl 2006 for exception) imparting increased detectability due to movement and coloration (Rosenthal et al. 2004; Taylor et al. 2008). For example, in the dart-poison frog Allobates femoralis, simultaneous acoustic and visual signals are necessary to evoke an aggressive reaction in males defending a territory (Narins et al. 2003; Narins et al. 2005). For females of the squirrel tree frog (Hyla squirella), the availability of the vocal sac as a visual cue makes an unattractive male call more appealing, whereas additional visual information is assessed from lateral body stripes when male calls are equally attractive (Taylor et al. 2007, 2011b). Despite these examples of a preference for multimodal over unimodal signals, there seem to be vast differences in the importance of signal components and the responses they elicit in the receiver, even in species facing similar ecological problems. Since ecological settings to which animals are exposed can be complex (e.g., many calling individuals, various abiotic sources of noise, etc.), we wanted to test receiver responses in a frog’s natural habitat by using an experimental model setup. 
Models present three-dimensional visual stimuli that can be detected from a wide angle of view making their use advantageous when the position of receivers cannot be controlled before starting an experiment. The successful use of robots for testing isolated or combined signal components has been demonstrated in studies on a variety of animals, including frogs (Narins et al. 2003; Taylor et al. 2008; Krause et al. 2011).

The Small Torrent Frog (Micrixalus saxicola) occurs along tropical streams and communicates in large social aggregations. Signal detection and discrimination in M. saxicola could therefore be constrained by both conspecific chorus noise and ambient stream noise. Males display a bright white vocal sac while vocalizing (Fig. 1) and perform additional visual signals (e.g., foot-flagging) in male–male agonistic interactions (Krishna and Krishna 2006). Given the acoustically complex environment in which the frogs occur and their signaling behavior, we investigated whether stream noise and/or chorus noise constrains the acoustic signal component in male–male agonistic behaviors (signal efficacy approach, Hebets and Papaj 2005). To do so, we first characterized acoustic features of the male advertisement call and measured sound pressure levels of calls and background noise during the breeding season. Further, we investigated the sensory components in the male display by providing controlled and naturally occurring stimuli (call alone and call with synchronously presented artificial vocal sac) via an experimental setup and examined male responses. These experiments allowed us to test the inter-signal interaction hypothesis, which predicts that multimodal composites amplify signal detection and discrimination compared to the unimodal acoustic component (Hebets and Papaj 2005).

Fig. 1 M. saxicola male displaying a bright whitish vocal sac during advertising and foot-flagging behavior

Methods

Study site and animals

The Small Torrent Frog (M. saxicola) is endemic to the Western Ghats in India (Daniels 2005) and occurs exclusively along small, fast-flowing streams within the evergreen forests (Chandran et al. 2010). Individuals are diurnal and inhabit perennial streams characterized by low water, air, and soil temperature (Reddy et al. 2002). Males produce calls with a series of pulses from exposed sites on rocks in shallow areas of the stream to advertise for females and defend breeding grounds in relatively crowded aggregations (Krishna and Krishna 2006). We studied a population of M. saxicola located at the Kathalekan Myristica swamp forest (14.27414° N, 74.74704° E) in the central Western Ghats at the end of the monsoon season (September and October 2010). Males in our study population had a mean snout-urostyle length (SUL) of 23.6 mm and a mean mass of 1.1 g (n = 13). Interindividual distance between calling males was measured to determine average receiver distance. Median distance between advertising individuals in the study population was 0.98 m (range, 0.38–2.69 m, n = 15).

The frogs were captured with permission of the Centre for Ecological Sciences, Indian Institute of Science, Bangalore (permission number: D.WL.CR-27/2008-09) and released immediately after taking body measurements. All behavioral experiments were performed without physical contact with the study animals.

Acoustic recordings

After locating a vocalizing male, we recorded advertisement calls from a distance of approx. 1 m, using an omnidirectional microphone (Sennheiser Me 62) and a digital recorder (Zoom HN4; settings: 44.1 kHz, 16-bit resolution). We measured peak sound pressure levels with a sound level meter (Voltkraft SL-100; settings: fast/max, C-weighted) from a distance of 1 m, which approximated the measured median male interindividual distance of 0.98 m. During recordings, the focal male was closer to the microphone than calling neighbors. One second after each advertisement call, a period of 3 s was selected from omnidirectional call recordings to analyze environmental background noise comprising chorusing conspecifics (termed “chorus noise”). We additionally recorded ambient stream noise without male calls (termed “abiotic noise” for simplicity, although it occasionally included insect signals) before and after frog choruses from the same recording position as calls. The stream was regarded as a noise field in which ambient noise intensity was considered almost unchanging within the measured distance of 1 m. For the call analyses, we discarded recordings with overlapping calls from chorusing males. After the sound recordings, we measured SUL and body mass of each focal individual with a sliding caliper to the nearest 0.1 mm and a digital mini scale to the nearest 0.01 g, respectively.

Acoustic features of recordings were extracted and measured using custom-built programs in the PRAAT 5.2.22 DSP package (Boersma and Weenink 2011) that automatically logged relevant variables in an output file. To analyze single call notes, we extracted the voiced intervals of the call and measured note duration in seconds. Call duration in seconds was calculated from note start and end times. The spectral structure of calls was investigated using spectrograms (fast Fourier transform (FFT) method; window length, 0.01; time step, 1,000; frequency step, 500; Gaussian window; and dynamic range, 50 dB). For call frequency analysis, a cross-correlation pitch extraction algorithm was used to produce a time-varying numerical representation of the fundamental frequency (F0) contour for each call. We applied a time step of 0.5 ms over a range of 3,000–6,000 Hz according to the F0 observed in the spectrogram and extracted the parameters mean, minimum, and maximum F0 from the F0 contour. The mean frequency value ±500 Hz was used to apply a filter before measuring sound pressure. To extract parameters from noise files, we applied a similar analysis and computed a long-term average spectrum with a bandwidth of 50 Hz to measure maximum frequency. To obtain sound pressure (SP) values of chorus and abiotic noise within the frequency range of the advertisement call, we applied a pass Hann band filter to the spectrum for frequencies from 4,300 to 5,300 Hz. The extracted relative SP values for call and noise were transformed into absolute SP (in pascal) by assigning the sound pressure measured in the field to the most intensive SP of the complete sound file: $\mathrm{SP}_{\text{absolute}} = \mathrm{SP}_{\text{relative}} \times \mathrm{SP}_{\text{measured}} / \mathrm{SP}_{\text{most intensive}}$, where $\mathrm{SP}_{\text{measured}}$ corresponds to the maximum sound pressure recorded in the field.
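The scaling from relative to absolute sound pressure can be illustrated with a minimal Python sketch. It assumes that the field measurement is available as a peak level in dB re 20 µPa and that the relative values are linear amplitudes; the function names and example numbers are hypothetical and are not taken from the original PRAAT scripts.

```python
P_REF = 20e-6  # reference sound pressure in air (Pa), i.e. 0 dB SPL


def db_to_pascal(level_db):
    """Convert a sound pressure level (dB re 20 uPa) to pascal."""
    return P_REF * 10 ** (level_db / 20)


def absolute_sp(sp_relative, sp_most_intensive, sp_measured_pa):
    """SP_absolute = SP_relative * SP_measured / SP_most_intensive."""
    return sp_relative * sp_measured_pa / sp_most_intensive


# Hypothetical example (not study data): the most intensive point of the
# sound file has a relative amplitude of 0.80 and the sound level meter
# showed a maximum of 70 dB in the field.
sp_measured = db_to_pascal(70.0)
call_pa = absolute_sp(0.40, 0.80, sp_measured)
```

In this example a segment with half the relative amplitude of the file maximum is assigned half of the field-measured pressure.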

Playback experiments

Unimodal (acoustic) and multimodal (acoustic and visual) stimuli were presented on a platform made out of two plastic containers (Fig. 2). The larger container (7 × 18 × 11 cm) was filled with pebbles and placed in the stream where it served as an anchor for the attached smaller container (6 × 10 × 1 cm) and the loudspeaker (Sony SRS-M 30) connected to an MP3 player (Odys Pax). To test if the vocal sac is the primary visual signal component that makes the display more salient to receivers, we presented males exclusively with an artificial inflatable vocal sac. We did not use a stationary model frog as additional visual stimulus or further identification feature. The tip of a white latex glove (inflated diameter, 1 cm) on top of the small container mimicked the vocal sac, which could be inflated by the experimenter by gently blowing air through a 2.5-m-long hose. We tested 10 males with unimodal playbacks and 10 males with multimodal playback presentations, resulting in a total of 20 tested individuals (identifiable through photos). Playback stimuli were presented from the experimental setup, placed 50 cm from the focal individual. From a distance of 1 m from the focal male, the experimenter operated the MP3 player and, in multimodal presentations, manually inflated the artificial vocal sac synchronously with each call. The prerecorded advertisement call was generated by using averaged call values from the studied population (call duration, 2.6 s; note number, 21; mean frequency, 4.6 kHz; intercall interval, 7.4 s). The acoustic stimulus consisted of three advertisement calls with an average intensity of 75 dB at 50 cm. Experimental playback presentations were undertaken only when the focal individual showed no signaling behavior in the prior 60 s. Either unimodal or multimodal stimuli were presented for a period of 30 s, followed by a 90-s control phase, a second presentation of the same stimulus, and a second control phase.
All trials were video recorded with a waterproof camera (Sanyo Xacti WH1) positioned on a tripod. We analyzed frequencies and durations of the behavior categories “calling,” “tapping,” “foot-flagging,” and “position change” during presentation and control phases with the behavioral coding software Solomon Coder (Péter 2011). Tapping constitutes the lifting of either the right or left leg without stretching it, whereas foot-flagging labels the behavior of completely extending the leg above and back in an arc and bringing it back to the body side (Hödl and Amézquita 2001). Behaviors termed position change included approach, moving away, and turn. Digital photographs of dorsal patterns and colorations allowed individual recognition and ensured that we were able to avoid repeat testing of the same individuals.
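The timing of one playback trial can be sketched as follows. This is an illustrative Python sketch, not the authors' procedure: it assumes that each of the three calls is followed by one intercall interval, which reproduces the stated 30-s stimulus period, and all names are hypothetical.

```python
CALL_DURATION_S = 2.6       # synthetic call duration given in the text
INTERCALL_INTERVAL_S = 7.4  # interval assumed to follow each call
CALLS_PER_STIMULUS = 3
CONTROL_PHASE_S = 90.0
REPEATS = 2                 # stimulus plus control phase, presented twice


def build_schedule():
    """Return a list of (phase, start_s, end_s) tuples for one trial."""
    schedule = []
    t = 0.0
    for _ in range(REPEATS):
        for _ in range(CALLS_PER_STIMULUS):
            # In multimodal trials the vocal sac would be inflated
            # during these same call windows.
            schedule.append(("call", t, t + CALL_DURATION_S))
            t += CALL_DURATION_S + INTERCALL_INTERVAL_S
        schedule.append(("control", t, t + CONTROL_PHASE_S))
        t += CONTROL_PHASE_S
    return schedule


schedule = build_schedule()
stimulus_length = CALLS_PER_STIMULUS * (CALL_DURATION_S + INTERCALL_INTERVAL_S)
trial_length = schedule[-1][2]
```

Under these assumptions each stimulus period lasts 30 s and a complete trial lasts 240 s.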

Fig. 2 Schematic diagram of the experimental setup positioned 50 cm from the focal individual. In the stream, the lower box (1) serves as an anchor for the upper setup and a loudspeaker (2) connected to an MP3 player (3). A silicone hose (4), operated by the experimenter, runs through the upper box (5) to the artificial vocal sac, the tip of a latex glove (6)

Data analysis

To test the hypothesis that M. saxicola advertisement calls are masked by noise, we analyzed 112 calls and respective noise recordings from 13 individuals, in turn comprising measurements at 13 positions in the stream. We compared maximum sound pressure values of the acoustic factors: advertisement calls, abiotic noise, abiotic noise in the frequency range of the call (filtered abiotic noise), and chorus noise in the frequency range of the call (filtered chorus noise) using a linear mixed model (LMM). The LMM allows repeated measurements of the same individual to be fitted in the model as random variables, thus controlling for differing number of calls per male and notes per call. The statistical assumptions for LMM analysis were met (Kolmogorov–Smirnov test).

The sound pressure values (in pascal) of all acoustic factors were entered as dependent variables, with the acoustic factors as predictor variables. We entered the identities of male (call) and call (note) as nested random variables, to correct for differences between male individuals, number of calls per male and number of notes per call. For post hoc tests, we used Student’s t statistic with sequential Bonferroni correction for alpha because of repeated pairwise comparisons.
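For readers unfamiliar with the sequential Bonferroni correction, the following Python sketch shows the step-down logic, assuming that the Holm procedure is meant. The P values are hypothetical and only illustrate six pairwise comparisons among four acoustic factors.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Sequential (Holm) Bonferroni: return a reject decision per test.

    P values are tested from smallest to largest against
    alpha / (m - rank); testing stops at the first comparison
    that is not significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger P values are retained as well
    return reject


# Hypothetical P values for six pairwise comparisons (not study data)
p_vals = [0.0004, 0.0002, 0.068, 0.0001, 0.0009, 0.0003]
decisions = holm_bonferroni(p_vals)
```

With these example values, only the comparison with P = 0.068 is retained as non-significant.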

A second LMM was conducted to evaluate the differences between dominant frequencies of call and background noise. To compare frequencies of call and noise, the dominant frequencies of these parameters were entered as dependent variables with call and noise as predictor variables. The identities of male (call) and call (note) were entered as nested random variables. To test if male SUL and body mass influence mean dominant call frequency, we performed a linear regression analysis.
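The regression of call frequency on body size can be sketched with ordinary least squares for a single predictor. The Python example below is a minimal illustration with hypothetical values; it does not reproduce the SPSS analysis or the study data.

```python
def linear_regression(x, y):
    """Ordinary least squares for one predictor.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical SUL (mm) and dominant frequency (Hz) values, not study data
sul_mm = [22.0, 23.0, 24.0, 25.0]
freq_hz = [4900.0, 4800.0, 4750.0, 4650.0]
slope, intercept, r2 = linear_regression(sul_mm, freq_hz)
```

A negative slope, as in this example, would correspond to larger males producing calls of lower frequency.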

To investigate the hypothesis that the inflating vocal sac acts as an additional visual cue, we compared behavioral responses to unimodal and multimodal playbacks using a two-tailed Mann–Whitney U test for independent samples. To analyze differences in call duration in response to playback presentations, we used an LMM to correct for differing numbers of calls per individual. Call duration was entered as a dependent variable, with modality (unimodal vs. multimodal) as the predictor variable. A nested term was included for the identity of male (call) as a random variable. All analyses were undertaken using SPSS version 19 (SPSS Inc., Chicago, IL, USA).
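The comparison of two independent groups with the Mann–Whitney U test can be sketched in plain Python. This is a simplified illustration that uses the normal approximation without tie or continuity correction, so its P values may differ slightly from those of statistical packages; the call counts are hypothetical.

```python
import math


def midranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U, two-tailed P) based on the normal approximation."""
    n1, n2 = len(sample_a), len(sample_b)
    ranks = midranks(list(sample_a) + list(sample_b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal P value
    return u, p


# Hypothetical call counts per male (N1 = N2 = 10), not study data
unimodal = [0, 1, 0, 2, 1, 0, 3, 1, 0, 2]
multimodal = [3, 4, 2, 5, 6, 3, 4, 7, 2, 5]
u_stat, p_value = mann_whitney_u(unimodal, multimodal)
```

Because each male was tested with only one stimulus type, the two groups are independent, which is the situation this test is designed for.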

Results

Frog calls and environmental noise

Advertisement calls of M. saxicola (Fig. 3) had an average duration of 2.0 ± 0.1 s (all results ± SE and N = 13 in all cases) and comprised a series of 21 ± 1 notes with an average duration of 0.021 ± 0.001 s (Fig. 3a, b). Single pulsed notes were produced at the beginning and end of the call with an inter-note interval of 0.136 ± 0.005 s, whereas grouped notes in the middle of the call had multiple pulses and an inter-grouped notes interval of 0.03 ± 0.001 s (Fig. 3c). The frequency of the advertisement calls averaged 4,771 ± 29 Hz (range, 4,574–4,969 Hz, Fig. 3d) and was negatively influenced by SUL (linear regression: N = 13, r² = 0.37, P = 0.016), but not affected by body weight (linear regression: N = 13, r² = 0.08, P = 0.176).

Fig. 3 Multi-note advertisement call of M. saxicola (a–d). The spectrogram (a) of a single call (FFT method; window length, 0.005 s; time step, 1,000; frequency step, 1,000; Gaussian window; dynamic range, 30 dB); b the corresponding waveform; c a close-up of the three indicated notes from the same male; d power spectrum showing the peak of the call at 4.9 kHz relative to the ambient stream noise

The call frequency differed clearly from the low-frequency-dominated stream noise (LMM: pairwise comparison, β = 4168; SE = 22; t = 188.087, P < 0.001). The maximum SP of the call averaged 0.056 Pa (69 dB) at a distance of 1 m. Overall, the SP of call and noise differed significantly (LMM: F(3, 2,357) = 39.806, P < 0.001). At 1 m distance, the call had a higher SP than abiotic noise (LMM: pairwise comparison, P < 0.001, Fig. 4), but did not differ from the SP of conspecific chorus noise filtered in the frequency range of the call (LMM: pairwise comparison, β = 0.010; SE = 0.005; t = 1.829, P = 0.068). The estimated maximum SP of chorus noise averaged 0.046 Pa (67 dB), resulting in a difference of −2 dB relative to the analyzed frog calls.
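The relation between the reported pascal and decibel values can be checked with a short calculation. The Python sketch below assumes the standard reference pressure of 20 µPa for airborne sound; the variable names are illustrative.

```python
import math

P_REF = 20e-6  # 0 dB SPL reference pressure in air (Pa)


def spl_db(pressure_pa):
    """Sound pressure level in dB re 20 uPa."""
    return 20 * math.log10(pressure_pa / P_REF)


call_pa = 0.056    # mean maximum SP of the call at 1 m (from the text)
chorus_pa = 0.046  # mean maximum SP of filtered chorus noise (from the text)

call_db = spl_db(call_pa)      # about 68.9 dB
chorus_db = spl_db(chorus_pa)  # about 67.2 dB

# Level of the chorus noise relative to the call, about -1.7 dB
noise_re_call_db = 20 * math.log10(chorus_pa / call_pa)
```

Rounded to whole decibels, these values give the 69 dB, 67 dB, and −2 dB stated above.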

Fig. 4 Comparison of sound pressure of advertisement calls of M. saxicola and the background noise. Shown here are estimated means (points), standard error (boxes), and 95 % confidence intervals (whiskers) of the call (N = 13), abiotic noise (N = 13), abiotic noise filtered in the frequency range (4.3–5.3 kHz) of the call (N = 13), and chorus noise filtered in the same frequency range as abiotic noise. All pairwise comparisons apart from call and chorus noise filtered indicate significant differences (P < 0.001)

Playback experiments

When presented with multimodal stimuli, all tested males increased the number of calls they produced (Mann–Whitney U test = 12, N1 = N2 = 10, P = 0.004) and tapping behaviors (Mann–Whitney U test = 12, N1 = N2 = 10, P = 0.002) and performed more position changes (Mann–Whitney U test = 17, N1 = N2 = 10, P = 0.007) in comparison to unimodal trials. Most interestingly, foot-flagging behavior could only be elicited by multimodal playbacks (Fig. 5), and the mean call duration increased from 2.2 ± 0.7 s during unimodal playbacks to 5.0 ± 0.36 s during multimodal presentations (LMM: F(1, 165) = 12.519, P = 0.001, Fig. 6).

Fig. 5 Behavioral responses of M. saxicola to unimodal and multimodal playback stimuli. **P < 0.01; *P < 0.05, statistically significant differences between responses (Mann–Whitney U test; for each stimulus presentation N = 10)

Fig. 6 Differences in duration of M. saxicola calls in response to unimodal (acoustic) and multimodal playback presentations (P = 0.001; LMM)

Discussion

Our results show that sound pressure levels of M. saxicola male calls significantly exceeded the ambient abiotic noise level in the frogs’ habitat. Stream noise had less energy than frog calls across the entire human audible frequency range. In contrast to the Bornean Splash Frogs (Staurois), which inhabit low-frequency-dominated, torrential streams and waterfalls (Boeckle et al. 2009; Grafe et al. 2012), M. saxicola is unlikely to be constrained in acoustic signal detection by continuous stream noise alone. Chorus noise, however, appeared more likely to hamper individual call detection, as the measured SPL difference between male M. saxicola calls and chorus noise was small (2 dB). Unfortunately, the perceptual capabilities of M. saxicola’s auditory system are unknown, but studies on other frog species suggest that the substantial frequency overlap of conspecific noise observed in the present study is likely to interfere with acoustic signal detection (Wilczynski et al. 1993; Schwartz and Gerhardt 1998; Wollerman and Wiley 2002; Bee 2008; also see Schwartz and Gerhardt 1998 for improved call detection in the presence of noise). Chorus noise can vary with season and population density, thereby creating a fluctuating environment, which has been suggested to favor the evolution of multimodal signals (Bro-Jørgensen 2010).

In the presence of noise, several different strategies can increase the probability of signal recognition and detection. One strategy to provide release from masking is spatial distribution. In Cope’s gray tree frog (Hyla chrysoscelis), improvements in signal detection and mate recognition are obtained when a signal is spatially separated from a masker, particularly at a signal-to-noise ratio (SNR) of −3 dB (Bee 2008). However, if calls are not spatially separated from background noise, females of a Neotropical treefrog (Hyla ebraccata) located calls with +3 dB but not +1.5 dB SNR (Wollerman 1999). In several bird species, spectral shifts and/or amplitude adjustments have been reported in areas with noise (Slabbekoorn and Peet 2003; Nemeth and Brumm 2010). Likewise, adaptive strategies such as high-frequency calls have evolved in anurans (Feng and Narins 2008; Boeckle et al. 2009) to increase the signal-to-noise ratio in unimodal signals. There is also an increasing number of examples of the use of visual displays to enhance signal efficacy in noisy environments, including in crabs (Uca mjoebergi) and lizards (Anolis cristatellus, A. gundlachi, and Amphibolurus muricatus) (Peters and Evans 2003; Ord et al. 2007; Milner et al. 2008).

The observed +2 dB SNR in M. saxicola could be a sufficient detection threshold, but we also have to consider that the results can be explained by differences in distance to the microphone between the focal male and the more distant, and thereby degraded, neighbor calls. In dense aggregations and close-range interactions, we would expect the SNR to be less or even negative depending on the position of the receiver. Hence, spatial segregation of opponent males could reduce masking, and interacting signal components could be beneficial for early detection and localization of conspecifics.
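How the SNR might change with receiver position can be illustrated with a simple distance calculation. The Python sketch below is an idealized example that assumes spherical spreading in a free field (6 dB loss per doubling of distance) and equally loud callers; it ignores reflections from water and rocks as well as excess attenuation, none of which were measured in this study.

```python
import math


def level_at_distance(level_db, ref_distance_m, distance_m):
    """Level after spherical spreading (6 dB loss per doubling of distance)."""
    return level_db - 20 * math.log10(distance_m / ref_distance_m)


CALL_DB_AT_1M = 69.0  # mean call level at 1 m reported in the Results

# Level of the same call for receivers at 0.5 m and at 2 m
near_db = level_at_distance(CALL_DB_AT_1M, 1.0, 0.5)  # about 75 dB
far_db = level_at_distance(CALL_DB_AT_1M, 1.0, 2.0)   # about 63 dB


def snr_db(signal_distance_m, masker_distance_m):
    """SNR at the receiver for two equally loud callers (assumption)."""
    signal = level_at_distance(CALL_DB_AT_1M, 1.0, signal_distance_m)
    masker = level_at_distance(CALL_DB_AT_1M, 1.0, masker_distance_m)
    return signal - masker


# A receiver 2 m from the focal male but 1 m from a calling neighbor
snr_far_focal = snr_db(2.0, 1.0)  # about -6 dB
```

Under these assumptions, 69 dB at 1 m corresponds to about 75 dB at 0.5 m, which agrees with the playback level used in the experiments, and a receiver positioned closer to a neighbor than to the focal male would experience a negative SNR.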

In our behavioral experiments, multimodal stimuli significantly increased the frequency of response behaviors compared to unimodal acoustic stimuli, and foot-flagging behavior could only be elicited by multimodal stimuli. We suggest that the visual component acts as an amplifier of the acoustic component, supporting the inter-signal interaction hypothesis. The advertisement call may serve as a long-range signal (Bee 2007), and integration of a pulsating vocal sac could facilitate localization in dense aggregations of concurrently calling individuals (Gomez et al. 2011; Taylor et al. 2011a, b). The localization and detection of a caller is more difficult when masked by conspecific calls with a high degree of spectral overlap (Marshall et al. 2006), making the visual epiphenomenon particularly advantageous in large choruses. The visual component as part of the acoustic signal has been suggested to modulate male reactions including attacks (Narins et al. 2003; de Luna et al. 2010); therefore, multimodal signals displayed in close proximity could trigger a more intense response. Across-species quantification of multimodal signals in spiders (Hebets 2008), fish (Van Staaden and Smith 2011), and anurans (Taylor et al. 2011a) provides evidence for response variations and highlights differences in signal dominance and receiver perception. It remains difficult to draw conclusions on signal function, but quantification of receiver responses under different environmental conditions will help to explain the processes acting on complex signals.

Hödl and Amézquita (2001) discussed ecological conditions favoring the evolution of visual signals in anurans, such as displays at elevated perches, diurnality, and ambient noise, all of which apply to the study species M. saxicola. Although males can be observed advertising year-round, aggregation density is greatest during the presumed main breeding period at the end of the monsoon season (Gururaja, personal communication). During this period, large aggregations form in certain parts of the stream that provide favorable conditions for reproduction, such as shallow water riffle areas where males perch on rocks and display and females dig underwater oviposition cavities (Gururaja 2010).

When attending conspecific choruses, males increase the probability of attracting a mate (Gerhardt and Huber 2002) but have to face continuous noise levels and limited options to deal with masking interference. However, an increase in signal duration, as observed in M. saxicola males during multimodal stimulus presentations, could not only indicate fighting ability to the opponent and facilitate spacing but also enhance a male’s detectability for females in a chorus. Additionally, playback experiments in numerous anuran species have demonstrated that females prefer long call durations, which are usually associated with energetic costs (Gerhardt and Huber 2002) and possible weight loss in males during the breeding season (Murphy 1994). Calling activity and mating tactics could be related to body mass in M. saxicola, whereas spectral features of the call, determined mostly by larynx size (Gerhardt and Huber 2002), are not expected to be affected by the weight of an individual. Snout-urostyle length, however, showed a negative influence on call frequency, which could be a reliable cue to body size for receivers when detecting a call. Lower frequency calls would indicate larger body size, which is usually preferred by female conspecifics (Ryan and Keddy-Hector 1992).

We conclude that limited shallow water areas in the stream used for reproduction by M. saxicola lead to strong competition between males, and dense breeding choruses create constant background noise levels. Our results indicate that multimodal signals evoke stronger agonistic responses than the call alone in this species and that foot-flagging is elicited only by multimodal stimuli. Thus, we suggest that the acoustic signal component modulated by the visual component makes the display more salient and facilitates localization and detection of nearby opponent individuals.

The vocal sac in anurans did not evolve as a visual cue but as an organ to improve calling ability, yet its role in communication has been demonstrated in a number of studies. Due to its evolutionary background, the vocal sac’s secondary function as a visual cue or signal component is inevitably linked to the acoustic component. Accordingly, it is not surprising that the present and several other studies found interaction between the acoustic and visual signal components (Narins et al. 2003; Rosenthal et al. 2004; Gomez et al. 2011; Zeyl and Laberge 2011). In the majority of anuran species, the vocal sac is part of a multimodal fixed-composite signal, but demonstrating whether the visual component adds signal content not included in the call and assessing signal information content remain difficult tasks. In M. saxicola, the visual displays (e.g., foot-flagging) presented independently of calls further add to the complexity in communication behavior but may also allow for sophisticated behavioral experiments. We suggest a research approach focusing on receiver detection sensitivity and receiver response to visual signaling behaviors that can be performed independently of the auditory signal (e.g., foot-flagging) to explain how selection on senders and receivers promotes complex displays under different acoustic and environmental conditions. Moreover, further across-species comparisons of how single and combined signal components influence receivers are essential to draw conclusions on signal function.