8.1 Introduction

Researchers have a natural tendency to classify biological systems into categories. For example, organisms can be classified based on biome, ecosystem, taxon, phylogeny, niche, demographic class, behavior type, etc., and this allows complex systems to be organized. Categorization also can make recognition of patterns easier and assist in understanding the ways in which biological systems work. Classification provides a convenient method for comparing features, making systematic measurements, testing hypotheses, and performing statistical analyses.

Bioacousticians have categorized sounds produced by animals for decades, and new methods for classification continue to be developed (Horn and Falls 1996; Beeman 1998). Animals produce many different types of sounds that span orders of magnitude along the dimensions of time, frequency, and amplitude. For example, the repertoire of marine mammal acoustic signals includes broadband echolocation clicks as short as 10 μs in duration and with energy up to 200 kHz, as well as narrowband tonal sounds as low as 10–20 Hz, lasting more than 10 s. Songbirds and some species of baleen whales arrange individual sounds into patterns called song and repeat these patterns for hours or days. Some mammal species produce distinctive, stereotyped sounds (e.g., chipmunks, dogs, and blue whales), while others produce signals with high variability (e.g., mimicking birds, primates, and dolphins).

Because animals produce so many different types of sounds, developing algorithms to detect, recognize, and classify a wide range of acoustic signals can be challenging. In the past, detection and classification tasks were performed by an experienced bioacoustician who listened to the sounds and visually reviewed spectrographic displays (e.g., for birds by Baptista and Gaunt 1997; chipmunks by Gannon and Lawlor 1989; baleen whales by Stafford et al. 1999; and delphinids by Oswald et al. 2003). Before the advent of digital signal analysis, data were analyzed amid the acrid smell of etched Kay Sona-Graph paper, with piles of 8-s printouts removed from a spinning recording drum littering laboratory tables and floors. Output from a long-duration sound had to be spliced together (see Chap. 1). Many bioacoustic studies generated an enormous amount of data, which made this manual review process at best inefficient, and at worst impossible to accomplish.

For decades, scientists have worked to automate the process of detecting and classifying sounds into categories or types. Automated classification involves three main steps: (1) detection of potential sounds of interest, (2) extraction of relevant acoustic characteristics (or, features) from these sounds, and (3) classification of these sounds as produced by a particular species, sex, age, or individual. Methods for the automated detection of sounds have progressed quickly with technological advances in digital recording (see Chap. 2). Likewise, the extraction of sound variables useful in analysis has expanded with an increasing amount of information provided by new technology. For instance, where features such as maximum frequency or time between sounds originally were measured manually off sonagraph paper, devices today allow for measuring these, and many more variables, automatically or semi-automatically using computer software. Now, derived variables, such as time difference between individual signal elements, frequency modulation, running averages of sound frequency, and harmonic structure can be easily obtained for classifying the sounds in a repertoire.

Some of the earliest methods used for automated detection and classification included energy threshold detectors (e.g., Clark 1980) and matched filters (e.g., Freitag and Tyack 1993; Stafford et al. 1998; Dang et al. 2008; Mankin et al. 2008). These methods were used to detect and classify simple, stereotypical sounds produced by species such as the Asian longhorn beetle (Anoplophora glabripennis), cane toads (Rhinella marina), blue whales (Balaenoptera spp.), and fin whales (Balaenoptera physalus). Once sounds are detected, they can be organized into groups, or classified, based on selected acoustic characteristics. For example, development of methods for detection and automated signal processing of bat sounds led to a variety of automated, off-the-shelf, ready-to-deploy bat detectors that detect and classify sounds by species (Fenton and Jacobson 1973; Gannon et al. 2004). These detectors can be very useful in addressing biological or management issues in ecology, evolution, and impact mitigation. While the accuracy and robustness of automated approaches are always a matter of concern (Herr et al. 1997; Parsons et al. 2000), modern techniques promise much improved recognition performances that could rival manual analyses (e.g., Brown and Smaragdis 2009).

Multivariate statistical methods can be powerful for classification of sounds produced by species with variable vocal repertoires because they can identify complex relationships among many acoustic features (see Chap. 9). With the advent of powerful personal computers in the 1980s and 1990s, the use of multivariate techniques became popular for classifying bird sounds (e.g., Sparling and Williams 1978; Martindale 1980a, b). Since then, enormous effort has been expended to develop these and other automatic methods for the detection of sounds produced by many taxa and their classification into discrete categories, such as species, population, sex, or individual.

These days, there are applications (apps) for smartphones that use advanced algorithms to automatically detect and recognize sounds. For example, the BirdNET app detects and classifies bird song—similar to the Shazam app for music—and provides a listing of the top-ranked matching species. It includes almost 1000 of the most common species of North America and Europe. A similar app, Song Sleuth, recognizes songs of nearly 200 bird species likely to be heard in North America and also provides references for species identification, such as the David Sibley Bird Reference (Sibley 2000), allowing the user to “dig into” the bird's biology and conservation needs.

In this chapter, we present an overview of methods for detection and classification of sounds along with examples from different taxa. No single method is appropriate for every research project and so the strengths and weaknesses of each method are summarized to help guide decisions on which methods are better suited for particular research scenarios. Because algorithms for statistical analyses, automated detection, and computer classification of animal sounds are advancing rapidly, this is not a comprehensive overview of methods, but rather a starting point to stimulate further investigations.

8.2 Qualitative Naming and Classification of Animal Sounds

Prior to computer-assisted detection and classification of animal sounds, bioacousticians used various qualitative methods to categorize sounds.

8.2.1 Onomatopoeic Names

Frequently, researchers describe and name animal sounds based on their perception of the sound and thus based on their own language. This approach has been common in the study of terrestrial animals (in particular, birds) and marine mammals (in particular, pinnipeds and mysticetes). Researchers also have given onomatopoeic names to sounds. These are names that phonetically resemble the sound they describe. For example, the sounds of squirrels and chipmunks have been described as barks, chatters, chirps, and growls. The primate literature is also rich in these sorts of sound descriptions (e.g., the hack sequences and boom-hack sequences described for Campbell’s monkeys, Cercopithecus campbelli; Ouattara et al. 2009). Bioacousticians studying humpback whales (Megaptera novaeangliae) have described a repertoire of sounds including barks, bellows, chirps, cries, croaks, groans, growls, grumbles, horns, moans, purrs, screams, shrieks, sighs, sirens, snorts, squeaks, thwops, trumpets, violins, wops, and yaps (Dunlop et al. 2007, 2013). While it is potentially convenient for researchers within a group to discuss sounds this way, it is more difficult for others, and perhaps impossible for foreign-language speakers to recognize the sound type. An example of this difficulty in describing a sound is the ubiquitous rooster crow, which can be described by a US citizen as “cock-a-doodle-doo” and by a German citizen as “kikeriki”. Roosters make the same sound, no matter in which country they live, yet their single sound has been named so differently, as has the bark of dogs (Fig. 8.1). Of course, onomatopoeic naming of sounds also fails when the sounds are outside of the human hearing range.

Fig. 8.1
figure 1

Dogs speak out. Labels used for dog barks in different countries

If the above was not confusing enough, bird calls have been described using onomatopoeic phrases. For example, the song of a white-throated sparrow (Zonotrichia albicollis) has been described in Canada as sounding like “O sweet Canada Canada Canada” and in New England, USA, as “Old Sam Peabody Peabody Peabody.” Another example is the barred owl (Strix varia), which hoots “Who cooks for you? Who cooks for you all?”.

8.2.2 Naming Sounds Based on Animal Behavior

Researchers sometimes name sounds based on observed and interpreted animal behavior. For example, the various echolocation signals described for insectivorous bats have been named “search clicks” (i.e., slow and regular clicks) while pursuing insect prey and “terminal feeding buzz” (i.e., accelerated click trains) during prey capture (Griffin et al. 1960). The bird and mammal literature is replete with sounds named for a behavior, such as the begging call of nestling chicks (Briskie et al. 1999; Leonard and Horn 2001), the contact call for isolated young (Kondo and Watanabe 2009), and the alarm call warning of a nearby predator (Zuberbuhler et al. 1999; Gill and Bierema 2013). In some cases, the function of sounds has been studied in detail, which justifies using their function in the name. Examples are feeding buzzes in echolocation or alarm calls in primates. However, naming sounds according to behavior can be misleading because a sound can be associated with several contexts. Names based on the associated behavior should really only be used after detailed studies of context-specificity of the calls in question.

8.2.3 Naming Sounds Based on Mechanism of Sound Production

Some bioacousticians identify and classify sounds based on the mechanism of sound production. For example, one syllable in insect song corresponds to a single to-and-fro movement of the stridulatory apparatus, or one cycle of forewing opening and closing in the field cricket (Gryllus spp.). McLister et al. (1995) defined a note in chorusing frogs as the sound unit produced during a single expiration. Classifying sound types by their mode of production is perhaps less ambiguous, but there are limited data on the mechanisms of sound production in many animals.

8.2.4 Naming Sounds Based on Spectro-Temporal Features

An alternative, but not necessarily better, way of naming sounds is based on their spectro-temporal features. For instance, in distinguishing two morphologically similar species of bats, Myotis californicus is referred to as a “50-kHz bat” and M. ciliolabrum as a “40-kHz bat,” which describes the terminal frequency of the downsweep of their ultrasonic echolocation signals (Gannon et al. 2001). Under water, the most common sound recorded from southern right whales (Eubalaena australis) is a 1–2 s frequency-modulated (FM) upsweep from about 50 to 200 Hz, commonly recorded with overtones, and referred to in the literature as the upcall (Fig. 8.2; Clark 1982). Antarctic blue whales (Balaenoptera musculus intermedia) produce a Z-call, which consists of a 10-s constant-frequency (also called continuous-wave, CW) sound at 28 Hz, followed by a rapid FM downsweep to 18 Hz, where the sound continues for another 15-s CW component (Rankin et al. 2005).

Fig. 8.2
figure 2

Spectrograms of southern right whale “upcall” (left; sampling frequency fs = 12 kHz, Fourier window length NFFT = 1200, 50% overlap, Hann window) and Antarctic blue whale “Z-call” (right; fs = 6 kHz, NFFT = 16384, 50% overlap, Hann window) recorded off southern Australia (Erbe et al. 2017)

While the measurement of features from spectrograms and waveforms can be expected to be more objective than onomatopoeic or functional naming, the appearance of a spectrogram, and thus the measurements made, depend on the characteristics of the recording system, the time and frequency settings of the analysis, and the analysis algorithm used. This can make the same sound look rather different at different scales and therefore lead to inconsistent classification.

An example of the confusion that can arise from different representations of sound is the boing sound made by minke whales (Balaenoptera acutorostrata), which was given an onomatopoeic name. In spectrograms, the boing might look like an FM sound (Fig. 8.3a); however, it is actually a series of rapid pulses (Rankin and Barlow 2005), similar to burst-pulse sounds produced by odontocetes (e.g., Wellard et al. 2015). As another example, the bioduck sound made by Antarctic minke whales (Balaenoptera bonaerensis) got its name because it resembles a duck’s quack to human listeners (Risch et al. 2014). A spectrogram of the bioduck sound appears as a series of pulses; however, each pulse actually is a 0.3-s FM tone sweeping down from 300 to 100 Hz (Fig. 8.3b). As if this were not enough in terms of interesting sounds and odd names, dwarf minke whales produce the so-called star-wars sound, which is composed of a series of pulses with varying pulse rates (Gedamke et al. 2001). The different pulse rates make this sound appear as a mixture of broadband pulses and FM sounds in spectrograms, depending on the spectrogram settings (Fig. 8.3c). The sound’s name presumes the reader is familiar with the soundtrack of an American movie from the 1970s.

Fig. 8.3
figure 3

Spectrograms of the dwarf minke whale boing (a fs = 16 kHz, NFFT = 1024, 50% overlap, Hann window), the Antarctic minke whale bioduck sound (b fs = 96 kHz, NFFT = 8192, 50% overlap, Hann window), and the dwarf minke whale star-wars sound (c fs = 44 kHz, NFFT = 4096, 50% overlap, Hann window). Recordings a and b from Erbe et al. (2017), c from Gedamke et al. (2001)

8.2.5 Naming Sounds Based on Human Communication Patterns

The term “song” is perhaps the best-known example of using human communication labels in the description of animal sounds. The word “song” may be used to simply indicate long-duration displays of a specific structure. Songs of insects and frogs are relatively simple sequences, consisting of the same sound repeated over long periods of time. The New River tree frog (Trachycephalus hadroceps), for example, produces nearly 38,000 calls in a single night (Starnberger et al. 2014). Many frogs use trilling notes in mate attraction, which has been described as song, but switch to a different vocal pattern in aggressive territorial displays (Wells 2007). In some frog songs, different notes serve different purposes, with one type of note warding off competing males, and another attracting females. In birds and mammals, songs are often more complex, consisting of several successive sounds in a recognizable pattern. They appear to be used primarily for territorial defense or mate attraction (Bradbury and Vehrencamp 2011). Our statements in this chapter show one way to describe calls and songs in animals; however, it is important to note that borrowing terminology from human communication when studying animals can lead to confusion. The terms we discuss here are not well defined and are used differently by different authors. Make sure to pay close attention to these definitions when reading literature about animal communication.

Some ornithologists have used human-language properties further to describe the structure of bird song. Song may be broken down into phrases (also called motifs). Each phrase is composed of syllables, which consist of notes (or elements, the smallest building blocks; Catchpole and Slater 2008). Notes, syllables, and phrases are identified and defined based on their repeated occurrence. An entire taxon of birds (songbirds, Order Passeriformes) has been designated by ornithologists because of their use of these elaborate sounds for territorial defense and/or mate attraction. Birds of this taxon usually use sets of sounds that are repeated in an organized structure. In many species, males produce such songs continuously for several hours each day, producing thousands of songs in each performance. In the bird song literature, songs are distinguished from calls by their more complex and sustained nature, species-typical patterns, or syntax that governs their combination of syllables and notes into a song. Songs are under the influence of reproductive hormones and associated with courtship (Bradbury and Vehrencamp 2011). Bird song can vary geographically and over time (e.g., Fig. 8.4; Camacho-Alpizar et al. 2018). In contrast, calls are typically acoustically simple and serve non-reproductive, maintenance functions, such as coordination of parental duties, foraging, responding to threats of predation, or keeping members of a group in contact (Marler 2004).

Fig. 8.4
figure 4

Geographic variation in birdsong. These spectrograms show a portion of song from Timberline wrens (Thryorchilus browni) recorded at four locations in Costa Rica (CBV = Cerro Buena Vista, CV = Cerro Vueltas, CCH = Cerro Chirripó, IV = Irazú Volcano) (Camacho-Alpizar et al. 2018). © Camacho-Alpizar et al.; https://doi.org/10.1371/journal.pone.0209508. Licensed under CC BY 4.0; https://creativecommons.org/licenses/by/4.0/

Several terrestrial mammals have been reported to sing. For instance, adult male rock hyraxes (Procavia capensis) engage throughout most of the year in rich and complex vocalization behavior that is termed singing (Koren et al. 2008). These songs are complex signals and are composed of multiple elements (chucks, snorts, squeaks, tweets, and wails) that encode the identity, age, body mass, size, social rank, and hormonal status of the singer (Koren and Geffen 2009, 2011). Holy and Guo (2005) described ultrasonic sounds from male laboratory mice (Mus musculus) as song. Von Muggenthaler et al. (2003) reported that Sumatran rhinoceros (Dicerorhinus sumatrensis) produce a song composed of three sound types: eeps (simple short signals, 70 Hz–4 kHz), humpback whale like sounds (100 Hz–3.2 kHz, varying in length, only produced by females), and whistle blows (loud, 17 Hz–8 kHz vocalizations followed by a burst of air with strong infrasonic content). Clarke et al. (2006) described the syntax and meaning of wild white-handed gibbon (Hylobates lar) songs.

Among marine mammals, blue, bowhead (Balaena mysticetus), fin, humpback, minke, and right whales, Weddell seals (Leptonychotes weddellii), harbor seals (Phoca vitulina), and walrus (Odobenus rosmarus) have all been reported to sing (Payne and Payne 1985; Sjare et al. 2003; McDonald et al. 2006; Stafford et al. 2008; Oleson et al. 2014; Crance et al. 2019). The songs of blue, bowhead, fin, minke, and right whales are simple compared to those of the humpback whale and little is known about the behavioral context of song in any marine mammal species besides the humpback whale. Humpback whales are well-known for their long, elaborate songs. These songs are composed of themes consisting of repetitions of phrases made up of patterns of units similar to syllables in bird song (Fig. 8.5; Payne and Payne 1985; Helweg et al. 1998). Winn and Winn (1978) suggested that only male baleen whales sing, as a means of reproductive display. Sjare et al. (2003) reported that Atlantic walrus produce two main songs: the coda song and the diving vocalization song that differ by their pattern of knocks, taps, and bell sounds.

Fig. 8.5
figure 5

Spectrogram of the song structure of humpback whales, with sounds organized by theme, phrases, and units (Garland et al. 2017). © Acoustical Society of America, 2017. All rights reserved

Song production does not exclude the emission of non-song sounds and most singing species likely emit both. The non-song sounds of humpback and pygmy blue whales (Balaenoptera musculus brevicauda), for example, have been cataloged (e.g., Recalde-Salas et al. 2014, 2020). Some song units may resemble non-song sounds.

Whether sounds are part of song or not, their detection and classification can be challenging when repertoires are large and possibly variable across time and space. Humpback whale songs, for example, vary by region and year (Cerchio et al. 2001; Payne and Payne 1985). Characterizing and describing the structure of song can be a difficult task even for the experienced bioacoustician. With the assistance of computer analysis tools, sound detection and classification may be more efficient.

8.3 Detection of Animal Sounds

The problem to be solved may seem simple. For example, a bioacoustician deployed an autonomous recorder in the field for a month, downloaded all of the data in the laboratory after recovering the gear, and now wants to find every recorded call of a frog species in order to study its mating behavior. Listening to the first few minutes of the recording, the bioacoustician can easily hear the target species, but there are calls every few seconds—too many to pick by hand. So, the scientist looks for software tools to help detect all frog signals, and potentially sort them based on their acoustic features. The first step, signal detection, is discussed in Sect. 8.3; the second step, signal classification, is discussed in Sect. 8.4.

Automated signal detectors work by common principles. The raw input data are the (ideally calibrated) time series of pressure recorded with a microphone in air or a hydrophone in water. There might be one or more pre-processing steps to filter or Fourier transform the data in successive time windows (see Chap. 4). The pre-processed time series is then fed into the detector, which computes a specific quantity from the acoustic data. This may be instantaneous energy, energy within a specified time window, entropy, or a correlation coefficient, as a few examples. Then, a detection threshold is applied. If the quantity exceeds the threshold, the signal is deemed present; otherwise, it is deemed absent.

The threshold is commonly computed the following way:

$$ E_{\mathrm{th}} = \overline{E} + \gamma\, \sigma_E $$

where E symbolizes the chosen quantity (e.g., energy), \( \overline{E} \) is its mean value computed over a long time window (e.g., an entire file), σE is the standard deviation over the same window, and γ is a multiplier (integer or real). Setting a high threshold will result in only the strongest signals being detected and weaker ones being missed. Setting a low threshold will result in many false alarms (i.e., detections that are not actually signals of interest). By varying γ, the ideal threshold may be found and the performance of the detector may be assessed (see Sect. 8.3.6).
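To make the thresholding step concrete, the following minimal Python sketch applies this rule to a per-frame detection quantity; the function and the array name `metric` are illustrative only and not taken from any particular software package:

```python
import numpy as np

def adaptive_threshold(metric, gamma=3.0):
    """Compute E_th = mean + gamma * std over a long window (e.g., a whole file)
    and flag every frame whose detection quantity exceeds it."""
    metric = np.asarray(metric, dtype=float)
    e_mean = metric.mean()              # E-bar: mean of the chosen quantity
    e_std = metric.std()                # sigma_E: its standard deviation
    e_th = e_mean + gamma * e_std       # detection threshold
    detections = metric > e_th          # True wherever a signal is deemed present
    return e_th, detections
```

Sweeping γ over a range of values and tallying correct detections and false alarms at each setting yields the performance measures discussed in Sect. 8.3.6.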

8.3.1 Energy Threshold Detector

One of the most common methods for detecting animal sounds from recordings is to measure the energy, or amplitude, of the incoming signal in a specified frequency band and to determine whether it exceeds a user-defined threshold. If the threshold within the frequency band is exceeded, the sound is scored as being present. The threshold value typically is set relative to the ambient noise in the frequency band of interest (e.g., Mellinger 2008; Ou et al. 2012). A simple energy threshold detector does not perform well when signals have low signal-to-noise ratio (SNR) or when sounds overlap. A number of techniques have been devised to overcome these problems, including spectrogram equalization (e.g., Esfahanian et al. 2017) to reduce background noise, time-varying (adaptive) detection thresholds (e.g., Morrissey et al. 2006), and using concurrent, but different, detection thresholds for different frequency bands (e.g., Brandes 2008; Ward et al. 2008). Apart from finding individual animal sounds, energy threshold detectors also have been successfully applied to the detection of animal choruses, such as those produced by spawning fish, migrating whales (Erbe et al. 2015), and chorusing insects or amphibians. These choruses are composed of many sounds from large and often distant groups of animals and so individual signals often are not detectable in them. Choruses can last for hours and significantly raise ambient levels in a species-specific frequency band (Fig. 8.6).

Fig. 8.6
figure 6

Spectrogram showing three weeks of choruses by fish, fin whales, and blue whales in the Perth Canyon, Australia (modified from Erbe et al. 2015). Fish raised ambient levels by 20 dB in the 1800–2500 Hz band every night. Fin whales raised ambient levels by 20 dB in the 15–35 Hz band over two days. Antarctic blue whales were the cause of ongoing tones at 18 and 28 Hz for weeks at a time. Colors represent power spectral density (PSD). Black arrows point to strong noise from passing ships. © Erbe et al.; https://doi.org/10.1016/j.pocean.2015.05.015. Licensed under CC BY 4.0; https://creativecommons.org/licenses/by/4.0/
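As an illustration of the basic principle (not of any particular published detector), the sketch below band-pass filters a recording and flags frames whose in-band energy exceeds the mean-plus-γ-standard-deviations threshold introduced above; the frame length, filter order, and γ are arbitrary choices:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_energy_detector(x, fs, f_low, f_high, frame_s=0.1, gamma=3.0):
    """Flag frames whose energy in the band [f_low, f_high] Hz exceeds
    mean + gamma * std of the frame energies. A sketch only: practical detectors
    often add spectrogram equalization or time-varying (adaptive) thresholds."""
    sos = butter(4, [f_low, f_high], btype='bandpass', fs=fs, output='sos')
    xb = sosfiltfilt(sos, x)                        # band-limited pressure time series
    n = int(frame_s * fs)                           # samples per analysis frame
    frames = xb[: len(xb) // n * n].reshape(-1, n)
    energy = np.sum(frames ** 2, axis=1)            # energy per frame
    threshold = energy.mean() + gamma * energy.std()
    return np.flatnonzero(energy > threshold) * frame_s   # detection times (s)
```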

8.3.2 Spectrogram Cross-Correlation

Spectrogram cross-correlation is a well-known technique to detect sounds produced by many species, such as rockfish (genus Sebastes; Širović et al. 2009), African elephants (Loxodonta africana; Venter and Hanekom 2010), maned wolves (Chrysocyon brachyurus; Rocha et al. 2015), minke whales (Oswald et al. 2011), and sei whales (Balaenoptera borealis; Baumgartner and Fratantoni 2008). In this method, spectrograms of reference sounds from the species of interest are converted into reference coefficients, or kernels, with one kernel for each sound type (Fig. 8.7). These reference kernels then are cross-correlated with the incoming spectrogram on a frame-by-frame basis. Kernels can be a statistical representation of spectrograms of known sound types, or they can be created empirically by trial-and-error from previously analyzed recordings.

Fig. 8.7
figure 7

Spectrogram of the kernel for Omura’s whales’ (Balaenoptera omurai) doublet calls, computed as the average of over 800 hand-picked calls (Madhusudhana et al. 2020)

Proper selection of reference signals is critical to the performance of the detector and thus this method is only suited for detection of stereotypical sounds. Seasonal and annual variability in call structure can significantly impact performance of these detectors and so an analysis of the variability in call structure is vital when applying spectrogram cross-correlation to detect calls in long-term datasets (Širović 2016). Another drawback to this method is that it can be prohibitively processor-intensive. To speed up the calculations, Harland (2008) first employed an energy threshold detector (as described above) to detect times of potential signal presence and then used spectrogram cross-correlation to detect individual signals within the flagged time periods.
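The frame-by-frame correlation idea can be sketched as follows, assuming the kernel is itself a small spectrogram excerpt computed with the same sampling frequency, NFFT, and overlap as the incoming data so that its frequency bins match; this is a simplified illustration rather than a full implementation:

```python
import numpy as np
from scipy.signal import spectrogram

def specgram_xcorr(x, fs, kernel, nfft=1024, overlap=0.5):
    """Slide a reference kernel (2-D array: frequency bins x time frames) across
    the spectrogram of a recording; return one correlation score per time offset."""
    f, t, S = spectrogram(x, fs=fs, nperseg=nfft, noverlap=int(nfft * overlap))
    S = 10 * np.log10(S + 1e-12)                   # work on the dB spectrogram
    k = (kernel - kernel.mean()) / kernel.std()    # zero-mean, unit-variance kernel
    n_frames = kernel.shape[1]
    scores = np.empty(S.shape[1] - n_frames + 1)
    for i in range(scores.size):
        w = S[:, i:i + n_frames]                   # spectrogram window under the kernel
        w = (w - w.mean()) / (w.std() + 1e-12)
        scores[i] = np.mean(k * w)                 # normalized cross-correlation coefficient
    return t[:scores.size], scores                 # threshold the scores to detect calls
```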

8.3.3 Matched Filter

The matched filter approach is similar to spectrogram cross-correlation but is performed in the time domain. This means that the waveforms (i.e., sound pressure as a function of time) are correlated instead of the spectrograms. A kernel of the waveform of the sound to be detected is produced, often empirically from a high-quality recording, and then cross-correlated with the incoming signal (i.e., the time series of sound pressure). Matched filters are efficient at detecting signals in Gaussian white noise, but colored noise (typical of many natural environments) poses more of a problem. As with spectrogram cross-correlation, the selection of kernels is critical to the performance of the detector. Matched filters are only appropriate for detection of well-known, stereotyped acoustic signals, such as sounds produced by cane toads (Dang et al. 2008), blue whales (Stafford et al. 1998; Bouffaut et al. 2018), and beaked whales (Hamilton and Cleary 2010). Their performance suffers in the presence of even a small amount of variation between the sound and the kernel.
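A minimal time-domain sketch of matched filtering, assuming `kernel` holds a high-quality example waveform of the target sound, might look as follows:

```python
import numpy as np
from scipy.signal import correlate

def matched_filter(x, kernel, fs, gamma=5.0):
    """Cross-correlate a known waveform (kernel) with a recording and flag samples
    whose correlation magnitude exceeds mean + gamma * std."""
    k = kernel / np.sqrt(np.sum(kernel ** 2))      # unit-energy template
    score = np.abs(correlate(x, k, mode='same'))   # correlation magnitude per sample
    threshold = score.mean() + gamma * score.std()
    det_samples = np.flatnonzero(score > threshold)
    return det_samples / fs                        # detection times in seconds
```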

8.3.4 Spectral Entropy Detector

In general, entropy measures the disorder or uncertainty of a system. Applied to communication theory, the information entropy (also called Shannon entropy; Shannon and Weaver 1998) measures the amount of information contained in a data stream. Entropy is computed as the negative sum, over all possible states, of the probability of each state multiplied by the logarithm of that probability. Therefore, a strongly peaked probability distribution has low entropy, while a broad probability distribution has high entropy. If applied to an acoustic power spectral density distribution, entropy measures the peakedness of the power spectrum and so can detect narrowband signals in broadband noise (Fig. 8.8). Spectral entropy has successfully been applied to animal sounds; for example, from birds, beluga whales (Delphinapterus leucas), bowhead whales, and walruses (Erbe and King 2008; Mellinger and Bradbury 2007; Valente et al. 2007).

Fig. 8.8
figure 8

Spectrogram of marine mammal tonal sounds with negative entropy (black curve) overlain. Negative entropy is high when the power spectral density is concentrated in a few narrow frequency bands (Erbe and King 2008)
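A simple spectral entropy detector can be sketched as follows; as in Fig. 8.8, tonal sounds are indicated wherever the negative entropy rises above a threshold:

```python
import numpy as np
from scipy.signal import spectrogram

def spectral_entropy(x, fs, nfft=1024):
    """Per-frame Shannon entropy of the normalized power spectral density.
    Narrowband (tonal) signals concentrate power in few bins and yield low entropy."""
    f, t, S = spectrogram(x, fs=fs, nperseg=nfft, noverlap=nfft // 2)
    p = S / (S.sum(axis=0, keepdims=True) + 1e-12)   # PSD as a probability distribution per frame
    H = -np.sum(p * np.log2(p + 1e-12), axis=0)      # entropy in bits, one value per time frame
    return t, H      # detect tonal sounds where -H (negative entropy) exceeds a threshold
```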

8.3.5 Teager–Kaiser Energy Operator

The Teager–Kaiser energy operator (TKEO) is a nonlinear operator that tracks the energy of a data stream (Fig. 8.9). Operating on a time series, the TKEO at each sample computes the square of that sample and subtracts the product of the previous and the next sample. The output is therefore high for brief, impulsive signals. The TKEO has successfully been applied to the detection of clicks, such as bat or odontocete biosonar sounds (Kandia and Stylianou 2006; Klinck and Mellinger 2011). Many biosonar signals are of Gabor type (i.e., a sinusoid modulated by a Gaussian envelope). The TKEO output of such a signal is a simple Gaussian, which can be detected with simple tools, such as energy threshold detection or matched filtering (Madhusudhana et al. 2015).

Fig. 8.9
figure 9

Waveforms of odontocete clicks and their Gabor fit (top) and TKEO outputs and Gaussian fit (bottom) (Madhusudhana et al. 2015)
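The operator itself is only a few lines of code; a sketch is given below, with its output suitable as input to an energy threshold detector or a matched filter as described above:

```python
import numpy as np

def tkeo(x):
    """Teager-Kaiser energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].
    The output is large for short, impulsive, high-frequency signals such as clicks."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]      # pad the two edge samples
    return psi
```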

8.3.6 Evaluating the Performance of Automated Detectors

Automated detectors can produce two types of errors: missed detections (i.e., missing a sound that exists) and false alarms (i.e., incorrectly reporting a sound that does not exist or reporting a sound that is not the target signal). There is an inevitable trade-off when choosing the acceptable rate of each. Most detectors allow the user to adjust a threshold, and depending on where this threshold is set, the probability of one type of error increases while the other decreases. The acceptability of either type of error is determined by the particular application of the detector. For example, for rare animals in critical habitats, detecting every sound, even those that are very faint, is desired. In this situation, a low threshold can be chosen that minimizes the number of missed detections; however, this can result in many false alarms. Quantification of these two errors is a useful way to evaluate the performance of an automated detector.

8.3.6.1 Confusion Matrices

One of the simplest and most common methods for conveying the performance of a detector (or a classifier) is a confusion matrix (i.e., a type of contingency table). A confusion matrix (Fig. 8.10) gives the number of true positives (i.e., correctly classified sounds, also called correct detections), false positives (i.e., false alarms), true negatives (i.e., correct rejections), and false negatives (i.e., missed detections).

Fig. 8.10
figure 10

Confusion matrix showing the possible outcomes of a detector when a signal is present versus absent
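Given per-sampling-unit ground truth and detector output, the four cells of the confusion matrix can be tallied as in the following sketch (the boolean arrays and names are illustrative):

```python
import numpy as np

def confusion_counts(truth, detected):
    """Tally the four confusion-matrix cells from ground truth and detector output,
    both given as boolean arrays over the same discrete sampling units."""
    truth, detected = np.asarray(truth, bool), np.asarray(detected, bool)
    tp = np.sum(detected & truth)       # true positives: correct detections
    fp = np.sum(detected & ~truth)      # false positives: false alarms
    fn = np.sum(~detected & truth)      # false negatives: missed detections
    tn = np.sum(~detected & ~truth)     # true negatives: correct rejections
    return tp, fp, fn, tn
```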

8.3.6.2 Receiver Operating Characteristic (ROC) Curve

The performance of detectors can be visualized using the receiver operating characteristic (ROC) curve. A ROC curve is a graph that depicts the trade-off between true positives and false positives (Egan 1975; Swets et al. 2000). The false positive rate (i.e., FP/(FP+TN)) is plotted on the x-axis, while the true positive rate (i.e., TP/(TP+FN)) is plotted on the y-axis (Fig. 8.11). A curve is generated by plotting these values for the detector at different threshold values. The point (0, 1) in the upper left-hand corner of the graph represents perfect performance: 100% true positives and no false positives.

Fig. 8.11
figure 11

(a) Generalized receiver operating characteristic (ROC) plot, in which the probability of true positives is plotted against the probability of false positives. Areas in this graph that correspond to a liberal bias, conservative bias, and deliberate mistakes are indicated. (b) Example ROC curves computed during the development of automated detectors for marine mammal calls in the Arctic. The spectral entropy detector outperformed others (Erbe and King 2008)

The major diagonal in Fig. 8.11a represents performance at chance, where the probabilities of TP and FP are equal. Responses falling below the line would indicate deliberate mistakes. The minor diagonal represents neutral bias, and splits responses into conservative versus liberal. A conservative response strategy yields decreased correct detection and false alarm probabilities; a liberal response strategy yields increased correct detection and false alarm probabilities. An example ROC curve is given in Fig. 8.11b, comparing the performances of three detectors (operating on underwater acoustic recordings from the Arctic and trying to detect marine mammal calls) based on: (1) spectral entropy, (2) bandpassed energy, and (3) waveform (i.e., broadband) energy. The performance of the entropy detector surpassed that of the other two.
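A ROC curve for any score-based detector can be traced by sweeping the detection threshold, as in this sketch (the score array and ground-truth labels are assumed to come from a manually annotated test dataset):

```python
import numpy as np

def roc_curve(scores, truth, n_thresholds=100):
    """Sweep the detection threshold over per-bin detector scores and return the
    false positive rate and true positive rate at each threshold."""
    scores = np.asarray(scores, float)
    truth = np.asarray(truth, bool)
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    fpr, tpr = [], []
    for th in thresholds:
        detected = scores > th
        tp = np.sum(detected & truth)
        fp = np.sum(detected & ~truth)
        fn = np.sum(~detected & truth)
        tn = np.sum(~detected & ~truth)
        tpr.append(tp / (tp + fn) if tp + fn else 0.0)
        fpr.append(fp / (fp + tn) if fp + tn else 0.0)
    return np.array(fpr), np.array(tpr)    # plot tpr against fpr to obtain the ROC curve
```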

8.3.6.3 Precision and Recall

The performance of a detector can be over-estimated using a ROC curve when there is a large difference between the numbers of TPs and TNs. In addition, estimation of the number of TNs requires discrete sampling units. The duration of the discrete sampling units is often somewhat arbitrary and can lead to unrealistic differences between the numbers of TPs and TNs. In these situations, precision and recall (P-R) can provide a more accurate representation of detector performance because this representation does not rely on determining the number of true negatives (Davis and Goadrich 2006). In the P-R framework, events are scored only as TPs, FPs, and FNs.

Precision is a measure of accuracy and is the proportion of automated detections that are true detections.

$$ \mathrm{Precision}=\frac{TP}{TP+ FP} $$

Recall is a measure of completeness and is the proportion of true events that are detected. This is the same as the true positive rate defined in the ROC framework.

$$ \mathrm{Recall}=\frac{TP}{TP+ FN} $$

Detectors can be evaluated by plotting precision against recall (Fig. 8.12). An ideal detector would have both scores approaching a value of 1. In other words, the curve would approach the upper right-hand corner of the graph (Davis and Goadrich 2006). Precision and recall also can be combined into an F-score, the harmonic mean of the two values. The F-score can be weighted to emphasize either precision or recall when optimizing detector performance (Jacobson et al. 2013).

Fig. 8.12
figure 12

Precision-Recall curves for three types of detectors: (1) spectrogram cross-correlation, (2) blob detection, and (3) spectral entropy for Omura’s whale calls (Madhusudhana et al. 2020)
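Precision, recall, and a weighted F-score follow directly from the confusion-matrix counts; the sketch below uses the general F-beta form, of which the familiar F1-score (beta = 1, the harmonic mean of precision and recall) is a special case:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and the F-score computed from detection counts.
    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f
```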

8.4 Quantitative Classification of Animal Sounds

Quantitative classification of animal sounds is based on measured features of sounds, no matter whether these are used to manually or automatically group sounds with the aid of software algorithms. These features can be measured from different representations of sounds, such as waveforms, power spectra, spectrograms, and others. A large variety of classification methods have been applied to animal sounds, many drawing from human speech analysis.

8.4.1 Feature Selection

The acoustic features selected and the consistency with which the measurements are taken have a significant influence on the success (or failure) of a classification algorithm. Feature sets (also called feature vectors) should provide as much information about the sounds as is sensible. With today’s software tools and computing power, an almost limitless number of features can easily be measured, so many that even sounds of the same type could be distinguished from one another. Such over-parameterization can make it difficult to group like sounds, which can be just as important as distinguishing between different sounds. The challenge is to find the trade-off and produce a set of representative features for each sound type. Once the features have been selected, automating their extraction and subsequent analysis reduces the time required to analyze large datasets. Some commonly used feature vectors are described below.

8.4.1.1 Spectrographic Features

Perhaps the most commonly used feature vectors are those consisting of values measured from spectrograms. These measurements include, but are not limited to, frequency variables (e.g., frequency at the beginning of the sound, frequency at the end of the sound, minimum frequency, maximum frequency, frequency of peak energy, bandwidth, and presence/absence of harmonics or sidebands; Fig. 8.13; also see Chap. 4, Sect. 4.2.3), and time variables (e.g., signal duration, phrase and song length, inter-signal intervals, and repetition rate). More complex features, such as those describing the spectrographic shape of a sound (e.g., upsweep, downsweep, chevron, U-loop, inverted U-loop, or warble), slopes, and numbers and relative positions of local extrema and inflection points (places where the contour changes from positive to negative slope or vice versa) also have been used in classification. These measurements often are taken manually from spectrographic displays (e.g., by a technician using a mouse-controlled cursor). Automated techniques for extracting spectrographic measurements can be less subjective and less time-consuming, but are sometimes not as accurate as manual methods. Examples are available in the bird literature (e.g., Tchernichovski et al. 2000), bat literature (Gannon et al. 2004; O’Farrell et al. 1999), and marine mammal literature (e.g., Mellinger et al. 2011; Roch et al. 2011; Gillespie et al. 2013; Kershenbaum et al. 2016). Spectrographic measurements of bat calls, for example, can be extracted using Analook (Titley Scientific, Columbia, MO, USA), SonoBat (Joe Szewczak, Department of Biology, Humboldt State University, Arcata, CA, USA), or Kaleidoscope Pro (Wildlife Acoustics, Inc., Maynard, MA, USA), exported to an Excel spreadsheet (XML, CSV, and other formats), classified using machine learning algorithms, and compared to a reference library for identification.

Fig. 8.13
figure 13

Spectrogram of a pilot whale (Globicephala melas) whistle showing the following features: Start frequency (Start f), End frequency (End f), Maximum frequency (Max f), Minimum frequency (Min f), locations of two local maxima and one local minimum in the fundamental contour, four inflection points (where the curvature changes from clockwise to counter-clockwise, or vice versa), and one overtone (Courts et al. 2020). © Courts et al.; https://www.nature.com/articles/s41598-020-74111-y/figures/5. Licensed under CC BY 4.0; https://creativecommons.org/licenses/by/4.0/
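As a simple illustration, the sketch below measures a few of these spectrographic features from an isolated call; the 20-dB-down criterion used to delimit the call is an arbitrary choice, and dedicated tools use more robust contour-tracking:

```python
import numpy as np
from scipy.signal import spectrogram

def spectrographic_features(x, fs, nfft=1024, db_floor=-20):
    """Measure duration, minimum/maximum frequency, bandwidth, and peak frequency
    from the spectrogram of a single, isolated sound with reasonable SNR."""
    f, t, S = spectrogram(x, fs=fs, nperseg=nfft, noverlap=nfft // 2)
    S_db = 10 * np.log10(S + 1e-12)
    mask = S_db > (S_db.max() + db_floor)            # cells within 20 dB of the peak
    f_cells, t_cells = np.where(mask)
    peak_f = f[np.unravel_index(np.argmax(S_db), S_db.shape)[0]]
    return {
        "duration_s": t[t_cells.max()] - t[t_cells.min()],
        "min_freq_Hz": f[f_cells.min()],
        "max_freq_Hz": f[f_cells.max()],
        "bandwidth_Hz": f[f_cells.max()] - f[f_cells.min()],
        "peak_freq_Hz": peak_f,
    }
```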

8.4.1.2 Cepstral Features

Cepstral coefficients are spectral features of bioacoustic signals commonly used in human speech processing (Davis and Mermelstein 1980). These features are based on the source-filter model of human speech analysis, which has been applied to many different animal species (Fitch 2003). Cepstral coefficients are well-suited for statistical pattern-recognition models because they tend to be uncorrelated (Clemins et al. 2005), which significantly reduces the number of parameters that must be estimated (Picone 1993). Cepstral coefficients are calculated by computing the Fourier transform in successive time windows over the recorded pressure time series of a sound (see Chap. 4). The frequency axis then is warped by multiplying the spectrum with a series of n filter functions at appropriately spaced frequencies. This is done because there is evidence that many animals perceive frequencies on a logarithmic scale, in a similar fashion to humans (Clemins et al. 2005). The output of the frequency band filters is then used as input to a discrete cosine transform, which results in an n-dimensional cepstral feature vector (Picone 1993; Clemins et al. 2005; Roch et al. 2007, 2008).
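In practice, these steps rarely need to be coded from scratch; the sketch below uses the librosa Python package (one of several options) to return one cepstral feature vector per analysis frame:

```python
import librosa  # one widely used option; the same steps can also be coded with numpy/scipy

def cepstral_features(x, fs, n_coeffs=12):
    """Mel-frequency cepstral coefficients: windowed FFT -> mel filter bank ->
    log -> discrete cosine transform, yielding an n_coeffs vector per frame."""
    mfcc = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=n_coeffs)
    return mfcc.T        # rows = time frames, columns = cepstral coefficients
```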

Using cepstral feature space allows the timbre of sounds to be captured, a quality that is lost when extracting parameters from spectrograms. Roch et al. (2007) developed an automated classification system based on cepstral feature vectors extracted for whistles, burst-pulse sounds, and clicks produced by short- and long-beaked common dolphins (Delphinus spp.), Pacific white-sided dolphins (Lagenorhynchus obliquidens), and bottlenose dolphins (Tursiops truncatus). The system did not rely on specific sound types and had no requirement for separating individual sounds. The system performed relatively well, with correct classification scores of 65–75%, depending on the partitioning of the training- and test-data. Cepstral feature vectors also have been used as input to classifiers for many other animal species, including groupers (Epinephelus guttatus, E. striatus, Mycteroperca venenosa, M. bonaci; Ibrahim et al. 2018), frogs (Gingras and Fitch 2013), song birds (Somervuo et al. 2006), African elephants (Zeppelzauer et al. 2015), and beluga, bowhead, gray (Eschrichtius robustus), humpback, and killer (Orcinus orca) whales, and walrus (Mouy et al. 2008). Cepstral features appear to be a promising alternative to the traditional time- and frequency-parameters measured from spectrograms as input to classification algorithms. However, cepstral features are relatively sensitive to the SNR, the signal’s phase, and modeling order (Ghosh et al. 1992).

Noda et al. (2016) used mel-frequency cepstral coefficients and random forest analyses to classify sounds produced by 102 species of fish and compared the performance of three classifiers: k-nearest neighbors, random forest, and support vector machines (SVMs). The mel-frequency cepstrum (or cepstrogram) is a form of acoustic power spectrum (or spectrogram) that is computed as a linear cosine transform of a log-power spectrum that is presented on a nonlinear mel-scale of frequency. The mel-scale resembles the human auditory system better than the linearly-spaced frequency bands of the normal cepstrum. All three classifiers performed similarly, with average classification accuracy ranging between 93% and 95%.

8.4.2 Statistical Classification of Animal Sounds

For some sounds, qualitative classification is sufficient. Janik (1999) reported that humans were able to identify dolphin signature whistles more reliably than computer methods. A problem with qualitative classification of sounds in a repertoire (and taxonomy in general), however, is that some listeners are “splitters” and other listeners are “lumpers.” So, even researchers on the same project could classify an animal’s sound repertoire differently. One way to avoid individual researcher differences in classification is to use graphical, statistical, and computer-automated methods that objectively sort and compare measured variables that describe the sounds. A variety of statistical methods can be employed to classify animal sounds into categories (Frommolt et al. 2007). Below are brief descriptions of some of the statistical methods that are commonly used for classification of animal sounds.

8.4.2.1 Parametric Clustering

Parametric cluster analysis produces a dendrogram (i.e., classification tree) that organizes similar sounds into branches of a tree. A distance matrix also is generated, which gives correlation coefficients between all variables in the dataset. The resulting distance index ranges from 0 (very similar sounds) to 1 (totally dissimilar sounds). The matrix can then be joined by rows or columns to examine relationships. The type of linkage and type of distance measurement can be selected to find the best fit for a particular dataset (Zar 2009).
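A sketch of such an analysis using SciPy is shown below; the choice of distance metric and linkage method (here, correlation distance and average linkage) should be tuned to the dataset:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_sounds(features, n_clusters=5, metric='correlation', method='average'):
    """Hierarchical (agglomerative) clustering of sound feature vectors.
    features: one row per sound, one column per (standardized) measured variable."""
    d = pdist(features, metric=metric)         # pairwise distances between sounds
    Z = linkage(d, method=method)              # linkage matrix defining the tree
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')   # cut into n_clusters groups
    return Z, labels   # pass Z to scipy.cluster.hierarchy.dendrogram to draw the tree
```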

Cluster analysis has been used to classify sound types in several species, including owls (Nagy and Rockwell 2012), mice (Hammerschmidt et al. 2012), rats (Rattus norvegicus, Takahashi et al. 2010), African elephants (Wood et al. 2005), and primates (Hammerschmidt and Fischer 1998). In a study of six populations of the neotropical frog (Proceratophrys moratoi) in Brazil, Forti et al. (2016) measured spectrographic variables from calls produced by males and performed cluster analysis to examine similarities in acoustic traits (based on the Bray–Curtis index of acoustic similarity) across the six locations (Fig. 8.14). Baptista and Gaunt (1997) used hierarchical cluster analysis of correlation coefficients of several acoustic parameters to categorize sounds of the sparkling violet-eared hummingbird (Colibri coruscans), which is found in two neighboring assemblages in their study area. A matrix of sound similarity values obtained from spectral cross-correlation of these birds’ songs indicated similar sound types from the two areas. Yang et al. (2007) used cluster analysis to examine syllable sharing between individuals of Anna’s hummingbird (Calypte anna). They identified 38 syllable types in songs of 44 males, which clustered into five basic syllable categories: “Bzz,” “bzz,” “chur,” “ZWEE,” and “dz!”. Also, microgeographic song variation patterns were found in that nearest neighbors sang more similar songs than non-neighbors. Pozzi et al. (2010) used several acoustic variables to group black lemur (Eulemur macaco macaco) sounds into categories, including the frequencies of the fundamental and of the first three harmonic overtones (measured at the start, middle, and end of each call), and the total duration. The agreement of this analysis with manual classification was high (>88.4%) for six of eight categories.

Fig. 8.14
figure 14

Dendrogram from a hierarchical cluster analysis of the call similarities between 15 male Proceratophrys moratoi from different sites and two other Odontophrynidae species (Forti et al. 2016). © Forti et al.; https://peerj.com/articles/2014/. Licensed under CC BY 4.0; https://creativecommons.org/licenses/by/4.0/

8.4.2.2 Principal Component Analysis

Principal component analysis (PCA) is a multivariate statistical method that examines a set of measurements such as the feature vectors discussed earlier in Sect. 8.4. These features may well be correlated. For example, bandwidth is sometimes correlated with maximum frequency, or the number of inflection points can be correlated with signal duration (Ward et al. 2016). PCA performs an orthogonal transformation that converts the potentially correlated variables (i.e., the features) into a set of linearly uncorrelated variables (i.e., the principal components; Hotelling 1933; Zar 2009). The principal components are linear combinations of the original variables (features). Plotting the principal components against each other shows how the measurements cluster.
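A minimal sketch using scikit-learn (with illustrative variable names) is shown below; standardizing the features first prevents variables with large numeric ranges, such as frequencies in Hz, from dominating the components:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_transform(features, n_components=2):
    """Standardize the (possibly correlated) acoustic features, then project them
    onto orthogonal, uncorrelated principal components for plotting or clustering."""
    z = StandardScaler().fit_transform(features)    # zero mean, unit variance per feature
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(z)                   # principal component scores per sound
    return scores, pca.explained_variance_ratio_    # plot scores against each other
```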

For example, by examining bat biosonar signals in multivariate space, bat species that are very similar in external appearance can be distinguished. Using PCA, Gannon et al. (2001) found ear height and characteristic frequency were correlated, along with duration of the signal (Fig. 8.15).

Fig. 8.15
figure 15

Plot showing the results of principal component analysis, in which two cryptic species of myotis bats (California myotis, Myotis californicus, MYCA, black squares; western small-footed bat, M. ciliolabrum, MYCI, hollow circles) were distinguished by differences in ear height and characteristic frequency of their echolocation signals. Plotted is characteristic frequency versus signal duration for these species recorded from field sites in New Mexico and Arizona, USA

As another example, Briefer et al. (2015) categorized emotional states associated with variation in whinnies from 20 domestic horses (Equus ferus) using PCA. They designed four situations to elicit different levels of emotional arousal that were likely to stimulate whinnies: separation (negative situation) and reunion (positive situation) with either all group members (high emotional arousal) or only one group member (moderate emotional arousal). The authors measured 21 acoustic features from whinnies (Fig. 8.16). PCA transformed the feature vectors into six principal components that accounted for 83% of the variance in the original dataset.

Fig. 8.16
figure 16

Spectrograms and oscillograms of horse whinnies in negative (a, c) and positive (b, d) situations emitted by two different horses. Red arrows point to fundamental frequencies (F0, G0) and first overtones (H1). Negative whinnies (a, c) are longer in duration and have higher G0 fundamentals than positive whinnies (b, d Briefer et al. 2015). © Briefer et al.; https://www.nature.com/articles/srep09989/figures/3. Licensed under CC BY 4.0; http://creativecommons.org/licenses/by/4.0/

8.4.2.3 Discriminant Function Analysis

In discriminant function analysis (DFA), canonical discriminant functions are calculated using variables measured from a training dataset. One canonical discriminant function is produced for each sound type in the dataset. Variables measured from sounds in the test dataset are then substituted into each function and each sound type is classified according to the function that produced the highest value. Because DFA is a parametric technique, it is assumed that input data have a multivariate normal distribution with the same covariance matrix (Afifi and Clark 1996; Zar 2009). Violations of these assumptions can create problems with some datasets. One of the main weaknesses of DFA for animal sound classification is that it assumes classes are linearly separable. Because a linear combination of variables takes place in this analysis, the feature space can only be separated in certain, restricted ways that are not appropriate for all animal sounds. Figure 8.17 shows the DFA separation of California chipmunk (genus Neotamias) taxa that are morphologically similar but acoustically different, using six variables measured from their sounds.

Fig. 8.17
figure 17

Plot resulting from discriminant function analysis. Four species of Townsend-group chipmunks (Townsend’s chipmunk, Neotamias townsendii; Siskiyou chipmunk, N. siskiyou; Allen’s chipmunk, N. senex; and yellow-cheeked chipmunk, N. ochrogenys) in northern California, USA, produced discernibly different sounds. Discriminant function 1 was dominated by differences in maximum frequency of the signal and discriminant function 2 was most influenced by temporal features including total signal length and the number of signals emitted by a chipmunk during a signaling bout
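DFA as described above corresponds closely to linear discriminant analysis as implemented in scikit-learn; a minimal sketch (with illustrative names) is:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def discriminant_classify(train_features, train_labels, test_features):
    """Fit linear discriminant functions on a training set of measured variables and
    assign each test sound to the class whose discriminant function scores highest."""
    lda = LinearDiscriminantAnalysis()
    lda.fit(train_features, train_labels)
    predictions = lda.predict(test_features)        # predicted sound class per test sound
    scores = lda.transform(test_features)           # discriminant scores for plotting (cf. Fig. 8.17)
    return predictions, scores
```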

8.4.2.4 Classification Trees

Classification tree analysis is a non-parametric statistical technique that recursively partitions data into groups known as “nodes” through a series of binary splits of the dataset (Clark and Pregibon 1992; Breiman et al. 1984). Each split is based on a value for a single variable and the criteria for making splits are known as primary splitting rules. The goal for each split is to divide the data into two nodes, each as homogeneous as possible. As the tree is grown, results are split into successively purer nodes. This continues until each node contains perfectly homogeneous data (Gillespie and Caillat 2008). Once this maximal tree has been generated, it is pruned by removing nodes and examining the error rates of these smaller trees. The smallest tree with the highest predictive accuracy is the optimal tree (Oswald et al. 2003).
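A sketch of growing and pruning such a tree with scikit-learn is shown below; the cost-complexity pruning parameter plays the role of the pruning step described above, with larger values giving smaller trees:

```python
from sklearn.tree import DecisionTreeClassifier

def fit_pruned_tree(train_features, train_labels, alpha=0.01):
    """Grow a binary classification tree on measured call variables and prune it
    with cost-complexity pruning; each split is a single-variable, true/false rule."""
    tree = DecisionTreeClassifier(ccp_alpha=alpha)
    tree.fit(train_features, train_labels)
    return tree   # tree.predict(new_features) assigns each call to a terminal node's class
```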

Tree-based analysis provides several advantages over some of the other classification techniques. It is a non-parametric technique; therefore, data do not need to be normally distributed as required for other methods, such as DFA. In addition, tree-based analysis is a simple and naturally intuitive way for humans to classify sounds. It is essentially a series of true/false questions, which makes the classification process transparent. This allows easy examination of which variables are most important in the classification process. Tree-based analysis also accommodates a high degree of diversity within classes. For example, if a species produces two or more distinct sound types, a tree-based analysis can create two different nodes. In other classification techniques, different sound types within a species simply act to increase variability and make classification more difficult. Finally, surrogate splitters are provided at each node (Oswald et al. 2003). Surrogate splitters closely follow primary splitting rules and can be used in cases when the primary splitting variable is missing. Therefore, sounds can be classified even if data for some variables are missing due to noise or other factors.

To address some controversy as to whether closely related species of myotis bats could be differentiated by their sounds, Gannon et al. (2004) completed an analysis of echolocation pulses from free-flying, wild bats. Fig. 8.18 is a classification tree grown from nearly 1400 calls using at least seven variables measured from each call. The tree produced terminal nodes identified to species (MYVO is Myotis volans, MYCA M. californicus, etc.). In this study, recordings were made under field conditions where sounds were affected by the environment, Doppler shift, and diversity of equipment. Still, classification trees worked well to predict group membership and additional techniques, such as DFA, were able to distinguish five Myotis species acoustically with greater than 75% accuracy (greater than 90% in most instances).

Fig. 8.18
figure 18

Classification tree grown using Splus computer software (version S-PLUS 6.2 2003, TIBCO Software Inc., Palo Alto, CA, USA) from 1369 bat calls. The pruned tree used variables measured from each bat call: duration (DUR), minimum frequency (Fmin), characteristic frequency (Fc; i.e., frequency at the flattest part of the call), frequency at the “knee” of the call (Fk), time of Fc, time at Fk, and slope (S1). Along the tangents between boxes are values for variables used to split the nodes (for instance, Fmin is minimum frequency). The fraction below each box is the misclassification rate (e.g., 1/5 = 20% misclassification rate). The tree has 12 terminal nodes defining the branches, resulting in a classification designation for each species (Gannon et al. 2004)

Classification trees have been applied to marine mammal sounds by several researchers, with promising results. Fristrup and Watkins (1993) used tree-based analysis to classify the sounds of 53 species of marine mammal (including mysticetes, odontocetes, pinnipeds, and manatees). Their correct classification score of 66% was 16% higher than the score obtained when applying DFA to the same dataset. The whistles of nine delphinid species were correctly classified 53% of the time by Oswald et al. (2003) using tree-based analysis. Oswald et al. (2007) subsequently applied classification tree analysis to the whistles of seven species and one genus of marine mammal, resulting in a correct classification score of 41%. This score was improved slightly, to 46%, when classification decisions were based on a combination of classification tree and DFA results. Gannier et al. (2010) used classification trees to identify the whistles of five delphinid species recorded in the Mediterranean, with a correct classification score of 63%. Finally, Gillespie and Caillat (2008) classified the clicks of Blainville’s beaked whales (Mesoplodon densirostris), short-finned pilot whales (Globicephala macrorhynchus), and Risso’s dolphins (Grampus griseus). Their tree-based analysis classified 80% of clicks to the correct species.

8.4.2.5 Nonlinear Dimensionality Reduction

Clustering techniques described above require that certain features or measurements, as appropriate for the problem domain, be available beforehand. They are gathered from sound recordings either manually (e.g., number of inflection points in whistle contours, number of harmonics), using signal processing tools (e.g., peak frequency, energy), or both. Manual extraction of features is usually time-consuming and often inefficient, especially when dealing with recordings covering large spatial and temporal scales. Automated extraction of measurements improves efficiency and eliminates the risk of human biases. However, when recordings contain many confounding sounds or have extreme noise variations, the reliability and accuracy of the measurements can become questionable and can adversely affect clustering outcomes. Regardless of whether manual or automated approaches are employed, the resulting limited set of chosen features or measurements is essentially a representation of the underlying data in a reduced space. Such dimensionality reduction is typically aimed at making the downstream task of clustering (with PCA, DFA, etc.) computationally tractable.

In recent years, nonlinear dimensionality reduction methods have gained widespread popularity, specifically in applications for exploring and visualizing very high-dimensional data. Originally popular for processing image-like data in the field of machine learning, these methods bring about dimensionality reduction without requiring one to explicitly choose and extract features. The methods can be easily adapted for processing bioacoustic recordings wherein the qualitative cluster structure (i.e., similarities in the visually identifiable information) in spectrogram-like data (e.g., mel-spectrogram or cepstrogram) containing hundreds or thousands of time-frequency points is effectively captured in an equivalent 2- or 3-dimensional space (e.g., Sainburg et al. 2019; Kollmorgen et al. 2020).

One of the earlier methods for capturing nonlinear structure, t-distributed stochastic neighbor embedding (t-SNE; van der Maaten and Hinton 2008), is based on non-convex optimization. It computes a similarity measure between pairs of points (data samples) in the original high-dimensional space and in the reduced space, then minimizes the Kullback–Leibler divergence between the two sets of similarity measures. t-SNE tries to preserve distances within a neighborhood, whereby points close together in the high-dimensional space have a high probability of staying close in the reduced space. The Bird Sounds project (Tan and McDonald 2017) presents an excellent demonstration of using t-SNE for organizing thousands of bird sound spectrograms in a 2-dimensional similarity grid.
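To make the workflow concrete, the following minimal sketch (not taken from any study cited above) shows how flattened spectrogram clips might be embedded in two dimensions with scikit-learn's t-SNE implementation; the array sizes, the random placeholder data, and the perplexity setting are illustrative assumptions only.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical input: 500 clips, each represented by a 64 x 128 spectrogram
# (frequency bins x time frames); real spectrograms would replace the random data.
spectrograms = np.random.rand(500, 64, 128)
X = spectrograms.reshape(len(spectrograms), -1)  # flatten: one row per clip

# Perplexity loosely controls the effective neighborhood size considered by t-SNE.
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(X)
print(embedding.shape)  # (500, 2): one 2-D point per clip, ready for plotting
```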

Some of the shortcomings of t-SNE were addressed in a newer method called uniform manifold approximation and projection (UMAP; McInnes et al. 2018). UMAP is backed by a strong theoretical framework. While effectively capturing local structure like t-SNE, UMAP also offers a better promise of preserving global structure (inter-cluster relationships). UMAP processes data faster and is capable of handling very high-dimensional data. Fig. 8.19 is a demonstration of the use of UMAP for clustering sounds of five species of katydids (Tettigoniidae) from Panamanian rainforest recordings (Madhusudhana et al. 2019). Inputs to UMAP clustering comprised spectrograms (216 × 469 points, height × width) computed from 1-s clips containing katydid call(s). The inputs often contained confounding sounds and varying noise levels. The clustering results, however, demonstrate the utility of UMAP as a quick means to effective clustering. UMAP has also been used, in combination with a pre-trained neural network, for assessing habitat quality and biodiversity variations from soundscape recordings across different ecosystems (Sethi et al. 2020).

Fig. 8.19

Demonstration of clustering katydid sounds using UMAP. Randomly chosen samples of call spectrograms of the five species considered are shown on the left, and clustering outcomes are shown on the right. The clustering activity has successfully captured both inter-species and intra-species variations
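A comparable sketch for UMAP is given below, assuming the third-party umap-learn package; the neighborhood and distance parameters shown are generic illustrative values, not the settings used by Madhusudhana et al. (2019).

```python
import numpy as np
import umap  # provided by the umap-learn package

X = np.random.rand(500, 64 * 128)  # hypothetical flattened spectrogram clips

# n_neighbors balances local versus global structure; min_dist controls how
# tightly points are packed together in the reduced space.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
embedding = reducer.fit_transform(X)  # (500, 2) coordinates for visual clustering
```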

We have presented here two popular methods that are currently trending in this field of research. There are, however, other alternatives available including earlier methods such as isomap (Tenenbaum et al. 2000) and diffusion map (Coifman et al. 2005), newer variants of t-SNE (e.g., Maaten 2014; Linderman et al. 2017), and some modern variants of variational autoencoders (Kingma and Welling 2013).

8.4.3 Model-Based Classification

8.4.3.1 Artificial Neural Networks

Artificial neural networks (ANNs) were developed by modeling biological systems of information-processing (Rosenblatt 1958) and became very popular in the areas of word recognition in human speech studies (e.g., Waibel et al. 1989; Gemello and Mana 1991) and character or image-recognition (e.g., Fukushima and Wake 1990; Van Allen et al. 1990; Belliustin et al. 1991) in the 1980s. Since that time, ANNs have been used successfully to classify a number of complex signal types, including quail crows (Coturnix spp., Deregnaucourt et al. 2001), alarm sounds of Gunnison’s prairie dogs (Cynomys gunnisoni, Placer and Slobodchikoff 2000), stress sounds by domestic pigs (Sus scrofa domesticus, Schon et al. 2001), and dolphin echolocation clicks (Roitblat et al. 1989; Au and Nachtigall 1995).

In their primitive forms, there are 20 or more basic architectures of ANNs (see Lippman 1989 for a review). Each ANN approach involves trade-offs in computer memory and computation requirements, training complexity, and time and ease of implementation and adaptation (Lippman 1989). The choice of ANN depends on the type of problem to be solved, the size and complexity of the dataset, and the computational resources available. All ANNs are composed of units called neurons and connections among them. They typically consist of three or more neuron layers: one input layer, one output layer, and one or more hidden layers (Fig. 8.20). The input layer consists of n neurons that code for the n features in the feature vector representing the signal (X1, ..., Xn). The output layer consists of k neurons representing the k classes. The number of hidden layers between the input and output layers, as well as the number of neurons per layer, is chosen empirically by the researcher. Each connection among neurons in the network is associated with a weight, which is modified over successive iterations during the training of the network.

Fig. 8.20

Diagram of the structure of an artificial neural network
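As a minimal illustration of the layered structure described above (and not a reproduction of any of the cited networks), the sketch below uses scikit-learn's MLPClassifier with synthetic feature vectors and labels; the layer sizes and the data are assumptions for demonstration only.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

n_features, k_classes = 8, 4
X = np.random.rand(200, n_features)            # feature vectors X1..Xn, one per sound
y = np.random.randint(0, k_classes, size=200)  # class labels (e.g., species)

# One hidden layer of 16 neurons between the n input and k output neurons;
# connection weights are adjusted iteratively during fit().
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
net.fit(X, y)
print(net.predict(X[:5]))                      # predicted class for five sounds
```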

ANNs are promising for automatic signal classification for several reasons. First, the input to an ANN can range from feature vectors of measurements taken from spectrograms or waveforms, to frequency contours, to complete spectrograms. Second, ANNs serve as adaptive classifiers which learn through examples. As a result, it is not necessary to develop a good mathematical model for the underlying signal characteristics before analysis begins (Ghosh et al. 1992). In addition, ANNs are nonlinear estimators that are well-suited for problems involving arbitrary distributions and noisy input (Ghosh et al. 1992; Potter et al. 1994).

Dawson et al. (2006) used artificial neural networks as a means to classify the chick-a-dee-dee-dee call of the black-capped chickadee (Poecile atricapillus), which contains four note types carrying important functional roles in this species. In their study, an ANN first was trained to identify the note type based on several acoustic variables and then correctly classified recordings of the notes with 98% accuracy. The performance of the network was compared with classification using DFA, which also achieved a high level of correct classification (95%). The authors concluded that “there is little reason to prefer one technique over another. Either method would perform extremely well as a note-classification tool in a research laboratory” (Dawson et al. 2006).

Placer and Slobodchikoff (2000) used artificial neural networks to classify alarm sounds of Gunnison’s prairie dogs (Cynomys gunnisoni) to predator species with a classification accuracy of 78.6 to 96.3%. The ANN identified unique signals for four different species of predators: red-tailed hawk (Buteo jamaicensis), domestic dog (Canis familiaris), coyote (Canis latrans), and humans (Homo sapiens).

Deecke et al. (1999) used artificial neural networks to examine dialects in underwater sounds of killer whale pods. The neural network extracted the frequency contours of one sound type shared by nine social groups of killer whales and created a neural network similarity index. Results were compared to the sound similarity judged by three humans in pair-wise classification tasks. Similarity ratings of the neural network mostly agreed with those of the humans, and were significantly correlated with the killer whale group, indicating that the similarity indices were biologically meaningful. According to the authors, “an index based on neural network analysis therefore represents an objective and repeatable means of measuring acoustic similarity, and allows comparison of results across studies, species, and time” (Deecke et al. 1999).

The greater potential of ANNs remained largely untapped for many years, in part due to prevailing limitations in computational capabilities. In the mid-1980s, backpropagation paved the way for efficiently training multi-layer ANNs (Rumelhart et al. 1986). Backpropagation, an algorithm for supervised learning of the weights in an ANN using gradient descent, greatly facilitated the development of deeper networks (having many hidden layers). Many classes of deep neural networks (DNNs; LeCun et al. 2015), such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), became easier to train. While the aforementioned ANN approaches often require hand-picked features or measurements as inputs, DNNs trained with backpropagation demonstrated the ability to learn good internal representations from raw data (i.e., the hidden layers captured non-trivial representations effectively). In their landmark work on using CNNs for the automatic recognition of handwritten digits, LeCun et al. (1989a, b) used backpropagation to learn convolutional kernel coefficients directly from images. Over the past two decades, advances in computing technology, especially the wider availability of graphics processing units (GPUs), have considerably accelerated machine learning (ML) research in many disciplines, such as computer vision, speech processing, natural language processing, and recommendation systems. Shift invariance is an attractive characteristic of CNNs, which makes them suitable for analyzing visual imagery (LeCun et al. 1989a, b, 1998). CNN-based solutions have consistently dominated many of the large-scale visual recognition challenges, and several competing CNN architectures have been developed: AlexNet (Krizhevsky et al. 2017), ResNet (He et al. 2016), DenseNet (Huang et al. 2017), etc. Some of these architectures have become the state-of-the-art in computer vision applications such as face recognition, emotion detection, object extraction, and scene classification, and also in conservation applications (e.g., species identification in camera trap data, land-use monitoring in aerial surveys). Given the image-like nature of time-frequency representations of acoustic signals (e.g., spectrograms), many of the successes of CNNs in computer vision have been replicated in the field of animal bioacoustics.

In contrast to CNNs, RNNs are better suited for processing sequential inputs. RNNs contain internal states (memory) that allow them to “learn” temporal patterns. However, their utility is limited by the “vanishing gradient problem,” wherein the gradients (from the gradient descent algorithm) of the network's output with respect to the weights in the early layers become extremely small. The problem is overcome in modern flavors of RNNs such as long short-term memory (LSTM; Hochreiter and Schmidhuber 1997) networks and gated recurrent unit (GRU; Cho et al. 2014) networks.
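As a concrete illustration of the CNN side of this picture, the following sketch outlines, assuming TensorFlow/Keras is available, how a small CNN maps a spectrogram "image" to one of k classes; the input size, layer widths, and class count are illustrative and do not correspond to any published bioacoustic architecture.

```python
from tensorflow.keras import layers, models

k_classes = 5
model = models.Sequential([
    layers.Input(shape=(128, 256, 1)),             # spectrogram: freq bins x time frames x 1
    layers.Conv2D(16, (3, 3), activation="relu"),  # learned, shift-invariant 2-D kernels
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(k_classes, activation="softmax"), # one output neuron per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_spectrograms, train_labels, ...) would follow with real data
```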

These types of ML solutions are heavily data-driven and often require large quantities of training samples. Typically, the training samples are time-frequency representations (e.g., spectrogram or mel-spectrogram) of short clips of recordings (e.g., Stowell et al. 2016; Shiu et al. 2020). Robustness of the resulting models is improved by ensuring that the inputs adequately cover possible variations of the target signals and of the ambient background conditions. Data scientists employ a variety of data augmentation techniques to overcome data shortage; examples include introducing synthetic variations such as the infusion of Gaussian noise and shifts in time (horizontal shift) or frequency content (vertical shift) (Jaitly and Hinton 2013; Ko et al. 2015; Park et al. 2019). The training process, which involves iteratively lowering a loss function using the backpropagation algorithm, is usually computationally intensive and is often sped up with the use of GPUs.
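A minimal sketch of the augmentation ideas mentioned above is shown below; the noise level and shift ranges are arbitrary, and the wrap-around behavior of np.roll is a simplification of how shifts are usually implemented in practice.

```python
import numpy as np

def augment(spec, rng, max_shift=5, noise_std=0.05):
    """Return a randomly perturbed copy of a 2-D spectrogram (freq x time)."""
    out = spec + rng.normal(0.0, noise_std, size=spec.shape)             # Gaussian noise
    out = np.roll(out, rng.integers(-max_shift, max_shift + 1), axis=1)  # time shift
    out = np.roll(out, rng.integers(-max_shift, max_shift + 1), axis=0)  # frequency shift
    return out

rng = np.random.default_rng(0)
augmented = augment(np.random.rand(128, 256), rng)  # one augmented training sample
```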

DNNs have been used in the automatic recognition of vocalizations of insects (e.g., Madhusudhana et al. 2019), fish (e.g., Malfante et al. 2018), birds (e.g., Stowell et al. 2016; Goëau et al. 2016), bats (e.g., Mac Aodha et al. 2018), marsupials (e.g., Himawan et al. 2018), primates (e.g., Zhang et al. 2018), and marine mammals (e.g., Bergler et al. 2019). CNNs have been used in the recognition of social calls, song calls, and whistles (e.g., Jiang et al. 2019; Thomas et al. 2019). While typical 2-dimensional CNNs have been successfully used in the detection of echolocation clicks (e.g., Bermant et al. 2019), 1-dimensional CNNs (with waveforms as inputs) have been attempted as well (e.g., Luo et al. 2019). CNNs and LSTM networks have been compared in an application for classifying grouper species (Ibrahim et al. 2018), where the authors observed similar performances between the two models. Shiu et al. (2020) combined a CNN with a GRU network for detecting North Atlantic right whale (Eubalaena glacialis) calls. Madhusudhana et al. (2021) incorporated long-term temporal context by combining independently trained CNNs and LSTM networks and achieved notable improvements in recognition performance. An attractive approach for developing recognition models is the use of the transfer learning technique (Torrey and Shavlik 2010), in which components of an already trained model are reused. Typically, the weights of the early layers of a pre-trained network are frozen (no longer trainable) and the model is adapted to the target domain by training only the final layers with data from the target domain. Zhong et al. (2020) used transfer learning to produce a CNN model for classifying the calls of a few species of frogs and birds.
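The sketch below illustrates the freezing step in transfer learning, assuming Keras and an ImageNet-pretrained MobileNetV2 backbone; the backbone choice, input size, and number of target classes are assumptions for illustration, not the setup used by Zhong et al. (2020).

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# Pre-trained feature extractor with its original classification head removed.
base = MobileNetV2(input_shape=(128, 128, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the early, pre-trained layers

# Only the new, final layers are trained on spectrograms from the target domain.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(6, activation="softmax"),  # six hypothetical target classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```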

8.4.3.2 Random Forest Analysis

A random forest is a collection of many (hundreds or thousands of) individual classification trees, which are grown without pruning. Each tree is different from every other tree in the forest because, at each node, the variable to be used as a splitter is chosen from a random subset of the variables (Breiman 2001). Each tree in the forest predicts a category for the sound being classified, and the sound is ultimately assigned to the category predicted by the majority of trees. Random forests are often more accurate than single classification trees because they are robust to over-fitting and stable to small perturbations in the data, correlations between predictor variables, and noisy predictor variables. Random forests perform well on polymorphic categories, such as the variety of flight calls produced by many bird species (e.g., Liaw and Wiener 2002; Cutler et al. 2007; Armitage and Ober 2010; Ross and Allen 2014).

One of the advantages of a random forest analysis is that it provides information on the degree to which each one of the input variables contributes to the final species classification. This information is given by the Gini index and is known as the Gini variable importance. The Gini index is calculated based on the “purity” of each node in each of the classification trees, where purity is a measure of the number of whistles from different species in a given node (Breiman et al. 1984). Smaller Gini indices represent higher purity. When a random forest analysis is run, the algorithm assigns splitting variables so that the Gini index is minimized at each node (Oh et al. 2003). When a forest has been grown, the Gini importance value is calculated for each variable by summing the decreases in Gini index from one node to the next each time the variable is used. Variables are ranked according to their Gini importance values—those with the highest values contribute the most to the random forest model predictions. Random forests also produce a proximity measure, which is the fraction of trees in which particular observations end up in the same terminal nodes. This measure provides information about the similarity of individual observations because similar observations should end up in the same terminal nodes more often than dissimilar observations (Liaw and Wiener 2002).
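The sketch below, which assumes scikit-learn and entirely synthetic whistle measurements, shows the majority-vote prediction and the Gini-based variable importances described above; the feature names and dataset sizes are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["min_freq", "max_freq", "duration", "num_inflections"]
X = np.random.rand(300, len(feature_names))  # one row of measurements per whistle
y = np.random.randint(0, 3, size=300)        # three hypothetical species

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# Each tree votes; predict() returns the majority category per whistle.
print(forest.predict(X[:5]))

# Mean decrease in Gini impurity, accumulated over every split that used the variable.
for name, importance in zip(feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```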

Armitage and Ober (2010) compared the classification performance of random forests, support vector machines (SVMs), artificial neural networks, and DFA for bat echolocation signals and found that, with the exception of DFA, which had the lowest classification accuracy, all classifiers performed similarly. Keen et al. (2014) compared the performance of four classification algorithms (spectrographic cross-correlation, dynamic time-warping, Euclidean distance, and random forest) applied to spectrographic measurements of flight calls from four warbler species. In this study, random forests produced the most accurate results, correctly classifying 68% of calls.

Oswald et al. (2013) compared classifiers generated using DFA versus random forest classifiers for whistles produced by eight delphinid species recorded in the tropical Pacific Ocean and found that random forests resulted in the highest overall correct classification score. Rankin et al. (2016) trained a random forest classifier for five delphinid species in the California Current ecosystem. This classifier used information from whistles, clicks, and burst-pulse sounds and correctly classified 84% of acoustic encounters. Both Oswald et al. (2013) and Rankin et al. (2016) used spectrographic measurements as input variables for their classifiers.

8.4.3.3 Gaussian Mixture Models

Gaussian Mixture Models (GMMs) are used commonly to model arbitrary distributions as linear combinations of parametric (Gaussian) distributions. They are appropriate for species identification when there are no expectations about temporal structure, such as the sequence of sounds (Roch et al. 2007). To create a GMM, a set of n normal distributions with separate means and diagonal covariance matrices is scaled by weight-factors ci (1 ≤ i ≤ n). The sum over all ci must be 1 to ensure that the GMM represents a probability distribution (Huang et al. 2001; Roch et al. 2007, 2008). The number of mixtures in the GMM is chosen empirically, and its parameters are estimated using an iterative algorithm, such as the Expectation Maximization algorithm (Moon 1996). Once a GMM has been trained, the likelihood is computed for each sound type and a log-likelihood-ratio test is used to decide the species (Roch et al. 2008).
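A minimal sketch of this per-species modeling approach, assuming scikit-learn and synthetic 12-dimensional feature vectors (e.g., cepstral coefficients), is given below; the number of mixture components and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = {"species_A": rng.normal(0.0, 1.0, size=(200, 12)),
         "species_B": rng.normal(2.0, 1.0, size=(200, 12))}

# One GMM per species, with diagonal covariance matrices as described above.
models = {sp: GaussianMixture(n_components=8, covariance_type="diag",
                              random_state=0).fit(feats)
          for sp, feats in train.items()}

unknown = rng.normal(0.0, 1.0, size=(30, 12))  # feature frames from one unknown sound
scores = {sp: m.score(unknown) for sp, m in models.items()}  # mean log-likelihood
print(max(scores, key=scores.get))             # species with the highest likelihood
```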

Gingras and Fitch (2013) used GMMs to classify male advertisement songs of four genera of anurans (Bufo, Hyla, Leptodactylus, Rana) based on spectral features and mel-frequency cepstral coefficients. The GMM based on spectral features resulted in 60% true positives and 13% false positives, and the GMM based on mel-frequency cepstral coefficients resulted in 41% true positives and 20% false positives. Somervuo et al. (2006) correctly classified 55–71% of song fragments from 14 different species of birds based on mel-frequency cepstral coefficients. The correct classification score depended on the number of cepstral coefficients and the number of Gaussian mixtures in the model. Lee et al. (2013) used GMMs to classify song segments of 28 species of birds based on image-shape features instead of traditional spectrographic features. This approach resulted in 86% or 95% classification accuracy for 3- or 5-s birdsong segments, respectively.

Roch et al. (2008) classified clicks produced by Blainville’s beaked whales, pilot whales, and Risso’s dolphins using a GMM. Correct classification scores for these three species were 96.7%, 83.2%, and 99.9%, respectively. Brown and Smaragdis (2008, 2009) used GMMs to classify sounds of killer whales, resulting in up to 92% agreement with 75 perceptually created categories of sound types, depending on the number of cepstral coefficients and Gaussians in the estimate of the probability density function. GMMs were used to classify the A and B type sounds produced by blue whales in the Northeast Pacific (McLaughlin et al. 2008), and six marine mammal species (Mouy et al. 2008) recorded in the Chukchi Sea: bowhead whales, humpback whales, gray whales, beluga whales, killer whales, and walruses. Both studies reported that their classifiers worked very well, but correct classification scores were not provided.

8.4.3.4 Support Vector Machines

Support vector machines (SVMs) are a rich family of learning algorithms based on Vapnik’s (1998) statistical learning theory. An SVM works by mapping features measured from sounds into a high-dimensional feature space. The SVM then finds the optimal hyperplane (function) that maximizes the separation among classes with the lowest number of parameters and the lowest risk of error. This approach attempts to meet the goal of minimizing both the training error and the complexity of the classifier (Mazhar et al. 2007). The best hyperplane is one that maximizes the distance between the hyperplane and the nearest data points belonging to different classes. The support vectors are the data points that determine the position of the hyperplane, and the distance between the hyperplane and the support vectors is called the margin (Fig. 8.21). The optimal classifier maximizes the margin on both sides of the hyperplane. Because the hyperplane can be defined by only a few of the training samples, SVMs tend to be generalized and robust (Cortes and Vapnik 1995; Duda et al. 2001). When classes cannot be separated linearly, SVMs can map features onto a higher dimensional space where the samples become linearly separable (see Fig. 8.26 in Zeppelzauer et al. 2015).

Fig. 8.21

Examples of support vector machine hyperplanes. (a) The margin of the hyperplane is not optimal, (b) a hyperplane with a maximized margin. The support vectors are circled

SVMs originally were designed for binary classification, but a number of methods have been developed for applying them to multi-class problems. The three most common methods are: (1) form k binary “one-against-the-rest” classifiers, where k is the number of classes and the class whose decision-function is maximized is chosen (Vapnik 1998), (2) form all k(k − 1)/2 pair-wise binary classifiers, and choose the class whose pair-wise decision-functions are maximized (Li et al. 2002), and (3) reformulate the objective function of SVM for the multi-class case so decision boundaries for all classes are optimized jointly (Guemeur et al. 2000).
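For illustration, the sketch below fits an RBF-kernel SVM with scikit-learn, whose SVC class handles the multi-class case internally with pair-wise binary classifiers; the features, labels, and parameter values are assumptions, not settings from any cited study.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(300, 10)            # hypothetical measurements per sound
y = np.random.randint(0, 4, size=300)  # four hypothetical classes

# Scaling the features first keeps the kernel distances well behaved.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)

print(clf.predict(X[:5]))
print(clf.named_steps["svc"].support_vectors_.shape)  # training points defining the margins
```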

Gingras and Fitch (2013) used four different algorithms (SVM, k-nearest neighbor, multivariate Gaussian distribution classifier, and GMM) to classify advertisement calls from four genera of anurans and obtained comparable accuracy levels across the models. Fagerlund (2007) used SVMs to classify bird sounds produced by several species using decision trees with binary SVM classifiers at each node. The two datasets used by Fagerlund (2007) contained six and eight bird species, and correct classification scores were 78–88% and 96–98% for the two datasets, respectively, depending on which variables were used in the classifiers.

Zeppelzauer et al. (2015) and Stoeger et al. (2012) both used SVM to identify African elephant rumbles. Zeppelzauer et al. (2015) used cepstral feature vectors and an SVM to distinguish African elephant rumbles from background noise. This SVM resulted in an 88% correct detection rate and a 14% false alarm rate. In addition to SVM, Stoeger et al. (2012) also used linear discriminant analysis (LDA) and nearest neighbor classification algorithms to categorize two types of rumbles produced by five captive African elephants based on spectral representations of the sounds. They obtained a classification accuracy of greater than 97% for all three classification methods.

Jarvis et al. (2006) developed a new type of multi-class SVM, called the class-specific SVM (CS-SVM). In this method, k binary SVMs are created, where each SVM discriminates between one of the k classes of interest and a common reference-class. The class whose decision-function is maximized with respect to the reference-class is selected. If all decision-functions are negative, the reference-class is selected. The advantage of this method is that noise in recordings is treated as the reference-class. Jarvis et al. (2006) used their CS-SVM to discriminate clicks produced by Blainville’s beaked whales from ambient noise and obtained a correct classification score of 98.5%. They also created a multi-class CS-SVM that classified clicks produced by Blainville’s beaked whales, spotted dolphins (Stenella attenuata), and human-made sonar pings. This CS-SVM resulted in 98% correct classification for Blainville’s beaked whale clicks, 88% correct classification for spotted dolphin clicks, and 95% correct classification for sonar pings. It is important to note that the training data were included in their test data, which likely resulted in inflated correct classification scores.

8.4.3.5 Dynamic Time-Warping

Dynamic time-warping (DTW) is a class of algorithms originally developed for automated human speech recognition (Myers et al. 1980). DTW is used to quantitatively compare time-frequency contours of different durations using variable extension and compression of the time axis (Deecke and Janik 2006; Roch et al. 2007). There are different DTW techniques (e.g., Itakura 1975; Sakoe and Chiba 1978; Kruskal and Sankoff 1983), but all are based on comparing a reference sound to a test sound. The test sound is stretched and compressed along its contour to minimize the difference between the shapes of the two contours. Restrictions can be placed on the amount of time-warping that takes place. For example, Buck and Tyack (1993) did not time-warp contours that differed by a factor of more than 2 in duration and assigned those contours a similarity score of zero. Deecke and Janik (2006) stated that contours could only be stretched or compressed up to a factor of 3 to fit the reference contour. In a DTW analysis, all individual contours are compared to all other contours and a similarity matrix is constructed. Sounds are clustered into categories based on the similarity matrix using methods such as k-nearest neighbor cluster analysis or ANNs (Deecke and Janik 2006; Brown and Miller 2007).
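To show the basic idea, the sketch below implements a plain dynamic-programming DTW distance between two frequency contours of unequal length; it omits the warping restrictions discussed above, and the contour values are invented for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Cumulative-cost DTW distance between two 1-D frequency contours."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # stretch or compress the time axis by taking the cheapest predecessor
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

reference = np.array([8.0, 9.0, 10.0, 11.0, 12.0])              # kHz, reference contour
test = np.array([8.0, 8.5, 9.0, 10.0, 10.5, 11.0, 12.0, 12.0])  # kHz, test contour
print(dtw_distance(reference, test))  # small values indicate similar contour shapes
```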

DTW has been used to classify bird sounds. Anderson et al. (1996) applied DTW to recognize individual song syllables for two species of songbirds: indigo buntings (Passerina cyanea) and zebra finches (Taeniopygia guttata). Their analysis resulted in 97% correct classification of stereotyped syllables and 84% correct classification of syllables in plastic song. It is important to note, however, that these results were obtained for song recorded from a single individual of each species in a controlled setting. Somervuo et al. (2006) performed DTW to classify bird song syllables produced by 14 different species. They compared two different methods for computing distance between syllables: (1) simple Euclidean distances between frequency-amplitude vectors, and (2) absolute distance between frequencies weighted by the sum of their amplitudes. Classification accuracy was low, at about 40–50%, depending on the species and the distance method used. They obtained higher classification success using classification methods such as hidden Markov models (HMM) and GMM based on song fragments, rather than on single syllables.

Buck and Tyack (1993) performed DTW to classify three signature whistles from each of five wild bottlenose dolphins recorded in Sarasota, Florida, USA, with 100% accuracy. Deecke and Janik (2006) used DTW to classify signature whistles produced by captive bottlenose dolphins. The DTW algorithm outperformed human analysts and other statistical methods tested by Janik (1999). DTW also was applied to classify stereotypical pulsed sounds produced by killer whales, both in captivity (Brown et al. 2006) and at sea (Deecke and Janik 2006; Brown and Miller 2007). In all of these studies, sounds were classified into categories that were identified perceptually by humans with very high correct classification scores.

Oswald et al. (2021) used dynamic time-warping and neural network analysis to group whistle contours produced by short- and long-beaked common dolphins (Delphinus delphis and D. bairdii) into categories. Many of the resulting categories were shared between the two species, but each species also produced a number of species-specific categories. Random forest analysis showed that whistles in species-specific categories could be classified to species with significantly higher accuracy than whistles in shared categories. This suggests that not every whistle carries species information, and that specific whistle types play an important role in dolphin species identification.

8.4.3.6 Hidden Markov Models

Hidden Markov model (HMM) theory was developed in the late 1960s by Baum and Eagon (1967) and now is used commonly for human speech recognition (Rabiner et al. 1983, 1996; Levinson 1985; Rabiner 1989). To create an HMM, a vector of features is extracted from a signal at discrete time steps. The temporal evolution of these features from one state to the next is modeled by creating a transition matrix M, where Mij is the probability of transition from state i to state j, and an emission matrix E, where Eis is the probability of observing signal s in state i (Rickwood and Taylor 2008). A different HMM is created for each species in the dataset, and a sound is classified by determining which of the HMMs has the highest likelihood of producing that particular set of signal states. Training HMMs requires significant amounts of computing, and proper estimation of the transition and output probabilities is of crucial importance (Makhoul and Schwarz 1995). Excellent tutorials on HMMs can be found in Rabiner and Juang (1986) and Rabiner (1989).
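The sketch below, assuming the third-party hmmlearn package and synthetic feature sequences, shows the train-one-model-per-species strategy described above, with classification by the highest log-likelihood; the number of states and the data are illustrative only.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
train = {"species_A": rng.normal(0.0, 1.0, size=(500, 12)),
         "species_B": rng.normal(1.0, 1.0, size=(500, 12))}

# One HMM per species; transition and emission parameters are estimated from the data.
models = {sp: GaussianHMM(n_components=5, covariance_type="diag",
                          n_iter=50, random_state=0).fit(obs)
          for sp, obs in train.items()}

unknown = rng.normal(0.0, 1.0, size=(40, 12))  # feature frames from one unknown sound
loglik = {sp: m.score(unknown) for sp, m in models.items()}
print(max(loglik, key=loglik.get))             # the most likely species
```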

A significant advantage inherent to HMMs is their ability to model time and spectral variability simultaneously (Makhoul and Schwarz 1995). They are able to model time series that have subtle temporal structure and are efficient for modeling signals with varying durations by performing nonlinear, temporal alignment during both the training and classification processes (Clemins et al. 2005; Roch et al. 2007; Trifa et al. 2008). Using HMMs, complex models can be built to deal with complicated biological signals (Rickwood and Taylor 2008), but care must be taken when choosing training samples to obtain a high generalization ability. The performance of an HMM is influenced by the size of the training set, the feature extraction method, and the number of states in the model (Trifa et al. 2008). Recognition performance is also affected by noise (Trifa et al. 2008).

In addition to being successfully implemented in human speech recognition, HMMs have been used to classify the sounds produced by birds (Kogan and Margoliash 1998; Trawicki et al. 2005; Trifa et al. 2008; Adi et al. 2010), red deer (Cervus elaphus; Reby et al. 2006), African elephants (Clemins et al. 2005), common dolphins (Sturtivant and Datta 1997; Datta and Sturtivant 2002), killer whales (Brown and Smaragdis 2008, 2009), beluga whales (Clemins and Johnson 2005; Leblanc et al. 2008), bowhead whales (Mellinger and Clark 2000), and humpback whales (Suzuki et al. 2006). HMMs perform as well as, or better than, both GMMs and DTW (Weisburn et al. 1993; Kogan and Margoliash 1998) and are becoming more common in animal classification studies.

Adi et al. (2010) also used HMMs to examine individually distinct acoustic features in songs produced by ortolan buntings (Emberiza hortulana). They represented each song syllable using a 15-state HMM (Fig. 8.22). These HMMs then were connected to represent song types. The 14 most common song types were included in the analysis and correct classification ranged from 50% to 99%, depending on the song type. Overall, 90% of songs were correctly classified. Adi et al. (2010) used these results to illustrate the feasibility of using acoustic data to assess population sizes for these birds.

Fig. 8.22

Example of a 15-state hidden Markov model representation of the waveform of a song syllable produced by an ortolan bunting to capture the temporal pattern of the syllable (Adi et al. 2010). © Acoustical Society of America, 2010. All rights reserved

Reby et al. (2006) used HMMs to examine whether common roars uttered by red deer during the rutting season can be used for individual recognition. They recorded roar bouts from seven captive red deer and used HMMs to model roar bouts as successions of silences and roars. Each roar in the analysis was modeled as a succession of states of frequency components measured from the roars. Overall, the HMM correctly identified 85% of roar bouts to the individual deer, showing that roars were individually specific. Reby et al. (2006) also used HMMs to examine stability in this individuality over the rutting season. They did this by training an HMM using roar bouts recorded at the beginning of the rutting season and testing the model using roar bouts recorded later in the rutting season. Overall, 58% of roar bouts were classified correctly, suggesting that individual identification cues in roar bouts varied over time.

8.5 Challenges in Classifying Animal Sounds

Placing sounds into categories is not always straightforward. Sounds produced by a particular species often contain a great deal of variability caused by different factors (e.g., location, date, age, sex, and individuality), which can make it difficult to define categories. In addition, sound categories are not always sharply demarcated, but instead grade or gradually transition from one form to another. It is important to be aware of the challenges in a particular dataset. Below are some types of variation that can be encountered in the classification of animal sounds.

8.5.1 Recording Artifacts

Bioacousticians need to be aware that recorded animal sounds are affected by the frequency and sensitivity specifications of the recording system used. An inappropriate recording system can result in distorted or partial sounds, which complicates their classification. For example, sounds can be misrepresented in recordings if the frequency response of the recording system is not linear, if the sampling frequency is too low, if sounds exist below or above the functional frequency range of the recording system, or if aliasing occurs (see Chap. 4). Ideally, recording systems should be carefully assembled and calibrated for the specific application. If the effects of the recording system could always be removed completely from recordings, sound classification would be more consistent and comparable. However, sounds published in the literature are sometimes received sounds that were affected by the recorder and/or the sound propagation environment.

One of the most common problems in underwater acoustic recordings is mooring noise. If hydrophones are held over the side of a boat, the recordings will contain sound from waves splashing against the boat or the hydrophone cable rubbing against the boat. Recorders built into mooring lines can record cable strum or clanking chains. If multiple oceanographic sensors are moored together, sounds from other instruments (e.g., wipers on a turbidity sensor) may be recorded. Recorders resting on soft seafloor in coastal water may record the sound of sand swishing over the mooring. In addition, hydrostatic pressure fluctuations from the recorder bouncing in the water column or vortices at the hydrophone if deployed in strong currents will cause flow noise. All of these artifacts can last from seconds to minutes and appear in spectrograms as power from a few hertz to high kilohertz. Minimization of mooring noise and identification of recording artifacts is an art (also see Chaps. 2 and 3).

Similarly, artifacts can be recorded during airborne recordings. Wind is a primary artifact; however, moving vegetation and precipitation can also add noise to a recording. Any disturbance to the microphone can generate unwanted tapping or static on a recording. Recording systems in terrestrial environments need to be secured to minimize such noises.

8.5.2 Sound Propagation Effects

Environmental features of air or water can change the way sound propagates and thus the acoustic characteristics of a recorded sound. Bioacousticians need to understand environmental effects on the features of received sound to avoid classification of a signal variant as a new type, rather than as a particular sound type affected by propagation conditions. The sound propagation environment can affect both the spectral and temporal features of sound as it propagates from the animal to the recorder (see Chaps. 5 and 6). For example, energy at high frequencies is lost (attenuates) very quickly due to scattering and absorption, and therefore high-frequency harmonics do not propagate over long ranges. Acoustic energy at low frequencies (i.e., long wavelengths) does not travel well in narrow waveguides (e.g., shallow water). Because different frequencies within a sound can attenuate at different rates, the same sound can appear differently on a spectrogram, depending on the distance at which it was recorded.

Differential attenuation of frequencies in air is shown in Fig. 8.23. Signals produced by a big brown bat (Eptesicus fuscus) flying toward a microphone contain more ultrasonic components than signals recorded from a bat flying away from the microphone. The signal with the longest frequency modulation (from 100 to 50 kHz) is received when the bat is closest to the microphone. Variations in this spectrogram show how one sound type could be categorized differently simply because of distance between the animal and recorder, orientation to the microphone, and the gain setting.

Fig. 8.23

Spectrogram of big brown bat (Eptesicus fuscus) circling a recording device while searching and pursuing aerial prey. As the bat approaches the microphone, more of the ultrasonic signal is received (calls reach up to 70 kHz). As the bat moves away, the signal is attenuated. Time between calls shortens notably as the bat pursues an insect prey for capture. Notice that the bat emits “search” calls at 25–40 kHz, approach calls at 30–70 kHz when it is in pursuit or trying to navigate flight through complex space, and finally terminal calls at 30–55 kHz

Other sound propagation effects include reverberation (which leads to the temporal spreading of brief, pulsed sounds) and frequency dispersion. Frequency dispersion is a result of energy at different frequencies traveling at different speeds. This leads to sounds being spread out in time and, specifically in some underwater environments, can cause pulsed sounds to become frequency-modulated sounds (either up- or downsweeps; Fig. 8.24).

Fig. 8.24

Spectrograms of marine seismic airgun signals recorded at three different ranges: 1.5 km (top), 80 km over soft seabed (middle), and 40 km over a hard seabed (bottom). The top and bottom spectrograms are of the same seismic survey. Pulses were brief and broadband near the source, but became frequency-modulated and narrowband some distance away due to dispersion (Erbe et al. 2016). © Erbe et al.; https://ars.els-cdn.com/content/image/1-s2.0-S0025326X15302125-gr9_lrg.jpg. Licensed under CC BY 4.0; https://creativecommons.org/licenses/by/4.0/

Finally, ambient noise (i.e., geophysical, anthropogenic, and non-target biological noise) superimposes on animal sounds, and at some distances and frequencies, parts of the animal sound spectrum will begin to drop below the level of ambient noise. As a result, the same animal sound, recorded in a different environment or at a different distance from the animal, can look quite different on a spectrogram and so may be misclassified as two different sound types.

8.5.3 Angular Aspects of Sound Emission

The orientation of an animal relative to the receiver (microphone or hydrophone) can change the acoustic features of the recorded sound. This complicates classification, and off-axis variations of a sound need to be known so they can be categorized as variants of a particular sound type, rather than as new sound types. Not all sounds emitted by animals are omni-directional (i.e., propagating equally in all directions relative to the animal). Au et al. (2012) studied the directionality of bottlenose dolphin echolocation clicks by measuring the horizontal and vertical emission beam patterns of these sounds. The angle at which an echolocation click was recorded relative to the transducer (or echolocating animal) affected not only its received level, but also its waveform and frequency spectrum (Fig. 8.25). Sperm whale (Physeter macrocephalus) echolocation clicks, when recorded off-axis (i.e., away from the center of the emission beam), consisted of multiple complex pulses that were likely due to internal reflections within the sperm whale’s head (Møhl et al. 2003; also see Chap. 12).

Fig. 8.25

Waveforms and spectra of a bottlenose dolphin echolocation click in the horizontal (a) and vertical (b) planes (Au et al. 2012). © Acoustical Society of America, 2012. All rights reserved

8.5.4 Geographic Variation

Geographic variation, or differences in the sounds produced by populations of the same species living in different regions, has been documented for many terrestrial and aquatic animals, including Hawaiian crickets (Mendelson and Shaw 2003), Túngara frogs (Engystomops pustulosus, Pröhl et al. 2006), bats (Law et al. 2002; Aspetsberger et al. 2003; Russo et al. 2007; Yoshino et al. 2008), pikas (Borisova et al. 2008), sciurid rodents (Gannon and Lawlor 1989; Slobodchikoff et al. 1998; Yamamoto et al. 2001; Eiler and Banack 2004), singing mice (Scotinomys spp., Campbell et al. 2010), primates (Mitani et al. 1992; Delgado 2007; Wich et al. 2008), cetaceans (Helweg et al. 1998; McDonald et al. 2006; Delarue et al. 2009; Papale et al. 2013, 2014), and elephant seals (Mirounga spp., Le Boeuf and Peterson 1969). When developing classifiers, it is important to understand the degree of geographic variation in a sound repertoire and the range over which this occurs. If geographic variation exists, then a classifier trained using data collected in one location may not work well when applied to data collected in another location.

One of the underlying causes of geographic variation may be reproductive isolation of a population. Keighley et al. (2017) used DFA with stepwise variable selection to determine geographic variation in sounds from six major populations of palm cockatoos (Probosciger aterrimus) in Australia. Palm cockatoos from the east coast (Iron Range National Park) had unique contact sounds and produced fewer sound types than at other locations. The authors speculated that this large difference was due to long-term isolation at this site and noted that documentation of geographic variation in sounds provided important conservation information for determining connectivity of these six populations.

Thomas and Golladay (1995) employed PCA to classify nine underwater vocalization types produced by leopard seals (Hydrurga leptonyx) at three study sites near Palmer Peninsula, Antarctica. The PCA successfully separated vocalizations from the three study areas and provided information about what features of the sounds were driving the differences among locations. For example, the first principal component was influenced by maximum, minimum, start, and end frequencies, the second principal component was influenced by the presence or absence of overtones, and the third principal component was predominantly related to time relationships, such as duration and time between successive sounds. Note that some sound types were absent at some locations.

8.5.5 Graded Sounds

Some animals produce sound types that grade or gradually transition from one type to another. Researchers should not neglect the potential existence of vocal intermediates in classification. For example, Schassburger (1993) described sounds produced by timber wolves (Canis lupus) as barks, growl-moans, growls, howls, moans, snarls, whimpers, whine-moans, whines, woofs, and yelps. Wolves combine these 11 principal sounds to create mixed-sounds that often grade from one type into another.

Click trains, burst-pulse sounds, and whistles produced by delphinids are typically considered three distinct categories of sound. Click trains and burst-pulse sounds are composed of short, exponentially damped sine waves separated by periods of silence, while whistles are generally thought of as continuous tonal sounds, often sweeping in frequency. While these sounds appear quite different from one another on spectrograms, closer inspection of their waveforms reveals that some sounds that look like whistles on a spectrogram actually contain a high degree of amplitude modulation. In other words, some sounds that are considered to be whistles are made up of pulses with inter-pulse intervals that are too short to hear or to be resolved by the analysis window of the spectrogram (Fig. 8.26). As an example of this, Murray et al. (1998) used self-organizing neural networks to analyze the vocal repertoires of two captive false killer whales (Pseudorca crassidens) based on measurements taken from waveforms. They found that rather than organizing sounds into distinct categories, the vocal repertoire was more accurately represented by a graded continuum, with exponentially damped sinusoidal pulses at one end and continuous sinusoidal signals at the other. Beluga whales also have been shown to have a graded vocal repertoire (Karlsen et al. 2002; Garland et al. 2015). Whistles with a high degree of amplitude modulation have been recorded from Atlantic spotted (Stenella frontalis) and spinner (Stenella longirostris) dolphins (Lammers et al. 2003), suggesting that this graded continuum model is applicable to these species as well.

Fig. 8.26

Spectrogram and waveform of a false killer whale vocalization. The vocalization appears to be a whistle in the spectrogram, but the waveform reveals discrete pulses between 61 and 67 ms (Murray et al. 1998). © Acoustical Society of America, 1998. All rights reserved

8.5.6 Repertoire Changes Over Time

Some animal sound repertoires change over time, which complicates their classification. For example, humpback whale song slowly changes over the course of a breeding season as new units are introduced and old ones discarded (Noad et al. 2000). Song also changes from one season to the next, and in one instance, eastern Australian humpback whales changed to the song of the western Australian population within 1 year (Noad et al. 2000).

Antarctic blue whales can be heard off southwestern Australia from February to October every year. The upper frequency of their Z-call decreases over the season by about 0.4–0.5 Hz. At the beginning of the next season, the Z-call jumps in frequency to about the mean of the Z frequency of the previous season, and then decreases again, leading to an average decrease in the frequency of the upper part of the Z-call by 0.135 ± 0.003 Hz/year (Fig. 8.27; Gavrilov et al. 2012). A similar decrease (albeit at different rates at different locations) has been observed for the “spot call,” of which the animal source remains elusive (Fig. 8.27; Ward et al. 2017). The reasons for these shifts are unknown.

Fig. 8.27

Weekly means of the upper part of the Antarctic blue whale Z-call over several years, as well as of the spot call, which remains to be identified to species. All locations are off Australia (GAB: Great Australian Bight). Data updated from Gavrilov et al. (2012) and Ward et al. (2017). Courtesy of Sasha Gavrilov

8.6 Summary

Animals, whether they are in air, on land, or under water, produce sound in support of their various life functions. Cicadas join in chorus to repel predatory birds (Simmons et al. 1971); male fishes chorus on spawning grounds to attract females (Amorim et al. 2015); frogs call to attract mates and to mark out their territory (Narins et al. 2006); birds, too, sing for territorial and reproductive reasons (Catchpole and Slater 2008); bats emit clicks for echolocation during hunting and navigating, as do dolphins (Madsen and Surlykke 2013). In order to study animals by listening to their sounds, sounds need to be classified to species, to behavior, etc. In the early days, this was done without measurements or with only the simplest measuring tools. Scientists listened to the sounds in the field, often while visually observing animals. Scientists recorded sounds in the field and analyzed the recordings in the laboratory by listening, looking at oscillograms or spectrograms, and manually sorting sounds into types. Nowadays, with the affordability of autonomous recording equipment, bioacousticians collect vast amounts of data, which can no longer be analyzed without the aid of automated data processing, data reduction, and data analysis tools. Given simultaneous advances in computer hard- and software, datasets may be analyzed more efficiently, and with the added advantage of reducing opportunities for human subjective biases.

In this chapter, we presented software tools for automatically detecting animal sounds in acoustic recordings, and for classifying those sounds. The detectors we discussed compute a specific quantity of the sound (such as its instantaneous energy or entropy) and then apply a threshold above which the sound is deemed detected. The specific detectors were based on acoustic energy, Teager–Kaiser energy, entropy, matched filtering, and spectrogram cross-correlation. Setting the detection threshold critically affects how many signals are detected and how many are missed. We presented two ways of finding the best threshold and assessing detector performance: receiver operating characteristics and precision-recall curves.

Once signals have been detected, they can be classified. A common pre-processing step immediately prior to classification includes the measurement of sound features such as minimum and maximum frequency, duration, or cepstral features. The software tools we presented for classification included parametric clustering, principal component analysis, discriminant function analysis, classification trees, and machine learning algorithms. No single tool outperforms all others; rather, the best tool suited for the specific task needs to be employed. We discussed advantages and limitations of the various tools and provided numerous examples from the literature. Finally, challenges resulting from recording artifacts, the environment affecting sound features, and changes in sound features over time and space were explored.

It is important to remember that human perception of a sound likely is not the same as an animal’s perception of the sound and yet bioacousticians commonly describe or classify animal sounds in human terms. Classification of the acoustic repertoire of an animal into sound types provides a convenient framework for comparing and contrasting sounds, taking systematic measurements from portions of the repertoire, and performing statistical analyses. However, categories determined based on human perception may have little or no relevance to the animals and so human categorizations can be biologically meaningless. For example, humans have limited low-frequency and high-frequency hearing abilities compared to many other species, and so aural classification of sound types is sometimes based on only a portion of a sound audible to the human listener. Whether sound types determined by humans are meaningful classes to the animals is mostly unknown. While categorizing sounds based on function is an attractive approach for the behavioral zoologist, establishing the functions of these sounds is often challenging. In our review of classification methods, it was clear that methods developed for human speech could be applied to animal sounds. Some fascinating questions lie ahead for bioacousticians as they attempt to extend understanding of the perception experienced by other animals.

Even with the above caveats, detection and classification of animal sounds is useful for research and conservation. It allows populations to be monitored, their distribution and abundance to be determined, and impacts (e.g., from human presence or climate change) to be assessed. It can also be useful for the conservation of a species (i.e., to create taxonomies, identify geographic variation in populations, examine ecological connectivity among populations, and detect changes in the biological use of sounds due to the advent and growth of anthropogenic noise). Classification of animal sounds is important for understanding the behavioral ecology and social systems of animals and can be used to identify individuals, social groups, and populations. The ability to study these types of topics will ultimately lead to a deeper understanding of the evolutionary forces that shape animal bioacoustics.

With a goal to foster wider participation in research on bioacoustic pattern recognition, a number of global competitions are held regularly. The annual Detection and Classification of Acoustic Scenes and Events (DCASE) workshops and BirdCLEF challenges (part of the Cross Language Evaluation Forum) attract hundreds of data scientists who develop machine learning solutions for recognizing bird sounds in soundscape recordings. The marine mammal community organizes the biennial Detection, Classification, Localization, and Density Estimation (DCLDE) workshops. These challenges release large training datasets for researchers to develop detection and classification systems, assess the performance of submitted solutions on “held out” datasets, and reward the top-ranked submissions. The datasets from these challenges are often made available for use by the research community after the competitions, and some workshops make the submitted solutions available as well.

8.7 Additional Resources