Introduction

The timely and accurate detection of animals, birds and insects is of critical importance for conservation, ecology and epidemiology. We regard effective analysis of the natural soundscape as a constituent component of such detection. In this paper, we focus on bioacoustic classification, with a particular emphasis on mosquito detection. As part of showcasing the methods developed for this application, we describe how they can also, with minimal alteration, offer robust results in other bioacoustic classification domains.

Mosquitoes are responsible for hundreds of thousands of deaths every year due to their capacity to vector lethal parasites and viruses, which cause diseases such as malaria, lymphatic filariasis, zika, dengue and yellow fever [51, 52]. Their ability to transmit diseases has been widely known for over a hundred years, and several practices have been put in place to mitigate their impact on human life. Examples of these include insecticide-treated mosquito nets [7, 33] and insect sterilization techniques [4]. However, further progress in the battle against mosquito-vectored disease requires a more accurate identification of species and their precise location—not all mosquitoes are vectors of disease, and some non-vectors are morphologically identical to highly effective vector species. Current surveys rely either on human-landing catches or on less effective light traps. In part, this is due to the lack of cheap, yet accurate, surveillance sensors that can aid mosquito detection. Acoustic monitoring of mosquitoes proves compelling, as the insects produce a sound both as a by-product of their flight and as a means for communication and mating. Detecting and recognizing this sound is an effective method to locate the presence of mosquitoes and even offers the potential to categorize by species. Nonetheless, automated mosquito detection presents a fundamental signal processing challenge, namely the detection of a weak signal embedded in noise. Current detection mechanisms rely heavily on domain knowledge, such as tuning models to likely fundamental frequency and harmonics, and often extensive handcrafting of features, frequently similar to traditional speech representation methods. Over the last decade, there have been increasingly impressive performance gains achieved by the paradigm shift to deep learning, including bioacoustics [29]. An opportunity hence emerges to exploit and expand upon these advances to tackle our application problem.

Deep learning approaches, however, tend to be effective only once a critical number of training samples has been reached [9]. Consequently, data-scarce problems are not well suited to this paradigm. As with many other domains, the task of data labelling is expensive in both time requirement for hand labelling and associated ambiguity—namely that multiple human experts will not be perfectly concordant in their labels. Furthermore, recordings of free-flying mosquitoes in realistic environments are scarce [37] and hardly ever labelled.

This paper presents a novel approach for classifying events from acoustic data using scarce training data. Our approach is based on a convolutional neural network classifier conditioned on wavelet representations of the raw data. By exploiting the high sample rates of audio recordings, we are able to create sufficient training data for deep learning to remain highly effective. The network architecture and associated hyperparameters are, however, still strongly influenced by constraints in dataset size. We compare our methods to well-established classifiers, trained on both handcrafted features and the short-time Fourier transform (STFT), as well as human-made labels of mosquito audio recordings. We show that wavelet-conditioned CNN classifications are consistently made more accurately and confidently than on the STFT. The majority of our algorithms are able to more reliably detect mosquitoes (with accuracy above 90%) than human labellers, where only 70% of labels are in full agreement amongst four labellers.

Furthermore, without additional hyperparameter tuning we demonstrate that our approach scales well to different data domains, a transfer that traditional handcrafted features or classifiers struggle to make. Highlighting the generic nature of the solution we propose, we show that the CNN is also able to extract feature representations that allow it to distinguish between nine species of birds with reliably high accuracy (over 90%), similarly from very little data.

The remainder of this paper is structured as follows. Section 2 addresses related work, explaining the motivation and benefits of our approach. Section 3 details the method we adopt, providing insight into the relative strengths of wavelet transforms. Section 4 describes the experimental setup. Section 5 highlights the value of the method. We visualize and interpret the predictions made by our algorithm on unseen data in Sect. 6 to help reveal informative features learned from the representations and verify the method. Finally, we suggest further work and conclude in Sect. 7.

Related work

The use of artificial neural networks in acoustic detection and classification of species dates back to at least the beginning of the century, with the first approaches addressing the identification of bat echolocation calls [40]. Both manual and algorithmic techniques have subsequently been used to identify insects [10, 53], elephants [15], delphinids [39] and other animals. The benefits of leveraging the sound animals produce (both actively as a communication mechanism and passively as a result of their movement) are clear: animals themselves use sound to identify prey, predators and mates. Sound can therefore be used to locate individuals for biodiversity monitoring, pest control, identification of endangered species and more.

This section will therefore briefly review the use of machine learning approaches in bioacoustics. We describe the traditional feature and classification approaches to acoustic signal detection. In contrast, we also present the benefit of feature extraction methods inherent to current deep learning approaches. Finally, we narrow our focus down to the often overlooked wavelet transform, which offers significant performance gains in our pipeline.

Applications

The employment of artificial neural networks has proven successful for over a decade. In Chesmore and Ohya [10], a neural network classifier was used to discriminate four species of grasshopper recorded in northern England, with accuracy surpassing 70%. Other classification methods include Gaussian mixture models [41, 44] and hidden Markov models [34, 53], applied to a variety of different features extracted from recordings of singing insects. The work of Chen et al. [9] attributes the stagnation of automated insect detection accuracy to the sole use of acoustic devices, which are often not capable of producing a signal sufficiently clean to be classified correctly. In their work, they replace microphones with optical sensors, recording mosquito wingbeat through a laser beam hitting a phototransistor array—an extension of the method proposed by Moore et al. [36]. In a real-world setting, the resultant signals have a higher signal-to-noise ratio than those recorded acoustically. We regard these approaches and acoustic sensors as complementary, rather than competitors, and note that approaches which work well for acoustic detection can also be used to perform detection in other datasets, including optically sensed data, as well as other bioacoustic problems.

Whichever technique is used to record a mosquito wingbeat frequency, the need arises to be able to identify the insect’s flight in a (more or less) noisy recording. The following section therefore reviews recent achievements in feature representation and learning, in the broad context of practical acoustic signal classification.

Feature representation and learning

The process of automatically detecting an acoustic signal in noise typically consists of an initial preprocessing stage, which involves cleaning and de-noising the signal itself, followed by a feature extraction process, in which the signal is transformed into a format suitable for a classifier, followed by the final classification stage. Historically, audio feature extraction in signal processing employed domain knowledge and intricate understanding of digital signal theory [28], leading to handcrafted feature representations.

Many of these representations recur in the literature. A powerful, though often overlooked, technique is the wavelet transform, which has the ability to represent multiple time-frequency resolutions [2, Chapter 9]. An instantiation thereof with a fixed time-frequency resolution is the Fourier transform. The Fourier transform can be temporally windowed with a smoothing window function to create a short-time Fourier transform (STFT). Mel-frequency cepstral coefficients (MFCCs) create lower-dimensional representations by taking the STFT, applying a nonlinear transform (the logarithm), pooling and a final affine transform. A further example is presented by linear prediction cepstral coefficients (LPCCs), which pre-emphasise low-frequency resolution and thereafter undergo linear predictive and cepstral analysis [1].

Detection methods have fed generic STFT representations to standard classifiers [42], but more frequently complex features and feature combinations are used, applying dimensionality reduction to combat the curse of dimensionality [32]. Complex features (e.g., MFCCs and LPCCs) were originally developed for specific applications, such as speech recognition, but have since been used in several audio domains [35]. Humphrey et al. [28] argue that using features specifically developed for a prior application is unsustainable and has contributed to the stagnation in the field of audio event recognition.

In contrast, the deep learning approach usually consists of applying a simple, general transform to the input data and allowing the network to both learn a feature representation and perform classification. This enables the models to learn salient, hierarchical features from raw data. The automated deep learning approach has recently featured prominently in the machine learning literature, showing impressive results in a variety of application domains, such as computer vision [31] and speech recognition [32]. However, deep learning models such as convolutional and recurrent neural networks are known to have a large number of parameters and hence typically require large data and hardware resources. Despite their success, these techniques have only recently received more attention in the time series signal processing literature.

A prominent example of this shift in methodology is the BirdCLEF bird recognition challenge. The challenge consists of the classification of bird songs and calls into up to 1500 bird species from tens of thousands of crowd-sourced recordings. The introduction of deep learning has brought drastic improvements in mean average precision (MAP) scores. The best MAP score of 2014 was 0.45 [23], which was improved to 0.69 the following year when deep learning was applied, outperforming the closest scoring handcrafted method by 19% [29]. The impressive performance gain came from the utilization of well-established convolutional neural network practice from image recognition. By transforming the signals into STFT spectrogram format, the input is represented by 2D matrices, which are used as training data. The following year saw a further jump to 0.71 [46] by utilizing transfer learning of the Inception-v4 deep convolutional neural network which was highly successful in ImageNet. Alongside this example, the most widely used base method to transform the input signals is the STFT [26, 43, 45].

An alternative feature transformation can be obtained with wavelets. Gaining popularity in the late 1990s, wavelets have been applied successfully to efficient image compression [12, JPEG 2000], de-noising [20], and have shown an ability to form efficient multi-resolution representations [17]. These properties have led to the use of wavelets in deep learning in two general ways. In one, wavelets are used as a preprocessing step to form noise-robust representations of time series, while in the second wavelets are employed to replace neurons to form wavelet neural networks. An example application of the former used Haar wavelets for stock price time series forecasting with recurrent neural networks [27]. In the latter scenario, wavelet neural networks have seen some success in time series prediction [8], signal classification and compression [30], but a lack of standard representations and general frameworks has prevented wider adoption [3].

As a result, to the best of our knowledge, the wavelet transform is rarely used as the representation domain for a convolutional neural network. In the following section, we present our method, which leverages the benefits of the wavelet transform demonstrated in the signal processing literature, as well as the ability to form hierarchical feature representations for deep learning.

Methods

We present a novel wavelet transform-based convolutional neural network architecture for the detection of events in noisy audio recordings. As our results of Sect. 5 indicate superior performance when training on wavelet representations of the data, we describe in depth the wavelet transform to provide insight into its benefits over the conventional STFT. We explain the wavelet transform in the context of the algorithm, thereafter describing the neural network configurations and a range of traditional classifiers against which we assess performance. The key steps of the feature extraction and classification pipeline are given in Algorithm 1.

figure a

The wavelet transform

We begin by discussing the details of the transform used in Step 2 of Algorithm 1 as a base to further extract features. A well-established approach in signal processing is the Fourier transform, which can be used to express any signal as an infinite series of sines and cosines. Its main disadvantage is the provision only of frequency resolution, meaning one can identify all the frequencies present in a signal, but not their occurrence in time. To overcome this, common approaches include cutting the signal into sections of time and treating each segment separately. This action, however, smears out frequencies, especially in the case of short windows. A wide window is able to provide better frequency resolution at the sacrifice of time resolution. Choosing a window function therefore limits one to a fixed time-frequency resolution. The uncertainty in time-frequency is referred to as the Heisenberg-Gabor limit [6], which is derived from the notion that the product of the precision in time and frequency is limited.
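The fixed resolution imposed by the window can be made concrete with a small sketch. It assumes a Hann window and a sampling rate of 8000 Hz; the function names `stft_magnitude` and `resolution`, the 600 Hz test tone and the two window lengths are illustrative choices, not the configuration used in the experiments. The frequency bin spacing is \(F_s/N\) and the window duration is \(N/F_s\), so improving one necessarily degrades the other.

```python
import numpy as np

def stft_magnitude(x, n_window, hop):
    """Magnitude STFT with a Hann window; shape (n_window // 2 + 1, n_frames)."""
    window = np.hanning(n_window)
    n_frames = 1 + (len(x) - n_window) // hop
    frames = np.stack([x[i * hop: i * hop + n_window] * window
                       for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))

def resolution(fs, n_window):
    """Return (frequency bin spacing in Hz, window duration in seconds)."""
    return fs / n_window, n_window / fs

fs = 8000                                  # assumed sampling rate in Hz
t = np.arange(fs) / fs                     # one second of signal
x = np.sin(2 * np.pi * 600 * t)            # illustrative 600 Hz tone

for n_window in (64, 512):                 # short versus long window
    df, dt = resolution(fs, n_window)
    spec = stft_magnitude(x, n_window, hop=n_window // 2)
    peak_hz = np.argmax(spec.mean(axis=1)) * df
    print(f"N={n_window}: bin spacing {df:.1f} Hz, window {dt * 1e3:.1f} ms, "
          f"peak located at {peak_hz:.1f} Hz")
```

The short window localizes events in time to within a few milliseconds but can only place the tone on a coarse frequency grid, while the long window does the opposite.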

The wavelet transform employs a fully scalable modulated window which provides a principled solution to the windowing function selection problem [49]. The window is slid across the signal, and for every position a spectrum is calculated. The procedure is then repeated at a multitude of scales, providing a signal representation with multiple time-frequency resolutions. This allows the provision of good time resolution for high-frequency events, as well as good frequency resolution for low-frequency events, which in practice is a combination best suited to real signals.

We choose to use the continuous wavelet transform (CWT) due to its successful application in time-frequency analysis [18]. The CWT is particularly well suited over the discrete wavelet transform to time-frequency analysis as redundancy makes information available in peak shape and peak composition more visible and easier to interpret [21]. The CWT can be written in the time domain as:

$$\begin{aligned} a(s,\tau ) = |s|^{-1/2} \int _{- \infty }^{\infty } f(t) \psi ^*\left( \frac{t-\tau }{s}\right) {\text{d}}t, \end{aligned}$$
(1)

or equivalently in the frequency domain as:

$$\begin{aligned} a(s,\tau ) = |s|^{1/2}\int _{- \infty }^{\infty } F(\omega ) \varPsi ^*({s \omega }) e^{i\omega \tau }{\text {d}}\omega , \end{aligned}$$
(2)

where s is the scale factor, \(\tau\) is the translation factor, \(|s|^{-1/2}\) is the energy normalization factor, and * denotes complex conjugation. The wavelets are generated by scaling and translating a single mother wavelet \(\psi (t)\). Through continuous translation in \(\tau\), the resulting CWT coefficients \(a(s,\tau )\) can be assembled for a multitude of scales to either reconstruct the signal with an inverse transform or to create a spatial representation, called the scalogram. An equivalent representation in Fourier space requires the continuous application of 1-D Fourier transforms with windows that are translated in time. We can illustrate this by substitution of \(\psi ^*_{s,\tau }(\frac{t-\tau }{s}) = e^{-i \omega t}\) in Eq. 1. Essentially, this is equivalent to using a fixed basis with \(s=1\) and ignoring the translation in \(\tau\). We thereby emphasise the more principled solution employed with the CWT, eliminating the need to choose and parameterize the window function necessary for STFT representations. Furthermore, working with the CWT one is free to choose a wavelet function with properties and characteristics that best suit the data, given knowledge of the signal being analysed. A popular choice of wavelet function for time-frequency analysis is given by the bump wavelet [50], expressed in the Fourier domain as:

$$\begin{aligned} \varPsi (s\omega ) = \exp \left( 1 - \frac{1}{1 - (s\omega - \mu )^2/\sigma ^2}\right) \mathbb {I}[(\mu - \sigma )/s, (\mu + \sigma )/s], \end{aligned}$$
(3)

where \(\mathbb {I}[\cdot ]\) is the indicator function. Valid values for \(\mu , \sigma\) are [3, 6], [0.1, 1.2], respectively. Smaller values of \(\sigma\) result in a wavelet function spanning a narrower frequency bandwidth (Fig. 1a), which results in superior frequency localization but poorer time localization. The bump wavelet is symmetric in frequency and has a direct relationship between wavelet scale and centre frequency, which we illustrate in Fig. 1b. As a result, we can create spectrograms in frequency which retain clear interpretability (which becomes important for Sects. 4, 6).
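Equations 2 and 3 can be combined into a short numerical sketch. It evaluates the bump wavelet in the Fourier domain and applies Eq. 2 with the discrete Fourier transform, one scale at a time. The function names, the default values \(\mu = 5\) and \(\sigma = 0.6\) (chosen from within the valid ranges quoted above), the frequency grid and the test tone are assumptions made for illustration; constant factors such as \(2\pi\) are ignored, and this is not the implementation used in the experiments.

```python
import numpy as np

def bump_wavelet_hat(w, mu=5.0, sigma=0.6):
    """Bump wavelet in the Fourier domain (Eq. 3), evaluated at w = s * omega."""
    out = np.zeros_like(w, dtype=float)
    z = (w - mu) / sigma
    inside = np.abs(z) < 1.0                     # support given by the indicator function
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - z[inside] ** 2))
    return out

def cwt_bump(x, fs, freqs, mu=5.0, sigma=0.6):
    """CWT via Eq. 2; returns (coefficients of shape (n_scales, n_samples), scales)."""
    n = len(x)
    omega = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)     # angular frequency in rad/s
    X = np.fft.fft(x)
    # Invert f = mu / (2 * pi * s) to obtain the scale for each centre frequency.
    scales = mu / (2 * np.pi * np.asarray(freqs, dtype=float))
    coeffs = np.empty((len(scales), n), dtype=complex)
    for i, s in enumerate(scales):
        psi_hat = bump_wavelet_hat(s * omega, mu, sigma)
        coeffs[i] = np.sqrt(s) * np.fft.ifft(X * np.conj(psi_hat))
    return coeffs, scales

fs = 8000                                         # assumed sampling rate in Hz
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 600 * t)                # illustrative 600 Hz tone
freqs = np.arange(200, 1601, 50)                  # illustrative centre frequencies in Hz
coeffs, scales = cwt_bump(tone, fs, freqs)
scalogram = np.abs(coeffs)                        # rows ordered by centre frequency
```

Because each scale is computed independently of the others, the loop over scales could be distributed across workers without changing the result.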

The spatial features thus created are then passed to the classifiers in the next step of the algorithm. We discuss neural network and more traditional implementations separately in the upcoming sections.

Fig. 1

Illustration of key properties of the bump wavelet, constructed from Eqs. 2 and 3. a Bump mother wavelets of fixed scale, \(s_{10}\), with varying values of \(\mu\), \(\sigma\), constructed from Eq. 3. b By converting wavelet scale to frequency, \(f = (\frac{1}{s}\frac{\mu }{2\pi })\), we can illustrate the tiling of the frequency plane with the bump wavelet

Neural network configurations

In this subsection, we start by providing definitions for the layers and parameters used in our convolutional neural network model. Thereafter, we describe how they were used in the experimental setting.

A convolutional layer \(H_{\text{conv}}{:}\;{\mathbb {R}}^{h_1\times w_1 \times c} \rightarrow {\mathbb {R}}^{h_2\times w_2 \times N_k}\) with input tensor \(\mathbf {X} \in {\mathbb {R}}^{h_1\times w_1 \times c}\) and output tensor \(\mathbf {Y} \in {\mathbb {R}}^{h_2\times w_2 \times N_k}\) is given by the sequential application of \(N_k\) learnable convolutional kernels \(\mathbf {W}_{p} \in {\mathbb {R}}^{k\times k}, p < N_k\) to the input tensor. Given our single-channel (\(c=1\)) input representation of the signal \(\mathbf {X} \in {\mathbb {R}}^{h_1\times w_1 \times 1}\) and a single kernel \(\mathbf {W}_{p}\), their 2D convolution \(\mathbf {Y}_p\) is given by [24, Chapter 9]:

$$\begin{aligned} \mathbf {Y}_p(i, j) = (\mathbf {X}*\mathbf {W}_p)(i, j) = \sum _{i'}\sum _{j'}\mathbf {X}(i-i',j-j')\mathbf {W}_p(i',j'). \end{aligned}$$
(4)

The \(N_k\) individual outputs are then passed through a nonlinear function \(\phi\) and stacked as a tensor \(\mathbf {Y}\). Conventional choices for the activation \(\phi\) include the sigmoid function, the hyperbolic tangent and the rectified linear unit (ReLU).
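The layer definition above can be sketched directly in NumPy. The example assumes that the convolution is evaluated only where the kernel fully overlaps the input (so that \(h_2 = h_1 - k + 1\) and \(w_2 = w_1 - k + 1\)), uses the ReLU as \(\phi\), and omits bias terms; the function names and the toy input are illustrative.

```python
import numpy as np

def conv2d_valid(X, W):
    """2D convolution of Eq. 4, kept only where the kernel fully overlaps X."""
    k = W.shape[0]
    h2, w2 = X.shape[0] - k + 1, X.shape[1] - k + 1
    W_flipped = W[::-1, ::-1]          # convolution flips the kernel; correlation does not
    Y = np.empty((h2, w2))
    for i in range(h2):
        for j in range(w2):
            Y[i, j] = np.sum(X[i:i + k, j:j + k] * W_flipped)
    return Y

def relu(Y):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(Y, 0.0)

def conv_layer(X, kernels):
    """Apply each kernel, pass the result through ReLU and stack to (h2, w2, N_k)."""
    return np.stack([relu(conv2d_valid(X, W)) for W in kernels], axis=-1)

X = np.arange(16, dtype=float).reshape(4, 4)      # toy single-channel input
W = np.array([[1.0, 0.0], [0.0, -1.0]])           # toy 2 x 2 kernel
Y = conv_layer(X, [W, -W])                        # N_k = 2 feature maps of size 3 x 3
```

Many deep learning libraries compute a cross-correlation (no kernel flip) under the name convolution; since the kernels are learned, the two conventions are interchangeable in practice.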

The data size constraint results in an architecture choice (Fig. 2) of few layers and free parameters. Our network consists of an input layer connected sequentially to a single convolutional layer and a fully connected layer, which is connected to the two output classes with dropout [48] with probability p. ReLU activations are employed based on their desirable training convergence properties [31]. Finally, we perform grid search over potential candidate hyperparameters using tenfold cross-validation on a subset of the mosquito training data. We show the results of these in Sect. 4.2. The combination of cross-validation and dropout helps avoid overfitting to our scarce data environment. This is shown by the excellent performance transfer with no hyperparameter re-tuning in Sect. 5.

Fig. 2

The CNN pipeline. 1.5 s spectrogram of mosquito recording is partitioned into images with \(c=1\) channels, of dimensions \(h_1 \times w_1\). This serves as input to a convolutional network with \(N_k\) filters with kernel \(\mathbf {W}_p \in {\mathbb {R}}^{k\times k}\). Feature maps are formed with dimensions reduced to \(h_2 \times w_2\) following convolution. These maps are fully connected to \(N_d\) units in the dense layer, fully connected to 2 units in the output layer
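To see how the layer sizes in Fig. 2 translate into free parameters, the following sketch counts the weights of a network with one convolutional layer, one dense layer and a two-unit output. It assumes a convolution without padding or pooling and one bias per unit; the function name and the example values of \(k\), \(N_k\) and \(N_d\) are hypothetical and are not the values selected by the grid search.

```python
def cnn_shapes_and_params(h1, w1, k, n_k, n_d, n_out=2):
    """Feature-map size and parameter counts for conv -> dense -> output."""
    h2, w2 = h1 - k + 1, w1 - k + 1            # no padding, stride 1 (assumed)
    conv_params = n_k * (k * k + 1)            # c = 1 input channel, one bias per kernel
    flat = h2 * w2 * n_k                       # flattened feature maps fed to the dense layer
    dense_params = flat * n_d + n_d
    out_params = n_d * n_out + n_out
    return {
        "feature_map": (h2, w2),
        "conv": conv_params,
        "dense": dense_params,
        "output": out_params,
        "total": conv_params + dense_params + out_params,
    }

# Hypothetical configuration for a 256 x 10 input image.
summary = cnn_shapes_and_params(h1=256, w1=10, k=3, n_k=16, n_d=64)
print(summary)
```

Under these assumptions the dense layer accounts for almost all of the parameters, so the number of filters and dense units are the main levers for matching model capacity to a small dataset.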

Traditional classifier baseline

As a baseline, we compare the neural network models with more traditional classifiers that typically require explicit feature design. We choose three candidate classifiers widely used in machine learning with audio: random forests (RFs), naïve Bayes (NB) and support vector machines using a radial basis function kernel (RBF-SVMs). Their popularity stems from ease of implementation, reasonably quick training and competitive performance [47], especially in data-scarce problems. For brevity, we present results with the best-performing of these only, namely the SVM.

We selected ten features to encode the observed raw data: STFT spectrogram slices with 256 coefficients (created with a Hanning window and 256 samples of overlap), 13 MFCCs, entropy, energy entropy, spectral entropy, flux, roll-off, spread, centroid, and the zero crossing rate (for a detailed explanation of these features, see for example the open-source audio signal analysis toolkit by Giannakopoulos [22]). We note that our choice of feature parameters is based on past literature [19, STFT], [38, MFCCs], as well as empirical evidence. Prior parameterization of the feature space is necessary to some extent, as the number of feature and classifier parameters grows combinatorially to the point where joint optimization of all possible variables is infeasible. We select certain aspects of classifier-feature pipelines by cross-validation as detailed in Sect. 4.2.
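For readers unfamiliar with the scalar features listed above, a minimal sketch of four of them follows. The definitions are common textbook forms and may differ in detail (normalization, roll-off fraction, windowing) from those of the toolkit in [22]; the function names, the 90% roll-off fraction and the 50 ms test frame are assumptions.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    positive = frame >= 0
    return float(np.mean(positive[1:] != positive[:-1]))

def spectral_features(frame, fs, rolloff_fraction=0.9):
    """Spectral centroid, spread, entropy and roll-off of one audio frame."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    p = mag / (mag.sum() + 1e-12)                      # spectrum as a distribution
    centroid = float(np.sum(freqs * p))
    spread = float(np.sqrt(np.sum((freqs - centroid) ** 2 * p)))
    entropy = float(-np.sum(p * np.log2(p + 1e-12)))
    cumulative = np.cumsum(mag ** 2)                   # cumulative spectral energy
    idx = np.searchsorted(cumulative, rolloff_fraction * cumulative[-1])
    return {"centroid": centroid, "spread": spread,
            "entropy": entropy, "rolloff": float(freqs[idx])}

fs = 8000                                              # assumed sampling rate in Hz
t = np.arange(400) / fs                                # one 50 ms frame
tone = np.sin(2 * np.pi * 600 * t)                     # illustrative 600 Hz tone
noise = np.random.default_rng(0).standard_normal(400)  # broadband reference

print(zero_crossing_rate(tone), spectral_features(tone, fs))
print(zero_crossing_rate(noise), spectral_features(noise, fs))
```

A narrowband tone yields a low spectral entropy and a centroid near its frequency, whereas broadband noise spreads energy across the spectrum; this contrast is what makes such features usable for detection.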

Experimental details

Datasets

The mosquito data used here were recorded in January 2016 within culture cages containing both male and female Culex quinquefasciatus. Mosquito wingbeat sounds commonly have a fundamental frequency in the range of 150–750 Hz [14]. In noisy recording conditions, higher harmonics are less audible due to the sharper fall-off of shorter wavelength waves. Furthermore, the signals are sampled with inexpensive smartphone microphones to allow widespread deployment at low cost. Given the quality of these microphones, we observe empirically that sound emitted by mosquitoes mostly disappears in noise for frequencies higher than the third harmonic. We therefore choose to sample at \(F_s = 8\) kHz. Figure 3 shows an excerpt of a particularly faint recording in the windowed frequency domain. For comparison, we also illustrate the wavelet scalogram taken with the same number of scales as frequency bins, \(h_1\), in the STFT. We plot the logarithm of the absolute value of the derived coefficients against the spectral frequency of each feature representation. Figure 3 (lower) shows the class labels \(y_i \in \{0,1\}\) (absence and presence of a mosquito, respectively), as supplied by four individual human researchers. These labels are created in version 2.2.2 of Audacity [5] with access to the recording audio and a matching spectrogram visualization. Of these, one particularly accurate label set created with great care under ideal conditions is taken as a gold standard reference to both train the algorithms and benchmark against the remaining experts. The classifications are restricted to only 2 classes due to the absence of labelled data in a multi-species scenario.
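As a quick check of this choice of sampling rate, taking the upper end of the quoted fundamental range and assuming that the third harmonic lies at three times the fundamental, the highest frequency of interest sits comfortably below the Nyquist frequency:

$$\begin{aligned} 3 \times 750\,\text{Hz} = 2250\,\text{Hz} < \frac{F_s}{2} = 4000\,\text{Hz}. \end{aligned}$$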

Fig. 3

STFT (top) and wavelet (middle) representations of signal with \(h_1 = 256\) frequency bins and wavelet scales, respectively. Corresponding varying class labels (bottom) as supplied by human labellers. The wavelet representation shows greater contrast in horizontal constant frequency bands that correspond to the mosquito tone

Cross-validated parameter search

In this section, we describe the experiment design and choice of hyperparameters, optimized to maximize \(F_1\) score over a cross-validation sample. The available 57 mosquito recordings were split into 50% training and 50% held-out data. The training data were then further split tenfold to perform cross-validation, creating approximately 3000–30,000 training samples, for window widths \(w_1 = 10\) and \(w_1 = 1\) samples, respectively. The neural networks were trained with a batch size of 256 for 20 epochs, according to validation accuracy in conjunction with early stopping criteria.

For fair comparison, we partition the data (choose window length w) to the strengths of each individual classifier. When evaluating cross-validation performance over the label interval, our hyperparameter optima (\(w_1 = 10\) for the CNN, and \(w_1=1\) for the SVM) given in bold in Table 1 suggest stacking the windows together creates feature vectors that lead to performance degradation for the SVM. Therefore, for each CNN input image with height \(h_1 = 256\) and width \(w_1 = 10\), our SVM will use 10 training samples with \(w = 1, h = 256\) instead. Optimum window widths may vary with the dynamics of the signal, so it is important to consider this parameter systematically. In particular, should significantly more data be available, the drawbacks due to the use of larger windows (decrease in training samples the classifier sees) would be mitigated by the higher number of training samples at disposal. Conversely, the advantage of longer windows lies with supplying a longer temporal context.
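The effect of the window width on the number of training samples can be sketched as follows. The example splits a spectrogram of shape \(h_1 \times T\) into non-overlapping windows of width \(w_1\); assigning each window the majority label of its columns is an assumption made for illustration, as are the function name and the toy array sizes.

```python
import numpy as np

def partition_spectrogram(spec, labels, w1):
    """Split spec (h1, T) into windows (n, h1, w1) with one label per window."""
    h1, T = spec.shape
    n = T // w1                                           # trailing columns are dropped
    windows = spec[:, :n * w1].reshape(h1, n, w1).transpose(1, 0, 2)
    votes = labels[:n * w1].reshape(n, w1).mean(axis=1)   # fraction of positive columns
    return windows, (votes >= 0.5).astype(int)

h1, T = 256, 3000                                         # toy spectrogram size
spec = np.random.default_rng(0).random((h1, T))
labels = (np.arange(T) >= T // 2).astype(int)             # toy per-column labels

cnn_x, cnn_y = partition_spectrogram(spec, labels, w1=10)  # images for the CNN
svm_x, svm_y = partition_spectrogram(spec, labels, w1=1)   # single slices for the SVM
print(cnn_x.shape, svm_x.shape)
```

Each CNN image with \(w_1 = 10\) covers the same data as ten SVM samples with \(w_1 = 1\), which is the tenfold difference in sample count noted above.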

The traditional classifiers are cross-validated with principal component analysis (PCA) and recursive feature elimination [25, RFE], with the number of components controlled by n and m, respectively. The best-performing feature set for all traditional classifiers is the set extracted by cross-validated RFE, outperforming all PCA reductions for every classifier-feature pair. The highest scoring hyperparameter, \(m=27\), defines a feature set that we denote as \(\text{RFE}_{88}\) and which retains 88 dimensions from the ten original features spanning 304 dimensions (\(F_{10} \in {\mathbb {R}}^{304}\)).

Table 1 Results for grid search over hyperparameter values

Computational details

We consider computational complexity by splitting the pipelines into three processing stages: feature transformation, classifier training and classifier prediction. The overall compute time is thus the sum of all three. We further break this down for individual pipelines, noting that various software libraries can differ significantly in processing time for the same transformation. We offer some insights from the figures given in Table 2 in this subsection.

The native training complexity of the RBF-SVM is stated as \(O(n_{\rm SV}d)\), where d is the input dimension and \(n_{\rm SV}\) is the number of support vectors [13]. Table 2 shows that an increase in d coupled with the already large number of training samples (leading to a large \(n_{\rm SV}\)) causes a significant slowdown in both training and prediction with the SVM. A feature dimension reduction, as encountered with the MFCC or RFE approaches, while slightly more costly as a preprocessing step, speeds up the training and prediction significantly.
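The dependence on \(n_{\rm SV}\) and d is visible in the kernel expansion that an RBF-SVM evaluates for every sample it scores. The following is a minimal sketch of that expansion, not the scikit-learn implementation; the function and argument names and the kernel width `gamma` are illustrative.

```python
import numpy as np

def rbf_decision(x, support_vectors, dual_coef, intercept, gamma):
    """Decision value of an RBF-kernel SVM for a single input x of dimension d."""
    # One squared distance per support vector: n_SV * d multiply-adds.
    sq_dist = np.sum((support_vectors - x) ** 2, axis=1)
    kernel = np.exp(-gamma * sq_dist)
    return float(dual_coef @ kernel + intercept)

support_vectors = np.array([[0.0, 0.0], [1.0, 0.0]])   # toy example: n_SV = 2, d = 2
dual_coef = np.array([1.0, -1.0])                      # signed dual coefficients
value = rbf_decision(np.array([0.0, 0.0]), support_vectors, dual_coef,
                     intercept=0.0, gamma=1.0)
print(value)
```

Every evaluation touches all support vectors in all d dimensions, which is why reducing d (as with the MFCC or RFE features) shortens both the kernel evaluations during training and the prediction step.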

The CNN was trained in Keras [11], with an NVIDIA GTX 970 GPU. This allows quick training and prediction, resulting in much shorter computation times than those of the scikit-learn SVM (running on a CPU) when working with a large feature space.

Furthermore, the CWT is highly redundant and so incurs a greater computational cost. Its computational complexity increases linearly with the number of wavelet scales (provided sufficient RAM). Despite this, the sum of feature transformation and training time is well under the length of the audio recordings, suggesting real-time detection to be perfectly feasible given appropriate hardware. A significant reduction in the CWT processing time can be achieved by calculating each wavelet scale in parallel, due to the independence of each computation per scale. Further considerable speed-up can be achieved by utilizing a discrete wavelet transform or a fast wavelet transform [16]. While not the focus of this paper, this may be worth considering when transferring algorithms to embedded devices and specialized hardware in future work.

Table 2 Execution time given in seconds for feature-classifier pipelines trained on 15 min (900 s) and evaluated over 15 min of audio data sampled at 8000 Hz

Classification performance

The performance metrics are defined at the resolution of the supplied label interval (0.1 s granularity) and presented in Table 3 for the mosquito dataset, and in Table 4 and Fig. 4 for the BirdCLEF subset. We highlight three key results in both applications.

Mosquito detection

Firstly, both the traditional and deep learning algorithms accurately and reliably detect mosquitoes, far surpassing human labellers in both \(F_1\) score and precision-recall (PR) area. Since human labels were supplied as absolute (either \(\hat{y}_i = 1\) or \(\hat{y}_i = 0\)), an incorrect label incurs a large penalty on precision-recall curve areas, explaining the large PR area deficit attributed to human labelling.
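The difference between hard labels and graded scores can be illustrated with a small sketch. It computes precision, recall and \(F_1\) for binary predictions and then sweeps a decision threshold to trace a PR curve; the function names and the toy labels and scores are assumptions, and the sketch does not reproduce the exact PR-area computation used for the tables.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 score for binary labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return float(precision), float(recall), float(f1)

def pr_points(y_true, scores):
    """(precision, recall) at every distinct threshold present in the scores."""
    scores = np.asarray(scores, dtype=float)
    return [precision_recall_f1(y_true, scores >= t)[:2] for t in np.unique(scores)]

y_true = [1, 1, 0, 0, 1]                    # toy reference labels
hard = [1, 0, 0, 1, 1]                      # absolute labels, as a human would supply
soft = [0.9, 0.4, 0.2, 0.6, 0.8]            # graded scores, as a classifier would supply

print(precision_recall_f1(y_true, hard))
print(len(pr_points(y_true, hard)), len(pr_points(y_true, soft)))
```

With absolute labels the curve is pinned by a single operating point (plus the trivial point where everything is labelled positive), so each mistake lowers the whole curve, whereas graded scores allow confident correct predictions to be ranked ahead of uncertain ones.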

Secondly, the CNN provides a consistent performance boost with every feature combination, even for the features specifically handcrafted for use with SVMs (\(\text{RFE}_{88}\)).

Finally, we note that the wavelet pipeline strongly outperforms the STFT, with both the CNN and SVM.

Bird classification

We now make three observations from Fig. 4 and Table 4, representing a scenario that is novel to the classifier pipelines.

Fig. 4

BirdCLEF subset: boxplots of mean accuracy per class (\(F_1\) score) for \(n=30\) trials of the CNN and SVM methods, grouped by feature combination

Firstly, the wavelet features provide the best performance with both the CNN and the SVM, with the top result achieved by the CNN wavelet pipeline. As with the prior application, the wavelet significantly outperforms the STFT with all classifiers, with the difference magnified in this application.

Secondly, the downside to the elaborate hand-tuned feature selection scheme (\(\text{RFE}_{88}\)) quickly becomes evident when comparing performance conditioned on these features (with \(F_1\) scores of approximately 0.85) to the results of either general deep learning configuration (with \(F_1\) scores of 0.91 and 0.93 for the STFT and wavelet, respectively). We find these results to be consistent with claims made about the unsustainable nature of handcrafted feature and classifier design [28].

Table 3 Mosquito detection: summary classification metrics reported as means ± the standard deviation from \(n=30\) random hold out dataset splits with 50% training data, and 50% test data
Table 4 BirdCLEF subset: summary classification metrics reported as means ± the standard deviation from \(n=30\) random dataset splits with 50% training data, and 50% test data

Finally, the CNN performs significantly better with high-dimensional, generalizable features (STFT and wavelet) in this more difficult problem.
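The evaluation protocol stated in the captions of Tables 3 and 4 (\(n=30\) random splits with 50% training and 50% test data, reported as mean ± standard deviation) can be sketched as follows. This is a minimal illustration rather than the code used for the reported results. The helper names, the fixed seed, the use of the sample standard deviation and the absence of stratification are all assumptions, and the toy `majority_class_accuracy` scorer merely stands in for training and scoring a CNN or SVM.

```python
import random
import statistics


def holdout_splits(n_samples, n_trials=30, train_fraction=0.5, seed=0):
    """Yield (train_indices, test_indices) for repeated random hold-out splits."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_samples))
    for _ in range(n_trials):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]


def repeated_holdout(evaluate, n_samples, **kwargs):
    """Run `evaluate(train, test)` on every split; return (mean, std) of the scores."""
    scores = [evaluate(train, test)
              for train, test in holdout_splits(n_samples, **kwargs)]
    # Sample standard deviation (an assumption; the convention is not stated).
    return statistics.mean(scores), statistics.stdev(scores)


# Toy stand-in for a classifier: predict the majority class of the training split.
toy_labels = [0] * 60 + [1] * 40


def majority_class_accuracy(train, test):
    train_labels = [toy_labels[i] for i in train]
    majority = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    return sum(1 for i in test if toy_labels[i] == majority) / len(test)


mean_score, std_score = repeated_holdout(majority_class_accuracy, len(toy_labels))
print(f"accuracy: {mean_score:.3f} ± {std_score:.3f}")
```

Fixing the seed makes the 30 splits reproducible, so different feature and classifier combinations can be compared on identical partitions.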

Visualizing discriminative power

In the absence of data labels, visualizations can be key to understanding how neural networks obtain their discriminative power. To ensure that the characteristics of the signal have been learnt successfully, we compute the frequency spectra \(\mathbf {x}_{i,\text{test}}(f)\) of samples that maximally activate the network’s units. We compare this to the training spectra \(\mathbf {x}_{i,\text{train}}(f)\) using Algorithm 2.

Figures 5 and 6 show that the test samples most closely resembling the training set cause the highest activations, a property we expect of a successfully trained network. Furthermore, Fig. 5 shows that our prior expectation for the mosquito class matches the spectral content that triggers the most confident predictions. This takes the form of a distinct frequency peak around 660 Hz and its harmonic at 1325 Hz, which differs significantly from the noise class. Similarly, Fig. 6 shows unique spectral regions dedicated to each species, also with significant deviation from the noise class.
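One plausible reading of this comparison is sketched below. Algorithm 2 itself is not reproduced in this section, so the sketch follows only the description in the text and figure captions: take the 10% most confident test predictions, average their spectra, normalize, and compare with the mean labelled training spectrum. The function names, the peak normalization, the cosine similarity measure and the toy six-bin spectra are all assumptions introduced for illustration. The paper makes the comparison visually.

```python
import math


def mean_spectrum(spectra):
    """Average a list of equal-length spectra bin by bin."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]


def normalize(spectrum):
    """Scale a spectrum so that its largest coefficient equals 1."""
    peak = max(spectrum)
    return [v / peak for v in spectrum] if peak > 0 else list(spectrum)


def top_confident_spectrum(spectra, scores, fraction=0.1):
    """Normalized mean spectrum of the most confidently predicted samples."""
    n_top = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return normalize(mean_spectrum([spectra[i] for i in ranked[:n_top]]))


def cosine_similarity(a, b):
    """Hypothetical numeric stand-in for the visual comparison in the figures."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


# Toy training spectra for the positive class, with a peak in bin 3.
train_spectra = [[0.1, 0.1, 0.2, 1.0, 0.2, 0.1],
                 [0.1, 0.2, 0.3, 0.9, 0.3, 0.1]]
# Toy test set: two peaked samples scored confidently, eighteen flat ones.
test_spectra = [[0.1, 0.1, 0.2, 0.95, 0.25, 0.1]] * 2 + [[0.5] * 6] * 18
test_scores = [0.99, 0.97] + [0.2] * 18

learned = top_confident_spectrum(test_spectra, test_scores, fraction=0.1)
labelled = normalize(mean_spectrum(train_spectra))
print("peak bin of learned spectrum:", learned.index(max(learned)))
print("similarity to labelled spectrum:", cosine_similarity(learned, labelled))
```

A low similarity, or agreement only in bins that carry no signal information, would be the warning sign that the network has latched onto a recording artefact.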

As we chose a wavelet basis with a scale directly proportional to a centre frequency, we can directly compare spectral representations with the STFT. The wavelet representation results in more easily distinguishable peaks in the mosquito class (Fig. 5), and overall smoother spectral representations of the bird calls (Fig. 6). We note that a mismatch between high-scoring test spectra and labelled spectra (or matches in non-information-bearing regions of the spectrum) may suggest that the network is learning to detect the noise profile of the microphones used for data collection rather than the sound emitted by the object of interest.

Fig. 5

Culex mosquito dataset: plot of normalized feature coefficient against STFT frequency bin (a), and wavelet centre frequency (b), for the 10% most confident predicted outputs over a test dataset. The learned spectra \(\mathbf {x}_{i,\text{test}}(f)\) for the highest N scores closely match the labelled class spectra \(\mathbf {x}_{i, \text{train}}(f)\)

Fig. 6

BirdCLEF subset: plot of normalized feature coefficient against STFT frequency bin (a) and wavelet centre frequency (b), for the 10% most confident predicted outputs over a test dataset. The learned spectra \(\mathbf {x}_{i,\text{test}}(f)\) closely match the labelled class spectra \(\mathbf {x}_{i, \text{train}}(f)\)


Conclusions

This paper presents a novel approach for acoustic classification in a real-world, data-scarce scenario. We are able to more accurately and reliably differentiate between the presence and absence of a mosquito than human labellers. Furthermore, we show that a CNN outperforms generic classifiers such as support vector machines commonly used in the field.

Moreover, we highlight the importance of the generality of deep learning approaches by evaluating classification performance over a 10-class subset of bird species recordings, where the wavelet-trained CNN outperforms traditional classification algorithms with no hyperparameter re-tuning of either approach. The consistent improvement observed with wavelet features over the short-time Fourier transform warrants further research on whether the STFT, which is overwhelmingly used in the literature, is the correct choice of base transform.

Finally, our generic feature transform allows us to visualize the learned class representation by back-propagating predictions made by the network. We thus verify that the network correctly infers the frequency characteristics of the signal, rather than a peculiarity of the recording such as the microphone noise profile. As more data becomes available, future work will aim to deploy our algorithm in a physical device to allow for large-scale bioacoustic classification.