1 Introduction

Soundscape ecology, which studies the relationships between the landscape and its sounds at multiple temporal and spatial scales [1, 2], has been used to understand ecological processes [3, 4] and to drive conservation plans that mitigate biodiversity loss [5,6,7,8]. However, passive acoustic monitoring (PAM) requires analyzing audio recordings from tens to hundreds of devices, which introduces several processing and interpretability challenges [9]. Consequently, there is a growing need for features that represent the soundscape and are useful for assessing landscape characteristics [10], for example, to quantify landscape degradation [11, 12].

To address these challenges, ecological acoustic indices (EAI) have been developed to encode complex and biologically relevant information in a single value by analyzing sound properties such as intensity, frequency, and temporal and spectral patterns, which eases and speeds up the processing of large volumes of data [13, 14]. These indices, combined with statistical and machine learning methods, have been widely used to elucidate ecosystem attributes such as wildlife presence [15], bioacoustic activity [16], habitat quality [17], and transformation level [18], or to predict habitat type and vegetation structure [19]. However, these indices are sensitive to low signal-to-noise ratios in the recording process [20] and to noise (such as geophony or anthropophony) masking the sounds of interest [21]. Moreover, several studies using the same indices report contradictory results when correlating them with biological variables [22].

In contrast to hand-crafted features such as EAI, deep learning-based approaches exploit the structure of the data to learn representations of the acoustic recordings [23,24,25], which helps these models adapt to several classification tasks [26,27,28,29]. However, these models usually require a large amount of labeled data to learn suitable representations, and manual annotations from experts are scarce in soundscape studies, so data-driven alternatives must be developed [30, 31]. Consequently, transfer learning has emerged as an alternative for applying deep learning methods to acoustic recordings, owing to its inherent ability to adapt knowledge from one or several source domains to a target domain [28, 32, 31].

Among deep learning methods, convolutional neural networks (CNNs) are particularly notable for their ability to automatically learn features in their convolutional layers, making them highly effective for acoustic analysis in ecoacoustics. For instance, CityNet [33] uses CNNs to quantify biotic and anthropogenic sounds in urban environments, while BirdNET [34], based on the ResNet architecture, specializes in identifying bird species from their sounds. Another prominent example is VGGish, a well-established pretrained neural network widely used in audio processing. Designed with transfer and zero-shot learning in mind, VGGish generates compact yet informative deep embeddings of audio signals that capture essential acoustic features [35]. These embeddings are helpful for various audio processing tasks such as audio classification, content-based retrieval, and acoustic scene understanding, with applications demonstrated in numerous studies [36, 37]. Recently, VGGish features have also been investigated as potential ecological indicators, further expanding their applicability in environmental research. In [38], VGGish deep embeddings are used as a universal feature set in both supervised and unsupervised tasks, yielding ecological insight across spatiotemporal scales and enabling the detection of anomalies. Other studies have used transfer learning on VGGish as a feature extraction method to identify bird species [39], avian richness [40], and sound components [41], and to discriminate between marine mammal species [42].

Another aspect to consider in PAM studies is the temporal domain. Authors have observed that some times of the day offer more relevant acoustic information than others [21]; for example, some taxonomic groups modulate their acoustic activity with the diel cycle [43]. Moreover, in [29], the authors found links between soundscape components across time and showed that different patterns may be revealed by analyzing acoustic features at finer time scales. From a deep learning perspective, several studies have used features extracted from pretrained models to analyze spatiotemporal patterns in ecosystems. For instance, [38] explores VGGish-based features to discern seasonal or daily patterns, employing supervised models trained on these features, and [41] leverages unsupervised learning models to classify sound component types from VGGish-based features, which are then used to detect spatiotemporal variations in ecosystems. Although these advances demonstrate the importance of incorporating temporal information in audio analysis, they typically apply such temporal data during post-processing rather than integrating it directly into the learning process, which prevents machine learning models from fully exploiting this valuable information. Therefore, developing models or architectures that incorporate and exploit temporal patterns directly during learning is crucial for enhancing performance and facilitating interpretability in machine learning applications.

Additionally, these models must be interpretable. Each EAI is designed to underscore specific attributes of the landscape. Conversely, the interpretability of deep embeddings is often challenging, as they are influenced by patterns across the training data, even recordings unrelated to soundscapes when transfer learning is employed [44]. Consequently, establishing a correlation between features that are not biologically grounded and ecosystem outcomes becomes complex. To cope with this issue, class activation mapping (CAM) techniques have been employed in sound classification and detection tasks to examine the effectiveness of the constructed feature maps and to visualize which temporal or frequency regions play a significant role in the network's decision [45,46,47].

In this study, we present a model architecture that enables machine learning techniques to directly incorporate temporal information during the learning process. This approach overcomes the limitation of including temporal information in a post-processing stage. We hypothesize that by feeding machine learning models with temporal blocks—representing the daily dynamics of soundscapes—we can significantly improve model performance in identifying ecological patterns and facilitate their interpretability. The proposed methodology is structured into three main steps: Initially, we extract deep learning-based features from the audio recordings at each recording site and organize these into 24-hour temporal blocks to capture the site’s daily dynamics. These temporal blocks are then used to train supervised models, leveraging the temporal information to classify the ecological patterns at the locations where each audio device is situated based entirely on its audio recordings. Additionally, we employ an interpretability technique to identify the most significant temporal patterns within the blocks, pinpointing specific times of day that enhance the model’s ability to differentiate among classes.

Our study focuses on PAM data (a rich source of ecological information) from two tropical dry forest sites. Each site is equipped with multiple recorders arranged in a grid, providing comprehensive coverage of the study area. In the first study site, recorders were positioned in areas characterized by one of three transformation levels: low, medium, or high. In the second study site, the recorders were located in areas classified into cover classes such as forest, savanna, or pasture. This diverse and extensive dataset allows us to test the effectiveness of our proposed model architecture in a variety of ecological contexts.

To test our approach, we compare it against models trained with EAI features and against supervised learning techniques such as random forest and a feed-forward neural network. As a result, we demonstrate that 24-hour temporal blocks allow the models to exploit acoustic information to identify transformation/coverage levels in both ecosystems. Moreover, the network trained with deep learning-based features outperforms the models trained with EAI. Additionally, the CAM technique provides the relevance of each time of day in the classification.

This paper is structured as follows: the materials and methods are presented in Sect. 2 and the experimental setup in Sect. 3, followed by the results in Sect. 4 and the discussion in Sect. 5; conclusions are drawn in Sect. 6.

2 Materials and methods

2.1 Study sites

2.1.1 Caribbean dataset

The study area encompasses tropical dry forest ecosystems situated in the departments of La Guajira and Bolivar in the Colombian Caribbean region, specifically within the Arroyo and Cañas River basins. Twenty-four automatic recording units were deployed around 11° 11′ 11.0″ N, 73° 26′ 32.9″ W in the La Guajira department and around 9° 56′ 23.8″ N, 75° 10′ 06.4″ W in Bolivar (see Fig. 1). These sites are characterized by high levels of endemism and exhibit considerable rainfall variability. They are located at elevations ranging from 0 to 1000 m and experience dry periods lasting between 3 and 6 months.

Fig. 1

Geographical location of passive acoustic monitoring devices in the Caribbean region of Colombia. Pink circles: 2015, blue circles: 2016, and red circles: 2017

Acoustic recordings were acquired between December 2015 and March 2017 by the Alexander von Humboldt Institute (IAVH). They used SM2 and SM3 recorders from Wildlife Acoustics, programmed to record for 5 min every 10 min over five consecutive days, followed by a five-day recording hiatus. These recordings were part of the Global Environment Facility (GEF) project, aimed at characterizing the biodiversity of the remaining tropical dry forests in Colombia [48].

The soundscapes were manually labeled based on the level of ecological transformation (high, medium, low). IAVH researchers with expertise in vegetation and ecosystem types performed this categorization through direct field observation. A low transformation level was assigned to areas exhibiting a high proportion of preserved or newly regenerated forest. In contrast, areas with significant forest loss were labeled as having a high transformation level.

2.1.2 Rey Zamuro dataset

The study was conducted at the Rey Zamuro and Matarredonda private reserves, located in the Municipality of San Martín (Meta Department, Colombia), between 3° 33′ 21.1″ N, 73° 24′ 41.9″ W and 3° 30′ 45.0″ N, 73° 23′ 11.0″W. The area, spanning 6000 hectares, comprises 60% natural savanna ecosystems, interspersed with areas of introduced pastures, while the remaining 40% is covered by forest ecosystems [49].

Acoustic recordings were carried out in September 2022. We deployed 93 AudioMoth acoustic devices across a 13 \(\times\) 8 grid, as depicted in Fig. 2, with devices placed 400 m apart [50]. The devices were set to record at 192 kHz, operating for seven days and capturing one minute of audio every 14 min.

Fig. 2

Geographical location of PAM devices at Rey Zamuro. Right panel: each black dot corresponds to a recording device

The soundscapes of Rey Zamuro were categorized as Forest, Savanna, or Pasture according to land cover. These classifications were determined according to the locations of the automatic recording units. We conducted a vegetation cover assessment using the smile random forest algorithm on the Google Earth Engine platform [51], applied to a low-cloud Sentinel-2A satellite image. The resulting vegetation cover map was validated and manually adjusted, incorporating information gathered during fieldwork.

2.2 Methods

The proposed methodology for finding transformation/coverage levels of ecosystems using temporal patterns of PAM recordings involves the following stages: (i) computing features from the audio of every recording site and building temporal blocks that represent the site’s temporal dynamics throughout the day, (ii) classifying each block by leveraging this temporal information, and (iii) identifying relevant patterns in each temporal block using an interpretability technique that pinpoints the times of day that most help the model differentiate between classes. We first introduce the main concepts of convolutional neural networks to establish the basic principles.

2.2.1 Convolutional Neural Network (CNN)

Convolutional neural networks are deep learning models that have recently emerged as powerful tools for ecoacoustic tasks [52]. CNNs consist of convolutional layers and fully connected layers [53, 54]. Convolutional layers apply small learnable filters or kernels (denoted F) to the input data to detect patterns or features while sharing parameters across the input. Let I be the input data volume of size \(W_\textrm{in} \times H_\textrm{in} \times C\), where \(W_\textrm{in}\) is the width, \(H_\textrm{in}\) is the height, and C is the number of channels or depth. Kernels are usually small tensors of size \(f \times f \times C\), i.e., with the same depth as the input volume. The resulting feature map O from the convolution operation can be expressed as Equation (1):

$$\begin{aligned} { O}_{i,j} = \sum \limits _{m=0}^{f-1}\sum \limits _{n=0}^{f-1}\sum \limits _{c=0}^{C-1} { I}_{i+m, j+n, c}\, { F}_{m,n,c} + { b}, \end{aligned}$$
(1)

where b is the bias of the convolution. In matrix form, the convolution operation is given by Eq. (2):

$$\begin{aligned} { O} = { W}*{ I} + { {b}}, \end{aligned}$$
(2)

where W are the filter learnable parameters, i.e., the convolution weights.

The spatial dimensions of the output volume are smaller than those of the input volume because each filter is only applied at positions where it fits entirely within the input. The output size can be computed as:

$$\begin{aligned}{} & {} W_\textrm{out} = \left\lfloor \frac{{W_\textrm{in} - f + 2P}}{S} \right\rfloor + 1; \\{} & {} \quad H_\textrm{out} = \left\lfloor \frac{{H_\textrm{in} - f + 2P}}{S} \right\rfloor + 1, \end{aligned}$$

where \(W_\textrm{in}\), \(H_\textrm{in}\), \(W_\textrm{out}\), and \(H_\textrm{out}\) are the width and height of the input volume and the width and height of the output volume, respectively. P stands for the padding, i.e., the number of extra border rows and columns added to the input volume to preserve spatial information at the edges, S is the stride, i.e., the step size for sliding the filter over the input volume, and \(\left\lfloor \cdot \right\rfloor\) denotes the floor function, returning the largest integer less than or equal to its argument. Each convolution filter produces its own output feature map, capturing different patterns or features. Consequently, multiple filters are used in each convolutional layer, increasing the depth of the output volume.
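
To make the notation concrete, the following NumPy sketch (an illustration under arbitrary input and filter values, not code used in this work) applies Eq. (1) for a single filter and checks the resulting size against the formula above:

```python
import numpy as np

def conv2d_single_filter(I, F, b=0.0, stride=1, padding=0):
    """Naive convolution of Eq. (1) for one filter F over an input volume I."""
    W_in, H_in, C = I.shape
    f = F.shape[0]
    if padding > 0:
        I = np.pad(I, ((padding, padding), (padding, padding), (0, 0)))
    W_out = (W_in - f + 2 * padding) // stride + 1
    H_out = (H_in - f + 2 * padding) // stride + 1
    O = np.zeros((W_out, H_out))
    for i in range(W_out):
        for j in range(H_out):
            patch = I[i * stride:i * stride + f, j * stride:j * stride + f, :]
            O[i, j] = np.sum(patch * F) + b      # Eq. (1)
    return O

# Arbitrary example: 64 x 64 x 3 input, 5 x 5 x 3 kernel, stride 2, padding 1
I = np.random.randn(64, 64, 3)
F = np.random.randn(5, 5, 3)
print(conv2d_single_filter(I, F, stride=2, padding=1).shape)  # (31, 31) = floor((64-5+2)/2)+1
```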

An activation function \(\sigma\) is typically applied element-wise after each convolution to introduce nonlinearity. The Rectified Linear Unit (ReLU)—defined as \(\sigma (O) =\textrm{max}(0, O)\)—is the most commonly used activation function due to its simplicity and effectiveness. Finally, the activated features are followed by pooling operators that reduce spatial dimensions and computational complexity.

Convolutional layers reduce the need for manual feature design by automatically learning and extracting meaningful features from the input data. In this sense, stacking multiple convolutional layers yields low-, mid-, and high-level features.

Additionally, fully connected layers receive input from the learned high-level features extracted by the previous convolutional layers and use them to make predictions or classifications. Fully connected layers are traditional neural network layers that connect every neuron in one layer to every neuron in the next layer, considering the entire input volume, allowing the network to learn complex relationships between features.

To train CNNs in a supervised learning fashion, a loss function measures the difference between the network's predicted outputs and the true labels. To minimize this loss and learn from the data, the network parameters (W and b) are updated in a direction that reduces the loss using a gradient-based optimization algorithm.

2.2.2 Characterizing acoustic recordings

To extract features from the acoustic recordings, we employ VGGish, a well-established pretrained neural network widely recognized in the field of audio processing. Its architecture is specifically designed to generate compact yet informative representations, or deep embeddings, of audio signals, as detailed in [35]. Inspired by the visual geometry group (VGG) network architecture originally devised for image classification [55], VGGish excels at producing deep embeddings that effectively capture pertinent acoustic features. These embeddings lay the groundwork for various audio processing tasks, including audio classification, content-based retrieval, and acoustic scene comprehension, as demonstrated in previous studies [36, 37]. Trained on AudioSet [56], a large-scale, publicly accessible audio dataset featuring millions of annotated clips across 527 categories (from animal vocalizations and musical instruments to human activities and environmental noises), VGGish is adept at handling a broad range of audio types.

The architecture of VGGish comprises multiple convolutional and max-pooling layers with ReLU activation functions, followed by three fully connected layers with dimensions 4096, 4096, and 128. The model processes audio by segmenting it into 0.96-second clips, each transformed into a log-Mel spectrogram that feeds into the neural network. Convolutional layers, equipped with learnable filters, analyze the input spectrogram to identify local patterns and distill low-level features. Max-pooling layers follow, reducing the spatial dimensions of the feature maps while preserving essential information. This step is crucial for capturing and maintaining relevant patterns across various scales, further refining the representations.

The culmination of the process involves the fully connected layers. They transform the aggregated output from the convolutional and max-pooling layers into a lower-dimensional space, producing a 128-dimensional representation. This final mapping is designed to encapsulate global and high-level dependencies, resulting in deep embeddings that encode significant audio signal details. These embeddings are then used as inputs for further analysis, whether through shallow or more complex deep learning methods.

In our setup, we computed 312 deep embeddings for each 5-minute recording from the Caribbean dataset and 62 deep embeddings for each one-minute recording from the Rey Zamuro dataset. Each set of embeddings was subsequently averaged to generate a single 128-dimensional deep embedding vector per recording. In addition, to compare against the features produced by the VGGish model, we computed 60 EAI from each acoustic recording using the Python package scikit-maad [57]. Of these indices, 16 were derived from the time domain and 44 from the spectral domain.
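
As an illustration of this step, the sketch below averages the per-clip VGGish embeddings of one recording into a single 128-dimensional vector. The `vggish_model` callable and the use of `soundfile` for audio loading are assumptions standing in for whichever VGGish port is used; the averaging itself mirrors the procedure described above.

```python
import numpy as np
import soundfile as sf   # assumed audio I/O backend

def embed_recording(wav_path, vggish_model):
    """Average per-clip VGGish embeddings into one 128-D vector per recording.

    `vggish_model` is assumed to map a waveform and its sample rate to an array
    of shape (n_clips, 128), one row per 0.96-s clip.
    """
    waveform, sr = sf.read(wav_path)
    clip_embeddings = vggish_model(waveform, sr)      # e.g. ~312 clips for a 5-min recording
    return np.asarray(clip_embeddings).mean(axis=0)   # shape (128,)

# Hypothetical usage: a 1-min Rey Zamuro recording yields ~62 clips,
# which collapse to a single 128-D deep embedding vector.
# embedding = embed_recording("site_A_recording.wav", vggish_model)
```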

2.2.3 Column-wise 1D Convolutional Neural Network

In the second stage of our study, the objective is to construct temporal blocks that encapsulate a site's daily temporal dynamics using the features computed in the initial stage. For this purpose, we select features from audio recordings taken at each hour of the day and arrange them row-wise to form a matrix of dimensions \(24 \times m\), where m is the number of features: 60 when using EAI and 128 when employing deep learning-based features. This process is shown in Fig. 3. Given that multiple audio recordings are available for each hour, we randomly select one recording per hour; repeating this selection allows us to augment the number of temporal blocks available for training.

Fig. 3

Proposed temporal blocks for EAI and deep embeddings. Features are selected based on the time of day and stacked in rows, creating a temporal block in which each row corresponds to one hour of the day. The resulting temporal block is labeled according to the label of its recording unit's location
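
The block-building step described above can be sketched as follows; `features_by_hour` is a hypothetical mapping from each hour of the day (0–23) to the feature vectors (EAI or deep embeddings) available for that hour at one recording site.

```python
import numpy as np

def build_temporal_block(features_by_hour, rng):
    """Stack one randomly chosen feature vector per hour into a 24 x m block."""
    rows = []
    for hour in range(24):
        candidates = features_by_hour[hour]            # m-dimensional vectors for this hour
        rows.append(candidates[rng.integers(len(candidates))])
    return np.stack(rows)                              # shape (24, m)

rng = np.random.default_rng(0)
# blocks = [build_temporal_block(features_by_hour, rng) for _ in range(100)]
# Each block inherits the label of its recording site (transformation level or land cover).
```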

The feature matrix is then flattened column-wise into a one-dimensional vector of size \(1 \times 24m\), so that the 24 hourly values of each feature are contiguous. This vector offers a comprehensive representation of the daily acoustic profile, capturing the essence of the soundscape over a 24-hour period. Subsequently, we introduce a 1D convolutional layer equipped with 64 kernels of size 24 applied with a stride of 24, so that each kernel combines the 24 hourly values of a single feature. Each kernel in this layer identifies and emphasizes similar audio patterns across different hours of the day, which is crucial for capturing consistent and distinct temporal patterns throughout the day. The feature map produced by this convolutional layer is fed to several fully connected layers to perform the classification, as depicted in Fig. 4.

Fig. 4

Proposed 1D convolutional neural network to leverage temporal information. The 1D CNN is used on the temporal blocks to extract patterns, which are used to train a fully connected network

Finally, each temporal block, built from either feature set (EAI or VGGish), is assigned the label of its recording site. For the Caribbean data, the labels represent the transformation level of the study site; for Rey Zamuro, they correspond to the land cover class.
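
The column-wise flattening and hour-based combination can be checked with a few lines of PyTorch; this is a sketch in which the stride value is inferred from the dimensions stated in this section and in Sect. 2.3, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

m = 128                                    # 60 when EAI features are used
block = torch.randn(24, m)                 # one temporal block (rows = hours)
x = block.t().reshape(1, 1, 24 * m)        # column-wise flattening: the 24 hourly values
                                           # of each feature become contiguous
conv = nn.Conv1d(1, 64, kernel_size=24, stride=24)
print(conv(x).shape)                       # torch.Size([1, 64, 128]): each of the 128
                                           # positions combines the 24 hours of one feature
```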

2.3 Eigen-CAM-based visualization

Neural networks are often regarded as ‘black boxes,’ obscuring the interpretability of their classification decisions. This aspect becomes particularly significant when using deep learning-based embeddings to process soundscapes, as opposed to traditional EAI that inherently carry phenomenological meaning. To address this issue, we compute and visualize the principal components of the learned feature maps from convolutional layers. This approach lets us identify which features remain relevant throughout the neural network’s processing. Such visualization and analysis can be effectively conducted using the Eigen-CAM technique, a gradient-free and class-independent method, as outlined in [58].

Our goal is to interpret the input features within the 24-hour temporal block by examining the patterns captured by the 64 1D convolutional filters, each of which is designed to identify patterns throughout the 24-hour cycle. For this purpose, let \({I} \in \Re ^{N \times 24 \times 128}\) represent the neural network input, where N is the number of temporal blocks. The projection of I onto the 1D convolutional layer in question is defined as \({O} = {W}*{I} + {b}\), where \({W} \in \Re ^{64 \times 1 \times 24}\) and \({b} \in \Re ^{64}\) are the weights and bias of the convolutional layer, respectively, and \({O} \in \Re ^{N \times 64 \times 128}\) is the resulting feature map. The singular value decomposition of the n-th element in the batch of the feature map, \({O}_n \in \Re ^{24 \times 128}\) with \(n \in \{1,\dots , N\}\), is given by Equation (3):

$$\begin{aligned} {O}_n = {U}_n\mathbf {\Sigma }_n {V}_n^{\top }; \end{aligned}$$
(3)

where \({U}_n \in \Re ^{24 \times 24}\) is an orthogonal matrix whose columns are the left singular vectors, \(\mathbf {\Sigma }_n \in \Re ^{24 \times 128}\) is a diagonal matrix containing the singular values, and \({V}_n \in \Re ^{128 \times 128}\) is an orthogonal matrix whose columns are the right singular vectors. To estimate the feature relevance, the projection onto the first right singular vector is computed, as expressed in Equation (4):

$$\begin{aligned} {A}_n = {O}_n{v}_n^{(1)}, \end{aligned}$$
(4)

where \({v}_n^{(1)}\) is the first right singular vector of \({O}_n\) and \({A}_n \in \Re ^{24 \times 1}\) is the relevance by hour of the n-th element of the batch. Finally, the average relevance \({A}\in \Re ^{24 \times 1}\) for each hour is computed by averaging the relevance across all batch elements. The relevance vector A is typically visualized as a heatmap that highlights relevant patterns within the input data.
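
A NumPy sketch of Eqs. (3) and (4) follows, assuming the per-block feature maps have already been collected from the 1D convolutional layer with the \(24 \times 128\) shape used above:

```python
import numpy as np

def eigencam_relevance(feature_maps):
    """Hourly relevance from a batch of feature maps of shape (N, 24, 128),
    following Eqs. (3)-(4): project each map onto its first right singular
    vector and average over the batch."""
    relevances = []
    for O_n in feature_maps:                           # O_n: (24, 128)
        _, _, Vt = np.linalg.svd(O_n, full_matrices=False)
        v1 = Vt[0]                                     # first right singular vector, (128,)
        relevances.append(O_n @ v1)                    # A_n: (24,), one value per hour
    return np.mean(relevances, axis=0)                 # A: (24,)

# A = eigencam_relevance(collected_maps)   # maps gathered from the 1D convolutional layer
# Plotting A against the hours of the day yields a per-hour relevance profile as in Fig. 6.
```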

3 Experimental setup

To mitigate the risk of over-fitting our machine learning models, we allocated 80% of the recordings from each recorder to generate 100 temporal blocks for the training set, while the remaining 20% were used to create 30 temporal blocks for the testing set. Additionally, to validate the classifiers’ performance and fine-tune the hyperparameters, we extracted 20% of the training set to serve as the validation set. As a result, for the Caribbean study site, we obtained 1920 temporal blocks for training, 480 for validation, and 720 for testing. Similarly, for Rey Zamuro, we obtained 7440 temporal blocks for training, 1860 for validation, and 2790 for testing.
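
For reference, these block counts follow directly from the number of recorders per site (24 in the Caribbean, 93 at Rey Zamuro) and the 100/30 training/testing blocks generated per recorder, as the following check illustrates:

```python
# Block counts per site: 100 training blocks and 30 testing blocks per recorder,
# with 20% of the training blocks held out for validation.
for site, n_recorders in {"Caribbean": 24, "Rey Zamuro": 93}.items():
    blocks = n_recorders * 100
    train, val, test = int(0.8 * blocks), int(0.2 * blocks), n_recorders * 30
    print(site, train, val, test)   # Caribbean: 1920 480 720; Rey Zamuro: 7440 1860 2790
```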

After partitioning the data, we trained and tested three supervised models to classify the temporal blocks: a random forest, a neural network, and our proposed convolutional neural network based on 1D convolutions. For both the random forest and neural network, we flattened the temporal blocks into a one-dimensional vector, yielding a vector of \(24 \times m\) features, which served as the input for these models. We determined the optimal hyperparameters for each supervised model using a grid search approach. The final configurations were as follows: for the random forest, we used a maximum depth of 20, the Gini criterion, and 100 trees. The neural network comprised an input layer of size \(24 \times m\), a hidden layer of size 150 with a ReLU activation function, and an output layer corresponding to the three classes with a Softmax activation function.
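
For illustration, the baseline configurations above can be written as follows (a sketch using scikit-learn and PyTorch rather than the exact training code; `X_train` and `y_train` stand in for the flattened blocks and their labels):

```python
from sklearn.ensemble import RandomForestClassifier
import torch.nn as nn

m = 128                                   # 60 when EAI features are used

# Random forest on the flattened 24*m blocks
rf = RandomForestClassifier(n_estimators=100, max_depth=20, criterion="gini")
# rf.fit(X_train.reshape(len(X_train), -1), y_train)

# Feed-forward baseline: 24*m -> 150 (ReLU) -> 3 (Softmax)
mlp = nn.Sequential(
    nn.Linear(24 * m, 150),
    nn.ReLU(),
    nn.Linear(150, 3),
    nn.Softmax(dim=1),
)
```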

In our proposed column-wise 1D convolutional neural network, as illustrated in Fig. 4, the fully connected head begins with an input layer comprising \(64 \times m\) neurons. The network features two hidden layers with dimensions of 512 and 128, respectively, each followed by a ReLU activation function and dropout regularization with a dropout probability p of 0.5. The network concludes with an output layer sized to the three classes (transformation or coverage), incorporating a Softmax activation function. This network and the previously mentioned neural network were trained for 50 epochs with a batch size of 64. We used the Adam optimizer to minimize the cross-entropy loss between the predicted and actual labels. The learning rate was set to 0.005 for the neural network and to a lower value of 0.0001 for the 1D convolutional neural network.
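
A minimal PyTorch sketch of the proposed network and its training configuration under the hyperparameters stated above; the column-wise flattening and stride of 24 follow Sect. 2.2.3, and the training loop is abridged:

```python
import torch
import torch.nn as nn

class ColumnWise1DCNN(nn.Module):
    def __init__(self, m=128, n_classes=3):
        super().__init__()
        # 64 kernels of length 24 with stride 24: one output position per feature column
        self.conv = nn.Conv1d(1, 64, kernel_size=24, stride=24)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * m, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, n_classes),    # logits; CrossEntropyLoss applies softmax internally
        )

    def forward(self, x):                 # x: (batch, 1, 24*m), column-wise flattened blocks
        return self.head(self.conv(x))

model = ColumnWise1DCNN(m=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # 0.005 for the feed-forward baseline
criterion = nn.CrossEntropyLoss()
# for epoch in range(50):                 # 50 epochs, batch size 64
#     for xb, yb in train_loader:         # train_loader: placeholder DataLoader of blocks/labels
#         optimizer.zero_grad()
#         loss = criterion(model(xb), yb)
#         loss.backward()
#         optimizer.step()
```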

4 Results

Table 1 presents a comparative analysis of using EAI or VGGish deep embeddings as feature extraction methods for soundscapes with three different classifiers: random forest, neural network, and the proposed 1D convolutional neural network. Precision, recall, and F1-score metrics are reported according to their performance in discriminating between classes for the Caribbean and Rey Zamuro datasets. In addition, confusion matrices for both feature extraction methods on the three classifiers are shown in Fig. 5.

Table 1 Classification results for the test temporal blocks for Caribbean and Rey Zamuro data with EAI and VGGish deep embeddings
Fig. 5

Confusion matrices for test temporal blocks for Caribbean data on EAI (Fig. 5a to 5c) and VGGish deep embeddings (Fig. 5d to 5f), and Rey Zamuro data on EAI (Fig. 5g to 5i) and VGGish deep embeddings (Fig. 5j to 5l)

For the Caribbean dataset, VGGish yielded results similar to those obtained with EAI, with variations of less than 2%. However, for the Rey Zamuro dataset, the performance with VGGish features surpassed that of the EAI across all three classifiers, with a performance increase of up to 14%. Furthermore, the proposed 1D convolutional neural network outperformed the random forest and neural network models in identifying all classes, regardless of how the soundscapes were characterized, achieving up to 10% higher accuracy. Notably, for the Rey Zamuro dataset, the 1D convolutional neural network maintained high metrics in the savanna and forest categories even though these classes are a minority compared to the pasture class, as illustrated in Fig. 5g–l.

These results show that deep learning-based features present an alternative to traditional EAI, facilitating transfer learning by utilizing pretrained models for feature extraction from acoustic data. These features can then be employed with shallow or deep classifiers. Moreover, the incorporation of 24-hour temporal blocks enables extracting temporal patterns during the training process. Specifically, 1D convolutions exploit the recorded information to integrate 24-hour patterns for each feature.

We then used the Eigen-CAM technique to provide interpretability for the results obtained with the 1D convolutional neural network on the 24-hour temporal blocks, focusing on the 64 one-dimensional convolutional filters designed to capture patterns across the daily cycle. The Eigen-CAM results in Fig. 6 show the influence of each hour of the day on discriminating among classes in both datasets based on their acoustic features.

Fig. 6

Estimated importance by the Eigen-CAM technique of the hours of the day for the final decision of the 1D convolutional network for each class

Figure 6a–c shows that the proposed 1D convolutional neural network is most influenced by the hour ranges 17:00–18:00, 11:00–12:00, and 04:00–05:00 for the low, medium, and high transformation levels in the Caribbean dataset, respectively. Similarly, different time intervals play a crucial role in the classification of the Rey Zamuro dataset: 19:00 and 22:00 for pasture, 20:00 for savanna, and 15:00, 19:00, and 20:00 for forest, as shown in Fig. 6d–f. Although these intervals partially coincide with sunset and sunrise, they are not intuitive and motivate further studies to understand, for example, why the forest in Rey Zamuro is best characterized at 15:00.

These results show that Eigen-CAM can provide interpretability for the discriminatory process of the 1D convolutional neural network in distinguishing between levels of transformation/coverage. Moreover, this technique enables visualization of the relevance associated with each hour of the day within the 24-hour temporal blocks constructed from deep learning-based features. Such interpretations help bridge the gap between EAI, which are intrinsically interpretable, and deep embeddings.

5 Discussion

Our approach quantitatively evaluates how well temporal blocks can identify ecosystem patterns by capturing the daily dynamics of soundscapes. Additionally, our methodology facilitates the interpretation of significant temporal patterns within these blocks, pinpointing specific times of day that are crucial for the model’s decision-making process. This capability not only enhances the precision of ecological monitoring but also provides deeper insights into the temporal variability of ecosystems, potentially informing targeted conservation strategies.

In the initial stage of our study, we extracted features using VGGish deep embeddings and compared their effectiveness with EAI, traditionally used in soundscape analysis. Our results demonstrate that deep embeddings provide a more accurate identification of ecological patterns across both study sites, corroborating findings from several state-of-the-art studies that suggest pretrained deep neural networks offer a viable alternative to traditional EAI [59, 24, 41]. Despite these advantages, employing these pretrained models introduces challenges, particularly in data-intensive settings. For example, while VGGish facilitates detailed segment characterization down to approximately 16 ms, this granularity results in an extensive feature set in large-scale applications. To manage this, we averaged features over one- or five-minute intervals depending on the study site, which, while necessary to reduce the computational load, may lead to significant information loss. This trade-off highlights a critical area for future research: optimizing the balance between detail and manageability in feature extraction for large-scale ecological monitoring.

In the next stage, we used the extracted features to construct 24-hour temporal blocks designed to represent the daily dynamics of each recording site. These blocks were then used to train a 1D CNN specifically engineered to leverage this hourly information to identify the labels assigned to each recording site. In contrast, we also fed the same blocks to conventional supervised classification models, such as random forests and neural networks, which do not inherently exploit the structure of the input data. The results demonstrated a significant improvement with the 1D CNN, underscoring the benefits of appropriately leveraging the temporal structure of the recordings.

The literature has typically treated the temporal structure of large-scale soundscape studies as a component analyzed in a post-processing stage [38, 41]. A pivotal finding in our work is that integrating such temporal information directly into the learning process markedly enhances both the performance and interpretability of soundscape analysis. However, a potential limitation of our approach is the use of 1D convolutions instead of more sophisticated models such as recurrent neural networks, transformers, or state-space models, which might manage temporal information more effectively. We opted for 1D CNNs due to their simplicity and the fewer parameters required, considering the volume of temporal blocks at hand. Nonetheless, this decision opens up new avenues for future research, particularly in exploring advanced models that could further optimize temporal data handling.

Finally, we employed Eigen-CAM to enhance the interpretability of the proposed 1D convolutional neural network. Specifically, we sought to identify the hours of the day that contribute most significantly to the classification, pinpointing when the soundscapes of different ecosystem types diverge the most. Traditionally, Eigen-CAM has been used to elucidate the operational mechanisms of convolutional neural networks by visually highlighting what they capture in images, which has proven invaluable in fields such as medical imaging [60, 61], security [62], and pedestrian identification [63], among others. In our study, Eigen-CAM revealed the critical hourly information within the temporal blocks as interpreted by the 1D convolutional network. Our analysis showed that specific hours not usually associated with peak activity (11:00–12:00 and 19:00–20:00), as well as periods around sunrise (04:00–06:00) and sunset (17:00–18:00), are particularly important for distinguishing between transformation levels in the Caribbean dataset and cover types in the Rey Zamuro dataset. This insight is instrumental for soundscape studies, providing a deeper understanding of patterns that could potentially impact ecosystem health and aiding in the development of targeted conservation strategies.

In summary, our study underscores the significant potential of employing temporal blocks in conjunction with 1D convolutional neural networks to enhance the precision and interpretability of soundscape analyses. By directly integrating temporal information into the learning process, we not only improve model performance but also provide deeper insights into the ecological dynamics at play. The use of Eigen-CAM has been instrumental in highlighting crucial temporal intervals that influence classification outcomes, offering a novel approach to understanding ecological patterns. While our methodology represents a notable advancement in ecoacoustic monitoring, the identified limitations and trade-offs point toward substantial opportunities for future research. Specifically, exploring more sophisticated temporal models could further refine our understanding and management of ecological data. As ecosystems worldwide continue to face unprecedented changes, the methodologies developed in this study offer promising tools for researchers and conservationists to monitor and understand these complex environments more effectively.

6 Conclusion

This study introduces a methodology to highlight ecologically meaningful information from PAM recordings. It has two key characteristics: the use of deep learning-based features as a viable alternative to traditional EAI, and the encoding of daily dynamics through 24-hour temporal blocks. The temporal embedding within the proposed blocks proved beneficial for distinguishing between varying levels of ecosystem transformation or land covers. Moreover, employing CAM techniques affords interpretability to the deep embeddings, which, unlike EAI, are not inherently interpretable. As a result, the proposed approach provided crucial time-of-day information to discriminate between classes.

Looking ahead, the exploration of alternative architectures as feature generators is essential. Additionally, the potential of other models, such as recurrent neural networks, to leverage the temporal information of soundscapes warrants investigation. Lastly, the application of unsupervised learning models to these data should be considered, given their independence from labeled datasets.