Abstract
Passive acoustic monitoring (PAM) is an effective, non-intrusive method for studying ecosystems, but obtaining meaningful ecological information from the large volumes of audio it produces is challenging. In this study, we take advantage of the expected animal behavior at different times of the day (e.g., higher acoustic animal activity at dawn) and develop a novel approach to exploit these time-based patterns. We organize PAM data into 24-hour temporal blocks formed with sound features from a pretrained VGGish network. These features feed a 1D convolutional neural network with a class activation mapping technique that gives interpretability to its outcomes. As a result, these diel-cycle blocks offer more accurate and robust hour-by-hour information than traditional ecological acoustic indices used as features, effectively recognizing key ecosystem patterns.
1 Introduction
Soundscape ecology, which studies the relationships between the landscape and its sounds at multiple temporal and spatial scales [1, 2], has been used to understand ecological processes [3, 4] and to drive conservation plans that mitigate biodiversity loss [5,6,7,8]. However, passive acoustic monitoring (PAM) requires analyzing audio recordings from tens or hundreds of devices, bringing in several processing and interpretability challenges [9]. Consequently, there is a growing need to compute features representing the soundscape that are useful for assessing landscape characteristics [10] and that can be applied to, for example, quantifying landscape degradation [11, 12].
To address these challenges, ecological acoustic indices (EAI) have been developed to encode complex and biologically relevant information in a single value by analyzing sound properties, including intensity, frequency, and temporal and spectral patterns, which eases and speeds up the processing of large volumes of data [13, 14]. These indices, along with statistical and machine learning methods, have been widely used to elucidate ecosystem attributes such as wildlife presence [15], bioacoustic activity [16], habitat quality [17], and transformation level [18], or to predict habitat type and vegetation structure [19]. However, these indices are sensitive to low signal-to-noise ratios in the recording process [20] and to noise (such as geophony or anthropophony) masking sounds of interest [21]. Moreover, several studies use the same indices with contradictory results when correlated with biological variables [22].
In contrast to hand-crafted features such as EAI, deep learning-based approaches exploit data structure to learn representations of the acoustic recordings [23,24,25], which helps these models adapt to several classification tasks [26,27,28,29]. However, a large amount of labeled data is usually required for these models to learn suitable data representations. Moreover, manual annotations from experts are scarce in soundscape studies, for which data-driven alternatives must be developed [30, 31]. Consequently, transfer learning has emerged as an alternative for applying deep learning methods to acoustic recordings, owing to its inherent ability to adapt knowledge from one or several source domains to a target domain [28, 31, 32].
Among deep learning methods, convolutional neural networks (CNNs) are particularly notable for their ability to automatically learn features in their convolutional layers, making them highly effective for acoustic analysis in ecoacoustics. For instance, CityNet [33] utilizes CNNs to quantify biotic and anthropogenic sounds in urban environments, while BirdNet [34], based on the ResNet architecture, specializes in identifying bird species from their sounds. Another prominent example is VGGish, a well-established pretrained neural network renowned for its capability in audio processing. Specifically designed for transfer and zero-shot learning, VGGish generates compact yet informative deep embeddings of audio signals that capture essential acoustic features [35]. These embeddings are helpful for various audio processing tasks such as audio classification, content-based retrieval, and acoustic scene understanding, with applications demonstrated in numerous studies [36, 37]. Recently, VGGish features have also been investigated for their potential as ecological indicators, further expanding their applicability in environmental research. In [38], VGGish deep embeddings are used in both supervised and unsupervised tasks as a universal feature set, yielding ecological insight across spatiotemporal scales and enabling the detection of anomalies. Other studies used transfer learning on VGGish as a feature extraction method to identify bird species [39], avian richness [40], sound components [41], and to discriminate between marine mammal species [42].
Another aspect to consider in PAM studies is the inclusion of the temporal domain. In this regard, authors have observed that some times of the day offer more relevant acoustic information than others [21]. For example, some taxonomic groups modulate their acoustic activity dynamics with the diel cycle [43]. Moreover, in [29], the authors found a link between soundscape components across time, showing that different patterns may be revealed by analyzing acoustic features on finer time scales. From a deep learning perspective, several studies have utilized features extracted from pretrained models to analyze spatiotemporal patterns in ecosystems. For instance, [38] explores the use of VGGish-based features to discern seasonal or daily patterns, employing supervised models trained on these features. Additionally, [41] leverages unsupervised learning models to classify sound component types from VGGish-based features, which are then used to detect spatiotemporal variations in ecosystems. Despite these advances demonstrating the significance of incorporating temporal information in audio analysis, these studies typically apply such temporal data during post-processing rather than directly integrating it into the learning process. This approach prevents machine learning models from fully utilizing this valuable information. Therefore, developing models or architectures that incorporate and exploit these temporal patterns directly during the learning process is crucial for enhancing performance and facilitating interpretability in machine learning applications.
Additionally, these models must be interpretable. For EAI, each index is designed to underscore specific attributes inherent to the landscape. Conversely, the interpretability of deep embeddings often presents a challenge, as patterns across the training data influence them. This holds even for non-related soundscape recordings when transfer learning is employed [44]. Consequently, establishing a correlation between non-biologically grounded features and ecosystem outcomes becomes complex. To cope with this issue, class activation mapping (CAM) techniques have been employed in sound classification or detection tasks to examine the effectiveness of the constructed feature maps and to visualize which temporal or frequency regions play a significant role in the network decision [45,46,47].
In this study, we present a model architecture that enables machine learning techniques to directly incorporate temporal information during the learning process. This approach overcomes the limitation of including temporal information in a post-processing stage. We hypothesize that by feeding machine learning models with temporal blocks—representing the daily dynamics of soundscapes—we can significantly improve model performance in identifying ecological patterns and facilitate their interpretability. The proposed methodology is structured into three main steps: Initially, we extract deep learning-based features from the audio recordings at each recording site and organize these into 24-hour temporal blocks to capture the site’s daily dynamics. These temporal blocks are then used to train supervised models, leveraging the temporal information to classify the ecological patterns at the locations where each audio device is situated based entirely on its audio recordings. Additionally, we employ an interpretability technique to identify the most significant temporal patterns within the blocks, pinpointing specific times of day that enhance the model’s ability to differentiate among classes.
Our study focuses on PAM data (a rich source of ecological information) from two tropical dry forest sites. Each site is equipped with multiple recorders arranged in a grid, providing comprehensive coverage of the study area. In the first study site, recorders were positioned in areas characterized by one of three transformation levels: low, medium, or high. In the second study site, the recorders were located in areas classified into cover classes such as forest, savanna, or pasture. This diverse and extensive dataset allows us to test the effectiveness of our proposed model architecture in a variety of ecological contexts.
To test our approach, we compare it with a version trained with EAI features and supervised learning techniques such as random forest and a feed-forward neural network. As a result, we demonstrate that 24-hour temporal blocks allow exploiting acoustic information to identify transformation/coverage levels in both ecosystems. Moreover, the network trained with deep learning-based features also outperforms the models trained with EAI. Thus, the CAM technique provides the level of relevance of each time of the day when performing the classification.
This paper is structured as follows: the materials and methods are presented in Sect. 2 and the experimental setup in Sect. 3, followed by the results in Sect. 4, the discussion in Sect. 5, and the conclusions in Sect. 6.
2 Materials and methods
2.1 Study sites
2.1.1 Caribbean dataset
The study area encompasses tropical dry forest ecosystems situated in the departments of La Guajira and Bolivar in the Colombian Caribbean region, specifically within the Arroyo and Cañas River basins. Twenty-four automatic recording units were located around 11° 11′ 11.0″ N, 73° 26′ 32.9″ W for the Guajira department and around 9° 56′ 23.8″ N, 75° 10′ 06.4″ W in Bolivar (see Fig. 1). These sites are characterized by high levels of endemism and exhibit considerable rainfall variability. They are located at elevations ranging from 0 to 1000 m and experience dry periods lasting between 3 and 6 months.
Acoustic recordings were acquired between December 2015 and March 2017 by the Alexander von Humboldt Institute (IAVH). They used SM2 and SM3 recorders from Wildlife Acoustics, programmed to record for 5 min every 10 min over five consecutive days, followed by a five-day recording hiatus. These recordings were part of the Global Environment Facility (GEF) project, aimed at characterizing the biodiversity of the remaining tropical dry forests in Colombia [48].
The soundscapes were manually labeled based on the level of ecological transformation (high, medium, low). IAVH researchers with expertise in vegetation and ecosystem types performed this categorization through direct field observation. A low transformation level was assigned to areas exhibiting a high proportion of preserved or newly regenerated forest. In contrast, areas with significant forest loss were labeled as having a high transformation level.
2.1.2 Rey Zamuro dataset
The study was conducted at the Rey Zamuro and Matarredonda private reserves, located in the Municipality of San Martín (Meta Department, Colombia), between 3° 33′ 21.1″ N, 73° 24′ 41.9″ W and 3° 30′ 45.0″ N, 73° 23′ 11.0″W. The area, spanning 6000 hectares, comprises 60% natural savanna ecosystems, interspersed with areas of introduced pastures, while the remaining 40% is covered by forest ecosystems [49].
Acoustic recordings were carried out in September 2022. We deployed 93 AudioMoth acoustic devices across a 13 \(\times\) 8 grid, as depicted in Fig. 2. Adjacent devices were placed 400 m apart [50]. The devices were set to record at 192 kHz, operating for seven days and capturing one minute of audio every 14 min.
The soundscapes of Rey Zamuro were categorized as Forest, Savanna, or Pasture according to land cover. These classifications were determined according to the locations of the automatic recording units. We conducted a vegetation cover assessment using the smile random forest algorithm on the Google Earth Engine platform [51], applied to a low-cloud Sentinel-2A satellite image. The resulting vegetation cover map was validated and manually adjusted, incorporating information gathered during fieldwork.
2.2 Methods
The proposed methodology for finding transformation/coverage levels of ecosystems using temporal patterns of PAM recordings involves the following stages: (i) computing features from audios of every recording site and building temporal blocks that represent the site’s temporal dynamics throughout the day, (ii) classifying each block by leveraging such temporal information, and (iii) identifying relevant patterns present in each temporal block by using an interpretability technique that pinpoints the times of day that most aid the model in differentiating between classes. We will first introduce some main concepts about convolutional neural networks to provide an approach to the basic principles.
2.2.1 Convolutional Neural Network (CNN)
Convolutional neural networks are deep learning models that have recently emerged as powerful tools for ecoacoustic tasks [52]. CNNs consist of convolutional layers and fully connected layers [53, 54]. Convolutional layers apply small learnable filters or kernels (noted as F) to input data to detect patterns or features while sharing parameters across the input. Let I be the input data volume of size \(W_\textrm{in} \times H_\textrm{in} \times C\), where \(W_\textrm{in}\) is the width, \(H_\textrm{in}\) is the height, and C is the number of channels or depth. Kernels are usually small matrices of size \(f \times f \times C\), i.e., with the same depth as the input volume. The resulting feature map O from the convolution operation can be expressed as Equation (1):

\(O(i,j) = \sum _{u=1}^{f} \sum _{v=1}^{f} \sum _{c=1}^{C} F(u,v,c)\, I(i+u-1,\, j+v-1,\, c) + b \qquad (1)\)

where b is the bias of the convolution. In matrix form, the convolution operation is given by Eq. (2):

\({O} = {W} * {I} + {b} \qquad (2)\)
where W are the filter learnable parameters, i.e., the convolution weights.
The spatial dimensions of the output volume are smaller compared with the input volume size because the filters do not cover the entire input volume. The output size can be computed as:

\(W_\textrm{out} = \left\lfloor \frac{W_\textrm{in} - f + 2P}{S} \right\rfloor + 1, \qquad H_\textrm{out} = \left\lfloor \frac{H_\textrm{in} - f + 2P}{S} \right\rfloor + 1\)
where \(W_\textrm{in}\), \(H_\textrm{in}\), \(W_\textrm{out}\), and \(H_\textrm{out}\), are the width and height of the input volume and the width and height of the output volume, respectively. P stands for the value of padding, i.e., adding extra border rows and columns to the input volume to preserve spatial information at the edges, S determines the stride, i.e., the step size for sliding filter over the input volume, and \(\left\lfloor \cdot \right\rfloor\) denotes a floor function, returning the largest integer less than or equal to its argument. Each convolution filter produces its own output feature map, capturing different patterns or features. Consequently, multiple filters are used in each convolutional layer, increasing the output volumes in depth.
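The output-size formula above can be expressed as a small helper function (a sketch; the function and parameter names are ours):

```python
import math

def conv_output_size(w_in, h_in, f, padding=0, stride=1):
    """Spatial output size of a convolution layer.

    Implements W_out = floor((W_in - f + 2P) / S) + 1, and likewise for
    the height, matching the formula in the text.
    """
    w_out = math.floor((w_in - f + 2 * padding) / stride) + 1
    h_out = math.floor((h_in - f + 2 * padding) / stride) + 1
    return w_out, h_out
```

For example, a 3 \(\times\) 3 kernel with padding 1 and stride 1 preserves the input's spatial dimensions, while stride 2 without padding roughly halves them.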
An activation function \(\sigma\) is typically applied element-wise after each convolution to introduce nonlinearity. The Rectified Linear Unit (ReLU)—defined as \(\sigma (O) =\textrm{max}(0, O)\)—is the most commonly used activation function due to its simplicity and effectiveness. Finally, the activated features are followed by pooling operators that reduce spatial dimensions and computational complexity.
Convolutional layers reduce the manual feature design stage by automatically learning and extracting meaningful features from input data. In this sense, multiple convolutional layers compute low-, mid-, and high-level features.
Additionally, fully connected layers receive input from the learned high-level features extracted by the previous convolutional layers and use them to make predictions or classifications. Fully connected layers are traditional neural network layers that connect every neuron in one layer to every neuron in the next layer, considering the entire input volume, allowing the network to learn complex relationships between features.
To train CNNs in a supervised learning fashion, a loss function measures the difference between the predicted outputs from the network and the true output of the data. To minimize the loss function and learn from the data, the parameters of the network (W and b) are updated in a direction that reduces the loss using a gradient-based optimization algorithm.
2.2.2 Characterizing acoustic recordings
To extract features from acoustic recordings, we employ a well-established pretrained neural network widely recognized in the field of audio processing, named VGGish. This network’s architecture is specifically designed to generate compact yet informative representations—or deep embeddings—of audio signals, as detailed in [35]. Inspired by the visual geometry group (VGG) network architecture, originally devised for image classification [55], VGGish excels in producing deep embeddings that effectively capture pertinent acoustic features. These embeddings lay the groundwork for various audio processing tasks, including audio classification, content-based retrieval, and acoustic scene comprehension, as demonstrated in previous studies [36, 37]. Trained on the extensive and diverse AudioSet [56], a large-scale, publicly accessible audio dataset featuring millions of annotated clips across 527 categories–from animal vocalizations and musical instruments to human activities and environmental noises, VGGish is adept at handling a broad range of audio types.
The architecture of VGGish incorporates multiple convolutional, max-pooling layers with ReLU activation functions, followed by three fully connected layers with dimensions 4096, 4096, and 128. The model processes audio by segmenting it into 0.96-second clips, each transformed into a log-Mel spectrogram that feeds into the neural network. Convolutional layers, equipped with learnable filters, analyze the input spectrogram to identify local patterns and distill low-level features. Max-pooling layers follow, reducing the feature maps’ spatial dimensions while preserving essential information. This step is crucial for capturing and maintaining relevant patterns across various scales, further refining the representations.
The culmination of the process involves the fully connected layers. They transform the aggregated output from the convolutional and max-pooling layers into a lower-dimensional space, producing a 128-dimensional representation. This final mapping is designed to encapsulate global and high-level dependencies, resulting in deep embeddings that encode significant audio signal details. These embeddings are then used as inputs for further analysis, whether through shallow or more complex deep learning methods.
In our setup, we computed 312 deep embeddings for each 5-minute recording from the Caribbean dataset and 62 deep embeddings for the one-minute recordings from the Rey Zamuro dataset. Each set of embeddings was subsequently averaged to generate a single 128-dimensional deep embedding vector. In addition, to compare the features produced by the VGGish model, we computed 60 EAI from each acoustic recording using the Python package scikit-maad [57]. Of these indices, 16 were derived from the time domain and 44 from the spectral domain.
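The per-recording averaging step can be sketched as follows, assuming the per-clip VGGish embeddings have already been computed (the function name is ours):

```python
import numpy as np

def summarize_embeddings(clip_embeddings):
    """Average per-clip VGGish embeddings into one 128-dim vector.

    `clip_embeddings` is an (n_clips, 128) array: roughly 312 clips for
    a 5-minute Caribbean recording or 62 clips for a 1-minute
    Rey Zamuro recording (one embedding per 0.96-s segment).
    """
    emb = np.asarray(clip_embeddings, dtype=np.float32)
    assert emb.ndim == 2 and emb.shape[1] == 128
    return emb.mean(axis=0)  # shape (128,)
```

The same pattern applies to the 60 EAI computed with scikit-maad, yielding one m-dimensional vector per recording in either case.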
2.2.3 Column-wise 1D Convolutional Neural Network
In the second stage of our study, our objective is to construct temporal blocks that encapsulate the site’s daily temporal dynamics, utilizing the features computed in the initial stage. For this purpose, we select features from audio recordings taken at each hour of the day and arrange them in a row-wise fashion to form a matrix of dimensions \(24 \times m\). Here, m represents the number of features, which is 60 when using Ecological Acoustic Indices (EAI) or 128 when employing deep learning-based features. This process is shown in Fig. 3. Given that multiple audio recordings are available for each hour in the recording process, we randomly select one recording per hour. This approach enables us to augment the quantity of temporal blocks available for the training process.
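The block-construction step, with its random draw of one recording per hour, can be sketched as follows (a sketch; the function name and the dictionary layout are ours):

```python
import numpy as np

def build_temporal_block(features_by_hour, rng=None):
    """Build one 24 x m temporal block for a recording site.

    `features_by_hour` maps each hour (0..23) to an (n_recordings, m)
    array of feature vectors recorded at that hour. One recording is
    drawn at random per hour, so repeated draws yield distinct blocks
    and augment the training set.
    """
    rng = np.random.default_rng(rng)
    rows = []
    for hour in range(24):
        candidates = np.asarray(features_by_hour[hour])
        rows.append(candidates[rng.integers(len(candidates))])
    return np.stack(rows)  # shape (24, m), one row per hour
```

Each resulting block inherits the transformation or cover label of the recording unit it was built from.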
Fig. 3 Proposed temporal blocks for EAI and deep embeddings. These features are selected based on the time of day and stacked in rows, creating a temporal block in which each row is associated with one hour of the day. The resulting temporal block is labeled according to the label of its recording unit's location
The feature matrix is then flattened column-wise into a one-dimensional vector of size \(1 \times 24m\). This vector offers a comprehensive representation of the daily acoustic profile, capturing the essence of the soundscape over a 24-hour period. Subsequently, we introduce a 1D convolutional layer equipped with 64 kernels of size 24, applied with a stride of 24 so that each kernel application combines the 24 hourly values of a single feature. Each kernel in this layer is adept at identifying and emphasizing similar audio patterns across different hours of the day. This step is crucial for identifying consistent and distinct temporal patterns throughout the day. The feature map provided by this convolutional layer is fed to several fully connected layers to fulfill the classification process, as depicted in Fig. 4.
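The column-wise layer and classification head can be sketched in PyTorch as follows. This is a sketch under our reading of the architecture: the class name is ours, and the feature-major flattening is an assumption consistent with the \(N \times 64 \times 128\) feature-map shape reported in Sect. 2.3:

```python
import torch
import torch.nn as nn

class ColumnWise1DCNN(nn.Module):
    """Column-wise 1D CNN over 24 x m temporal blocks (a sketch).

    The block is flattened feature-by-feature, so each stride-24
    application of a length-24 kernel spans the 24 hourly values of one
    feature; 64 kernels yield a (64, m) feature map per block.
    """

    def __init__(self, m=128, n_classes=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 64, kernel_size=24, stride=24)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * m, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, n_classes),  # logits; softmax is applied by the loss
        )

    def forward(self, block):  # block: (batch, 24, m)
        # Feature-major flatten: each feature's 24 hours become contiguous.
        x = block.transpose(1, 2).reshape(block.size(0), 1, -1)
        return self.head(torch.relu(self.conv(x)))
```

Returning logits and letting the cross-entropy loss apply the softmax internally is the idiomatic PyTorch equivalent of the softmax output layer described in Sect. 3.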
Finally, the corresponding label is assigned to the built temporal block for each feature set (EAI or VGGish). For Caribbean data, the labels represent the transformation level of the study site. For Rey Zamuro, the labels correspond to the land cover class.
2.3 Eigen-CAM-based visualization
Neural networks are often regarded as ‘black boxes,’ obscuring the interpretability of their classification decisions. This aspect becomes particularly significant when using deep learning-based embeddings to process soundscapes, as opposed to traditional EAI that inherently carry phenomenological meaning. To address this issue, we compute and visualize the principal components of the learned feature maps from convolutional layers. This approach lets us identify which features remain relevant throughout the neural network’s processing. Such visualization and analysis can be effectively conducted using the Eigen-CAM technique, a gradient-free and class-independent method, as outlined in [58].
Our goal is to interpret the input features within the 24-hour temporal block by examining the patterns captured by the 64 1D convolutional filters. Each filter is specifically designed to identify patterns throughout the 24-hour cycle. For this purpose, let \({I} \in \Re ^{N \times 24 \times 128}\) represent the neural network input, where N is the number of temporal blocks. The projection of I onto the 1D convolutional layer in question is defined as \({O} = {W}*{I} + {b}\), where \({W} \in \Re ^{64 \times 1 \times 24}\) and \({b} \in \Re ^{64}\) are the weights and bias of the convolutional layer, respectively, and \({O} \in \Re ^{N \times 64 \times 128}\) is the resulting feature map. The singular value decomposition of each n-th element in the batch of the feature map, \({O}_n \in \Re ^{24 \times 128}\) with \(n \in \{1,\dots , N\}\), is given by Equation (3):

\({O}_n = {U}_n \mathbf {\Sigma }_n {V}_n^\top \qquad (3)\)
where \({U}_n \in \Re ^{24 \times 24}\) is an orthogonal matrix containing the left singular vectors, \(\mathbf {\Sigma }_n \in \Re ^{24 \times 128}\) is a diagonal matrix containing the singular values, and \({V}_n \in \Re ^{128 \times 128}\) is an orthogonal matrix containing the right singular vectors. To estimate the feature relevance, the projection onto the first right singular vector is computed, as expressed in Equation (4):

\({A}_n = {O}_n {v}_n^{(1)} \qquad (4)\)
where \({v}_n^{(1)}\) is the first right eigenvector of \({O}_n\) and \({A}_n \in {R}^{24 \times 1}\) is the relevance by hour of the n-th element of the batch. Finally, the average relevance \({A}\in {R}^{24 \times 1}\) for each hour is computed by averaging the relevance across all batch elements. The feature relevance matrix A is frequently utilized as a heatmap to visually highlight relevant patterns within the input data.
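Equations (3)-(4) and the final averaging can be sketched with NumPy (a sketch; the function name is ours):

```python
import numpy as np

def eigen_cam_relevance(feature_maps):
    """Eigen-CAM hourly relevance (gradient-free, class-independent).

    `feature_maps` is the (N, 24, 128) stack of per-block feature maps
    O_n. Each O_n is projected onto its first right singular vector
    (Eq. 4), and the per-hour relevances A_n are averaged over blocks.
    """
    relevances = []
    for O_n in feature_maps:                    # O_n: (24, 128)
        _, _, vt = np.linalg.svd(O_n, full_matrices=False)
        v1 = vt[0]                              # first right singular vector
        relevances.append(O_n @ v1)             # A_n: (24,)
    return np.mean(relevances, axis=0)          # A: (24,), one value per hour
```

Note that the sign of a singular vector is arbitrary, so in practice a sign convention (or the absolute value of the relevance) may be applied before visualizing the result as a heatmap.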
3 Experimental setup
To mitigate the risk of over-fitting our machine learning models, we allocated 80% of the recordings from each recorder to generate 100 temporal blocks for the training set, while the remaining 20% were used to create 30 temporal blocks for the testing set. Additionally, to validate the classifiers’ performance and fine-tune the hyperparameters, we extracted 20% of the training set to serve as the validation set. As a result, for the Caribbean study site, we obtained 1920 temporal blocks for training, 480 for validation, and 720 for testing. Similarly, for Rey Zamuro, we obtained 7440 temporal blocks for training, 1860 for validation, and 2790 for testing.
After partitioning the data, we trained and tested three supervised models to classify the temporal blocks: a random forest, a neural network, and our proposed convolutional neural network based on 1D convolutions. For both the random forest and neural network, we flattened the temporal blocks into a one-dimensional vector, yielding a vector of \(24 \times m\) features, which served as the input for these models. We determined the optimal hyperparameters for each supervised model using a grid search approach. The final configurations were as follows: for the random forest, we used a maximum depth of 20, the Gini criterion, and 100 trees. The neural network comprised an input layer of size \(24 \times m\), a hidden layer of size 150 with a ReLU activation function, and an output layer corresponding to the three classes with a Softmax activation function.
In our proposed column-wise 1D convolutional neural network, as illustrated in Fig. 4, the fully connected head begins with an input layer comprising \(64 \times m\) neurons. The network features two hidden layers with dimensions of 512 and 128, respectively. Each layer is followed by ReLU activation functions and dropout regularization, with a dropout probability p of 0.5. The network concludes with an output layer sized to accommodate the three classes (transformation or coverage), incorporating a Softmax activation function. This network and the previously mentioned neural network were trained during 50 epochs with a batch size of 64. We utilized the Adam optimizer to minimize the cross-entropy loss between the predicted and actual labels. The learning rate was set at 0.005 for the neural network, while for the 1D convolutional neural network, a lower rate of 0.0001 was applied.
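The training procedure described above can be sketched as follows (a sketch under the stated hyperparameters; the function name and loop structure are ours):

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=50, lr=1e-4):
    """Training sketch mirroring the described setup: Adam optimizer,
    cross-entropy loss, 50 epochs (lr 1e-4 for the 1D CNN, 5e-3 for
    the plain neural network)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # applies softmax internally
    for _ in range(epochs):
        model.train()
        for blocks, labels in train_loader:  # batches of 64 temporal blocks
            opt.zero_grad()
            loss = loss_fn(model(blocks), labels)
            loss.backward()
            opt.step()
    return model
```

The loader is assumed to yield (temporal block, label) batches built from the 80/20 split described above.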
4 Results
Table 1 presents a comparative analysis of using EAI or VGGish deep embeddings as feature extraction methods for soundscapes with three different classifiers: random forest, neural network, and the proposed 1D convolutional neural network. Precision, recall, and F1-score metrics are reported according to their performance in discriminating between classes for the Caribbean and Rey Zamuro datasets. In addition, confusion matrices for both feature extraction methods on the three classifiers are shown in Fig. 5.
For the Caribbean dataset, VGGish yielded similar results to those obtained with EAI, with variations of less than 2%. However, the performance with VGGish features in the Rey Zamuro dataset surpassed that of the EAI across all three classifiers, allowing a performance increase of up to 14%. Furthermore, the proposed 1D convolutional neural network outperformed the random forest and neural network models in identifying all classes, regardless of the characterization of the soundscapes, achieving up to 10% more accuracy. Notably, for the Rey Zamuro dataset, the 1D convolutional neural network maintained high metrics in the Savanna and Forest categories, even though these classes represent a minority compared to the Pasture class, as illustrated in Fig. 5g–l.
These results show that deep learning-based features present an alternative to traditional EAI, facilitating transfer learning by utilizing pretrained models for feature extraction from acoustic data. These features can then be employed with shallow or deep classifiers. Moreover, the incorporation of 24-hour temporal blocks enables extracting temporal patterns during the training process. Specifically, 1D convolutions exploit the recorded information to integrate 24-hour patterns for each feature.
We then used the Eigen-CAM technique to provide interpretability to the results obtained with the 1D convolutional neural network for the input features within the 24-hour temporal block, focusing on the 64 one-dimensional convolutional filters designed to capture patterns across a daily cycle. The Eigen-CAM results in Fig. 6 display the influence of each hour of the day in discriminating among classes in both datasets based on their acoustic features.
Figure 6a–c show how the proposed 1D convolutional neural network is most influenced by the hour ranges of 17:00–18:00, 11:00–12:00, and 04:00–05:00 for low, medium, and high transformation levels in the Caribbean dataset, respectively. Similarly, different time intervals play a crucial role in the classification of the Rey Zamuro dataset: 19:00 and 22:00 for pasture, 20:00 for savanna, and 15:00, 19:00, and 20:00 for forest, as shown in Fig. 6d–f. Although these intervals partially coincide with sunset and sunrise, this is not intuitive information, and it warrants further study to understand, for example, why the forest in Rey Zamuro is better characterized at 15:00.
These results evidence that Eigen-CAM can provide interpretability to the discriminatory process of the 1D convolutional neural network in distinguishing between levels of transformation/coverage. Moreover, this technique enables visualization of the relevance level associated with each hour of the day within the 24-hour temporal blocks constructed from deep learning-based features. Such interpretations facilitate bridging the gap that may exist between EAI, which intrinsically offers interpretability, and deep embeddings.
5 Discussion
Our approach quantitatively evaluates how well temporal blocks can identify ecosystem patterns by capturing the daily dynamics of soundscapes. Additionally, our methodology facilitates the interpretation of significant temporal patterns within these blocks, pinpointing specific times of day that are crucial for the model’s decision-making process. This capability not only enhances the precision of ecological monitoring but also provides deeper insights into the temporal variability of ecosystems, potentially informing targeted conservation strategies.
In the initial stage of our study, we extracted features using VGGish deep embeddings and compared their effectiveness with EAI, traditionally used in soundscape analysis. Our results demonstrate that deep embeddings provide a more accurate identification of ecological patterns across both study sites, corroborating findings from several state-of-the-art studies that suggest pretrained deep neural networks offer a viable alternative to traditional EAI [59, 24, 41]. Despite these advantages, employing these pretrained models introduces challenges, particularly in data-intensive settings. For example, while VGGish facilitates detailed segment characterization down to approximately 16 ms, this granularity results in an extensive feature set in large-scale applications. To manage this, we averaged features over one or five-minute intervals depending on the study site, which, while necessary to reduce computational load, may lead to significant information loss. This trade-off highlights a critical area for future research: optimizing the balance between detail and manageability in feature extraction for large-scale ecological monitoring.
In the next stage, we used the extracted features to construct 24-hour temporal blocks designed to represent the daily dynamics of each recording site. These blocks were then used to train a 1D CNN, specifically engineered to leverage this hourly information to identify the labels assigned to each recording site. For comparison, we also fed the same blocks to conventional supervised classification models, such as random forests and neural networks, which do not inherently exploit the structure of the input data. The results demonstrated a significant improvement with the 1D CNN, underscoring the benefits of appropriately leveraging the temporal structure of the recordings.
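Assembling a diel-cycle block from timestamped features can be sketched as below. This is a minimal numpy illustration under stated assumptions — `build_diel_block`, the 128-D feature size, and the synthetic one-day input are hypothetical, not the authors' implementation:

```python
import numpy as np

def build_diel_block(features: np.ndarray, hours: np.ndarray,
                     dim: int = 128) -> np.ndarray:
    """Assemble a 24-step temporal block from timestamped feature vectors.

    `features`: (n, dim) interval-level embeddings; `hours`: (n,) hour-of-day
    label (0-23) per row. Rows sharing an hour are averaged, yielding a
    (24, dim) block — one feature vector per hour of the diel cycle.
    """
    block = np.zeros((24, dim))
    for h in range(24):
        rows = features[hours == h]
        if len(rows):
            block[h] = rows.mean(axis=0)
    return block

rng = np.random.default_rng(1)
feats = rng.normal(size=(1440, 128))     # one day of per-minute features
hrs = np.repeat(np.arange(24), 60)       # 60 minutes per hour
block = build_diel_block(feats, hrs)
print(block.shape)                       # (24, 128)
```

A 1D CNN then convolves along the 24-step hour axis (with the 128 feature dimensions as channels), so each filter learns patterns spanning neighboring hours — structure a random forest fed the flattened block cannot exploit.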
The literature has typically treated the temporal structure of large-scale soundscape studies as a component analyzed in the post-processing stage [38, 41]. A pivotal finding of our work is that integrating such temporal information directly into the learning process markedly enhances both the performance and the interpretability of soundscape analysis. However, a potential limitation of our approach is the use of 1D convolutions instead of more sophisticated models such as recurrent neural networks, transformers, or state-space models, which might handle temporal information more effectively. We opted for 1D CNNs due to their simplicity and smaller parameter count, considering the volume of temporal blocks at hand. Nonetheless, this decision opens up new avenues for future research, particularly in exploring advanced models that could further optimize temporal data handling.
Finally, we employed Eigen-CAM to enhance the interpretability of the proposed 1D convolutional neural network. Specifically, we sought to identify the hours of the day that most significantly contribute to the classification, pinpointing when the soundscapes of different ecosystem types diverge the most. Traditionally, Eigen-CAM has been used to elucidate the inner workings of convolutional neural networks by visually highlighting what they capture in images, which has proven invaluable in fields such as medical imaging [60, 61], security [62], and pedestrian identification [63], among others. In our study, Eigen-CAM revealed the critical hourly information within the temporal blocks as interpreted by the 1D convolutional network. Our analysis showed that specific hours not usually associated with peak activity (11:00–12:00 and 19:00–20:00), as well as periods around sunrise (04:00–06:00) and sunset (17:00–18:00), are particularly important for distinguishing between transformation levels in the Caribbean dataset and cover types in the Rey Zamuro dataset. This insight is instrumental for soundscape studies, providing a deeper understanding of patterns that could potentially impact ecosystem health and aiding in the development of targeted conservation strategies.
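The core of Eigen-CAM — projecting the last conv layer's activations onto their first principal component (Muhammad and Yeasin's formulation) — carries over directly to 1D feature maps. The sketch below is a hedged numpy illustration, not the authors' code; the function name, the 64-channel activation map, and the absolute-value step (the sign of a singular vector is arbitrary) are assumptions:

```python
import numpy as np

def eigen_cam_1d(activations: np.ndarray) -> np.ndarray:
    """Eigen-CAM saliency over a 1D feature map.

    `activations`: (T, C) output of the last conv layer — T temporal steps
    (here, 24 hours) by C channels. Returns a length-T map: the projection
    of the activations onto their first right singular vector, rescaled
    to [0, 1].
    """
    _, _, vt = np.linalg.svd(activations, full_matrices=False)
    cam = np.abs(activations @ vt[0])    # per-hour relevance magnitude
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-12)

rng = np.random.default_rng(2)
acts = rng.normal(size=(24, 64))   # hypothetical last-layer activations
acts[5] *= 10.0                    # pretend the hour at index 5 dominates
saliency = eigen_cam_1d(acts)
print(int(saliency.argmax()))      # 5
```

Because the principal component captures the dominant activation pattern without needing class-specific gradients, the resulting per-hour map can be read directly as "which hours the network attended to" within each 24-hour block.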
In summary, our study underscores the significant potential of employing temporal blocks in conjunction with 1D convolutional neural networks to enhance the precision and interpretability of soundscape analyses. By directly integrating temporal information into the learning process, we not only improve model performance but also provide deeper insights into the ecological dynamics at play. The use of Eigen-CAM has been instrumental in highlighting crucial temporal intervals that influence classification outcomes, offering a novel approach to understanding ecological patterns. While our methodology represents a notable advancement in ecoacoustic monitoring, the identified limitations and trade-offs point toward substantial opportunities for future research. Specifically, exploring more sophisticated temporal models could further refine our understanding and management of ecological data. As ecosystems worldwide continue to face unprecedented changes, the methodologies developed in this study offer promising tools for researchers and conservationists to monitor and understand these complex environments more effectively.
6 Conclusion
This study introduces a methodology to highlight ecologically meaningful information from PAM recordings. It has two key characteristics: deep learning-based features as a viable alternative to traditional EAI and enhancing temporal dynamics through 24-hour temporal blocks. The temporal embedding within the proposed blocks proved beneficial for distinguishing between varying levels of ecosystem transformation or land covers. Moreover, employing CAM techniques affords interpretability to the deep embeddings, which, unlike EAI, are not inherently interpretable. As a result, the proposed approach provided crucial time-of-day information to discriminate between classes.
Looking ahead, the exploration of alternative architectures as feature generators is essential. Additionally, the potential of other models, such as recurrent neural networks, to leverage the temporal information of soundscapes warrants investigation. Lastly, the application of unsupervised learning models to these data should be considered, given their independence from labeled datasets.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
Dias Fábio Felix, Pedrini Helio, Minghim Rosane (2021) Soundscape segregation based on visual analysis and discriminating features. Ecol Inform 61:101184
Pijanowski Bryan C, Farina Almo, Gage Stuart H, Dumyahn Sarah L, Krause Bernie L (2011) What is soundscape ecology? An introduction and overview of an emerging new science. Landscape Ecol 26:1213–1232
Odom Karan J, Araya-Salas Marcelo, Morano Janelle L, Ligon Russell A, Leighton Gavin M, Taff Conor C, Dalziell Anastasia H, Billings Alexis C, Germain Ryan R, Pardo Michael et al (2021) Comparative bioacoustics: a roadmap for quantifying and comparing animal sounds across diverse taxa. Biol Rev 96(4):1135–1159
Ross Samuel RP-J, O’Connell Darren P, Deichmann Jessica L, Desjonquères Camille, Gasc Amandine, Phillips Jennifer N, Sethi Sarab S, Wood Connor M, Burivalova Zuzana (2023) Passive acoustic monitoring provides a fresh perspective on fundamental ecological questions. Funct Ecol 37(4):959–975
Morrison Catriona A, Auniņš Ainars, Benkő Zoltán, Brotons L, Chodkiewicz T, Chylarecki P, Escandell V, Eskildsen DP, Gamero A, Herrando S et al (2021) Bird population declines and species turnover are changing the acoustic properties of spring soundscapes. Nat Commun 12(1):6217
Ramesh Vijay, Hariharan Priyanka, Akshay VA, Choksi Pooja, Khanwilkar Sarika, DeFries Ruth, Robin VV (2023) Using passive acoustic monitoring to examine the impacts of ecological restoration on faunal biodiversity in the Western Ghats. Biol Conserv 282:110071
Rappaport Danielle I, Swain Anshuman, Fagan William F, Dubayah Ralph, Morton Douglas C (2022) Animal soundscapes reveal key markers of Amazon forest degradation from fire and logging. Proc Natl Acad Sci 119(18):e2102878119
Tucker David, Gage Stuart H, Williamson Ian, Fuller Susan (2014) Linking ecological condition and the soundscape in fragmented Australian forests. Landscape Ecol 29:745–758
Oliveira Eliziane Garcia, Ribeiro Milton Cezar, Roe Paul, Sousa-Lima Renata S (2021) The Caatinga orchestra: acoustic indices track temporal changes in a seasonally dry tropical forest. Ecol Ind 129:107897
Fuller Susan, Axel Anne C, Tucker David, Gage Stuart H (2015) Connecting soundscape to landscape: Which acoustic index best describes landscape configuration? Ecol Ind 58:207–215
Burivalova Zuzana, Game Edward T, Butler Rhett A (2019) The sound of a tropical forest. Science 363(6422):28–29
Robinson Jake M, Breed Martin, Abrahams Carlos (2023) The sound of restored soil: measuring soil biodiversity in a forest restoration chronosequence with ecoacoustics. bioRxiv preprint
Eldridge Alice, Casey Michael, Moscoso Paola, Peck Mika (2016) A new method for ecoacoustics? Toward the extraction and evaluation of ecologically-meaningful soundscape components using sparse coding methods. PeerJ 6:2016
Sueur Jérôme, Farina Almo, Gasc Amandine, Pieretti Nadia, Pavoine Sandrine (2014) Acoustic indices for biodiversity assessment and landscape investigation. Acta Acust Acust 100(4):772–781
Bradfer-Lawrence Tom, Bunnefeld Nils, Gardner Nick, Willis Stephen G, Dent Daisy H (2020) Rapid assessment of avian species richness and abundance using acoustic indices. Ecol Indic 115:106400
Buxton Rachel T, McKenna Megan F, Clapp Mary, Meyer Erik, Stabenau Erik, Angeloni Lisa M, Crooks Kevin, Wittemyer George (2018) Efficacy of extracting indices from large-scale acoustic recordings to monitor biodiversity. Conserv Biol 32(5):1174–1184
Gómez William E, Isaza Claudia V, Daza Juan M (2018) Identifying disturbed habitats: a new method from acoustic indices. Ecol Inform 45:16–25
Castro-Ospina Andrés E, Rodríguez-Buritica Susana, Rendon Nestor, Velandia-García Maria C, Isaza Claudia, Martínez-Vargas Juan D (2022) Identification of tropical dry forest transformation from soundscapes using supervised learning. In international conference on smart technologies, systems and applications, pages 173–184. Springer
Do Nascimento Leandro A, Marconi Campos-Cerqueira, Beard Karen H (2020) Acoustic metrics predict habitat type and vegetation structure in the Amazon. Ecol Indic 117:106679
Chen Lei, Xu Zhiyong, Zhao Zhao (2023) Biotic sound SNR influence analysis on acoustic indices. Front Remote Sens 3:1079223
Metcalf Oliver C, Barlow Jos, Devenish Christian, Marsden Stuart, Berenguer Erika, Lees Alexander C (2021) Acoustic indices perform better when applied at ecologically meaningful time and frequency scales. Methods Ecol Evol 12(3):421–431
Bradfer-Lawrence Tom, Gardner Nick, Bunnefeld Lynsey, Bunnefeld Nils, Willis Stephen G, Dent Daisy H (2019) Guidelines for the use of acoustic indices in environmental research. Methods Ecol Evol 10(10):1796–1807
Heath Becky E, Sethi Sarab S, Orme C David L, Ewers Robert M, Picinali Lorenzo (2021) How index selection, compression, and recording schedule impact the description of ecological soundscapes. Ecol Evol 11(19):13206–13217
McGinn Kate, Kahl Stefan, Peery M Zachariah, Klinck Holger, Wood Connor M (2023) Feature embeddings from the BirdNET algorithm provide insights into avian ecology. Ecol Inform 74:101995
Sethi Sarab S, Ewers Robert M, Jones Nick S, Sleutel Jani, Shabrani Adi, Zulkifli Nursyamin, Picinali Lorenzo (2022) Soundscapes predict species occurrence in tropical forests. Oikos 2022(3):e08525
Dias Fábio Felix, Ponti Moacir Antonelli, Minghim Rosane (2022) A classification and quantification approach to generate features in soundscape ecology using neural networks. Neural Comput Appl 34(3):1923–1937
O’Mahony Niall, Campbell Sean, Carvalho Anderson, Harapanahalli Suman, Hernandez Gustavo Velasco, Krpalkova Lenka, Riordan Daniel, Walsh Joseph (2020) Deep learning vs. traditional computer vision. In Advances in Computer Vision: Proceedings of the 2019 Computer Vision Conference (CVC), volume 1, pages 128–144. Springer
Padovese Bruno, Kirsebom Oliver S, Frazao Fabio, Evers Clair HM, Beslin Wilfried AM, Theriault Jim, Matwin Stan (2023) Adapting deep learning models to new acoustic environments — a case study on the North Atlantic right whale upcall. Ecol Inform 77:102169
Quinn Colin A, Burns Patrick, Gill Gurman, Baligar Shrishail, Snyder Rose L, Salas Leonardo, Goetz Scott J, Clark Matthew L (2022) Soundscape classification with Convolutional Neural Networks reveals temporal and geographic patterns in ecoacoustic data. Ecol Ind 138:108831
Çoban Enis Berk, Pir Dara, So Richard, Mandel Michael I (2020) Transfer learning from YouTube soundtracks to tag Arctic ecoacoustic recordings. In ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 726–730. IEEE
Zhong Ming, LeBien Jack, Campos-Cerqueira Marconi, Dodhia Rahul, Ferres Juan Lavista, Velev Julian P, Aide T Mitchell (2020) Multispecies bioacoustic classification using transfer learning of deep convolutional neural networks with pseudo-labeling. Appl Acoust 166:107375
Tan Chuanqi, Sun Fuchun, Kong Tao, Zhang Wenchang, Yang Chao, Liu Chunfang (2018) A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part III 27, pages 270–279. Springer
Fairbrass Alison J, Firman Michael, Williams Carol, Brostow Gabriel J, Titheridge Helena, Jones Kate E (2019) CityNet-Deep learning tools for urban ecoacoustic assessment. Methods Ecol Evol 10(2):186–197
Kahl Stefan, Wood Connor M, Eibl Maximilian, Klinck Holger (2021) BirdNET: a deep learning solution for avian diversity monitoring. Ecol Inform 61:101236
Hershey Shawn, Chaudhuri Sourish, Ellis Daniel PW, Gemmeke Jort F, Jansen Aren, Moore R Channing, Plakal Manoj, Platt Devin, Saurous Rif A, Seybold Bryan et al (2017) CNN architectures for large-scale audio classification. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 131–135
Kim Bongjun, Pardo Bryan (2019) Improving content-based audio retrieval by vocal imitation feedback. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4100–4104
Tsalera Eleni, Papadakis Andreas, Samarakou Maria (2021) Comparison of pre-trained CNNs for audio classification using transfer learning. J Sens Actuator Netw 10(4):72
Sethi Sarab S, Jones Nick S, Fulcher Ben D, Picinali Lorenzo, Clink Dena Jane, Klinck Holger, Orme C David L, Wrege Peter H, Ewers Robert M (2020) Characterizing soundscapes across diverse ecosystems using a universal acoustic feature set. Proc Natl Acad Sci 117(29):17049–17055
Qiu Zhibin, Wang Haixiang, Liao Caibo, Lu Zuwen, Kuang Yanjun (2023) Sound recognition of harmful bird species related to power grid faults based on VGGish transfer learning. J Electr Eng Technol 18(3):2447–2456
Sethi Sarab S, Bick Avery, Ewers Robert M, Klinck Holger, Ramesh Vijay, Tuanmu Mao-Ning, Coomes David A (2023) Limits to the accurate and generalizable use of soundscapes to monitor biodiversity. Nat Ecol Evol 7(9):1373–1378
Wang Mei, Mei Jinjuan, Darras Kevin FA, Liu Fanglin (2023) VGGish-based detection of biological sound components and their spatio-temporal variations in a subtropical forest in eastern China. PeerJ 11:e16462
Cominelli Simone, Bellin Nicolò, Brown Carissa D, Rossi Valeria, Lawson Jack (2024) Acoustic features as a tool to visualize and explore marine soundscapes: applications illustrated using marine mammal passive acoustic monitoring datasets. Ecol Evol 14(2):e10951
Krause Bernie, Gage Stuart H, Joo Wooyeong (2011) Measuring and interpreting the temporal variability in the soundscape at four places in sequoia national park. Landscape Ecol 26:1247–1256
Fan Feng-Lei, Xiong Jinjun, Li Mengzhou, Wang Ge (2021) On interpretability of artificial neural networks: a survey. IEEE Trans Radiat Plasma Med Sci 5(6):741–760
Dong Shaojiang, Xia Zhengfu, Pan Xuejiao, Yu Tengwei (2023) Environmental sound classification based on improved compact bilinear attention network. Digital Signal Process 141:104170
Kim Nam Kyun, Kim Hong Kook (2021) Polyphonic sound event detection based on residual convolutional recurrent neural network with semi-supervised loss function. IEEE Access 9:7564–7575
Wu Bo, Zhang Xiao-Ping (2021) Environmental sound classification via time-frequency attention and framewise self-attention-based deep neural networks. IEEE Internet Things J 9(5):3416–3428
Hernández Alma, González Roy, Villegas Felipe, Martínez Sindy (2019) Bosque seco tropical. Monitoreo comunitario de la biodiversidad. Cuenca río Cañas
Guerrero González Ana María, Pérez Torres Jairo. Estructura y composición del ensamblaje de murciélagos de la Reserva Natural Rey Zamuro y Matarredonda en San Martín, Meta, Colombia
Rendón Hurtado Néstor David (2021) Acoustic heterogeneity of tropical dry forest based on identification of landscape transformation. Universidad de Antioquia, Facultad de Ingeniería
Gorelick Noel, Hancher Matt, Dixon Mike, Ilyushchenko Simon, Thau David, Moore Rebecca (2017) Google earth engine: planetary-scale geospatial analysis for everyone. Remote Sens Environ 202:18–27
Stowell Dan (2022) Computational bioacoustics with deep learning: a review and roadmap. Peer J 10:e13152
Bishop Christopher Michael, Bishop Hugh (2023) Deep learning: foundations and concepts, 1st edn. Springer
Goodfellow Ian, Bengio Yoshua, Courville Aaron (2016) Deep learning. MIT Press, Cambridge
Simonyan Karen, Zisserman Andrew (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Gemmeke Jort F, Ellis Daniel PW, Freedman Dylan, Jansen Aren, Lawrence Wade, Moore R Channing, Plakal Manoj, Ritter Marvin (2017) Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776–780. IEEE
Ulloa Juan Sebastián, Haupert Sylvain, Latorre Juan Felipe, Aubin Thierry, Sueur Jérôme (2021) scikit-maad: an open-source and modular toolbox for quantitative soundscape analysis in Python. Methods Ecol Evol 12(12):2334–2340
Muhammad Mohammed Bany, Yeasin Mohammed (2020) Eigen-cam: Class activation map using principal components. In 2020 International joint conference on Neural Networks (IJCNN), pages 1–7. IEEE
Gibb Kieran A, Eldridge Alice, Sandom Chris J, Simpson Ivor JA (2024) Towards interpretable learned representations for ecoacoustics using variational auto-encoding. Ecol Inform 80:102449
Giavina-Bianchi Mara, William Gois Vitor, Fornasiero Paiva Victor, Lissa Okita Aline, Machado Sousa Raquel, Birajara Machado (2023) Explainability agreement between dermatologists and five visual explanations techniques in deep neural networks for melanoma AI classification. Front Med 10:1241484
Prinzi Francesco, Insalaco Marco, Orlando Alessia, Gaglio Salvatore, Vitabile Salvatore (2024) A yolo-based model for breast cancer detection in mammograms. Cogn Comput 16(1):107–120
Thaker Keval, Chennupati Sumanth, Rawashdeh Nathir, Rawashdeh Samir A (2023) Multispectral deep neural network fusion method for low-light object detection. J Imag 10(1):12
Raghavendra S, Abhilash SK, Madhav Nookala Venu, Kaliraj S et al (2023) Efficient deep learning approach to recognize person attributes by using hybrid transformers for surveillance scenarios. IEEE Access 11:10881–10893
Acknowledgements
This work was supported by Universidad de Antioquia, Instituto Tecnológico Metropolitano de Medellín, Alexander von Humboldt Institute for Research on Biological Resources and Colombian National Fund for Science, Technology and Innovation, Francisco Jose de Caldas - MINCIENCIAS (Colombia) [Program No. 111585269779].
Funding
Open Access funding provided by Colombia Consortium.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Castro-Ospina, A.E., Rodríguez-Marín, P., López, J.D. et al. Leveraging time-based acoustic patterns for ecosystem analysis. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-10157-7