1 Introduction

With the advent of neural computing, sound classification systems have seen increasing real-world use across a wide range of applications [1, 2], including multimodal systems [3]. A practical application of these systems is recognizing health-related sounds, like coughing and sneezing, that supply insights into an individual's general well-being [4]. Such techniques can also detect significant lung sound anomalies such as wheezing and crackling, which previously could be identified only by highly skilled clinicians [5, 6]. Moreover, deploying machine-based sound surveillance systems becomes crucial as manufacturing and production lines grow more autonomous and move towards predictive machine maintenance, boosting production quality and yield while reducing downtime and costs [7]. Another imperative use case is automated environmental sound monitoring, for example of ocean habitats [8] and urban noise sources; these analyses can provide a more detailed understanding of the high noise-pollution levels that affect our well-being [9].

Fig. 1: We group each audio sample according to its grouping attribute, in this figure its original source. NeuProNet is then employed to learn from multiple samples at once and extract a profile embedding that represents each group's unique and distinctive characteristics

Recent methods using neural networks have enabled efficient, cost-effective, and timely classification of various sound categories [10, 11]. Nevertheless, the current state of this research direction has several drawbacks. Whereas previous works have shown the advantages of applying machine learning techniques and deep neural networks to the sound domain, several insightful aspects of sound are yet to be investigated. First, sound signals can carry noisy and biased features of their original source and recording environment. These factors can hurt performance, since models learn mixed characteristics linked to source information across diverse categories rather than downstream features [12,13,14]. Second, many existing methods consider each sound sample separately, even though samples may come from identical sources, highly similar settings, or the same recording backgrounds. Failing to utilize these signals makes models more likely to miss complementary features shared by the same provider and to be confused by irrelevant markers.

Prior works attempt to alleviate this by building separate versions of the same system for each specific device or patient [13, 15,16,17]. However, these methods have several limitations that affect their applicability and effectiveness. First, they often rely on learning from each individual's data only, without information and weight sharing across different groups or individuals, requiring multiple training stages [13, 15, 16] and a sufficient amount of data for each group [17]. Second, the learning capability and data quality vary across groups, which may lead to bias and fairness problems for specific groups. Third, existing methods are susceptible to reliability issues when a new user enters the system, as they may lack sufficient data or prior knowledge to provide accurate and relevant decisions [16]. To address these limitations, we focus on a unified, end-to-end neural network solution that enjoys the benefit of personalization through neural profiling. As depicted in Fig. 1, we propose neural networks that learn from multiple audio samples sharing a set of common attributes, instead of learning from each sample individually. These attributes can be based on the source, condition, location, or environment of the sound recordings: for example, it is feasible to group sounds from similar machines or humans, sounds from various conditions or backgrounds, or sounds from different places or situations.

Our main contributions are listed as follows:

  • We propose a novel neural profiling network, NeuProNet, capable of learning high-level personalized features through contrastive learning and an in-batch profile grouping mechanism. NeuProNet is designed to learn and extract reliable and robust profile embeddings from recordings in linked groups based on profile awareness and attention pooling.

  • We develop an end-to-end framework to handle and demonstrate the effectiveness of NeuProNet when plugged into various backbone architectures on downstream audio-based tasks, including environmental sound classification and human vocal sound detection.

  • We carry out extensive experiments on multiple benchmark datasets and tasks. Models under the guidance of neural profiling networks achieve significant performance gains across all tasks. In particular, our results surpass recent state-of-the-art (SoTA) approaches with statistically significant improvements on the UrbanSound8K and VocalSound datasets, with a substantial gap in benchmarking metrics of up to 20.19% in accuracy on UrbanSound8K.

  • Source codes of our framework and reproducible baselines are made publicly available at https://github.com/ReML-AI/NeuProNet for future research and benchmarking purposes.

2 Background

In this section, we survey existing methods related to our work, summarize their strengths and weaknesses, and emphasize the novelty of our approach.

2.1 Sound & audio classification

In the general domain of audio pattern recognition, traditional systems comprised classical statistical or machine learning models such as support vector machines [18]. Those systems take time-frequency representations, e.g., log mel-spectrograms or mel-frequency cepstral coefficients, as input. These representations also dominate among the neural network techniques that followed the traditional machine learning approaches. Convolutional neural network-based systems are the most widely used [19,20,21] and have outperformed recurrent neural network-based systems [22]. Recently, a new direction applies high-complexity transformers to this research domain; such transformer-based models enjoy flexibility in input features, taking either raw waveforms directly or two-dimensional time-frequency representations [23, 24].
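For illustration, a log mel-spectrogram input of shape \(T \times F\) can be computed with torchaudio; the file name and parameter values in this sketch are assumptions for illustration, not settings prescribed in this paper.

```python
import torchaudio

# Illustrative front-end; n_fft, hop_length, and n_mels are assumed values.
SAMPLE_RATE = 16000
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=512, n_mels=64
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform, sr = torchaudio.load("clip.wav")                       # (channels, D)
waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
log_mel = to_db(mel(waveform.mean(dim=0)))                       # (n_mels, T)
```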

Since the amount of data in these datasets is typically small for deep learning, transfer learning and pre-training have also been applied [25, 26]. These techniques allow models to scale up to tens of millions of parameters, but they also lead to massive computation, multiple training stages, and difficulty in representation learning for real-world systems. Differing from them, our proposed mechanism simultaneously learns personalized features for each origin-based group of samples alongside the typical downstream feature learning process, alleviating the need for transfer learning. The method enjoys comparable performance and can be trained more efficiently, in an end-to-end manner.

The two sub-fields of environmental sound classification and vocal sound detection have seen a similar trend, as discussed above. However, most previous works focus on only one of the two tasks. To the best of our knowledge, this is one of the pioneering works to bridge the gap between the two sub-domains.

2.2 From personalization to profiling

Personalization has been successfully applied in numerous areas, such as recommender and search systems [27] and medical applications [28]. In sound processing research, most personalized profiling works focus on speech, including speech emotion recognition [29], speech enhancement [30], and humor detection [16]. A typical personalized machine learning method leverages additional cues, such as metadata related to each user. Another primary direction derives a user embedding from additional samples of the same user. For instance, Triantafyllopoulos et al. [29] proposed an enrolment-based method that learns from other voice samples of the present speaker to modify the main encoder's output for emotion recognition. In [31], the authors leveraged a speaker's noisy data, commonly disregarded, as pseudo-sources for a speech enhancement model. Our work follows the second approach to fulfill privacy compliance without utilizing any additional metadata. Moreover, our profiling framework applies to sound sources of both humans and surrounding habitats.

Fig. 2: Our neural profile learning framework. Batched input spectrograms of shape \(B\times T \times F\) are fed to both branches of the framework simultaneously. The first branch follows a common approach in sound classification, typically a CNN-based or transformer-based model. The second branch is our profile learning method, NeuProNet. The final outputs of the two branches are concatenated (depicted by the \(\oplus \) operator) before being passed to a classification layer

To the best of our knowledge, the work from Dang et al. [32] is the study in sound pattern recognition most closely connected to ours. Treating each user's samples over time as a time series fed into an LSTM network, they examined the potential of longitudinal audio samples for predicting COVID-19 progression and recovery trends. However, they only worked on a private dataset with a binary task of COVID-19 detection from coughs. In this work, we develop a neural profiling network that performs efficiently on a wide range of sound-related tasks.

3 Methodology

In this section, we present our proposed framework for neural profiling. The ultimate goal of our approach is to push the boundary of any neural network backbone on the downstream task by grounding each prediction to its respective profile embedding. As shown in Fig. 1, the framework consists of these steps:

  • Profile linking/grouping: from each grouping attribute, such as recording device, speaker identifier, or recording environment, we gather audio samples for each source. Differing from prior works which deploy multiple grouping attributes, including sensitive ones, e.g., gender and age, to train models, we extract each source’s profile embedding directly from sound samples, thereby profiling without sensitive information.

  • Profile learning: deep neural networks are designed to learn and extract feature-rich profile embeddings from multiple samples in linked groups with contrastive and classification objectives.

  • Profile extraction: our framework allows an end-to-end integration into any existing sound classification works for training and predicting with profile features in conjunction with audio features. Our approach improves neural computing systems even when the amount of available data is small or moderate.

This is enabled through the most essential component of the framework, NeuProNet, our neural network architecture developed to learn high-level profile representation through a candidate set of multiple audio samples. The overall architecture is depicted in Fig. 2.

3.1 Problem definition

Formally, define \(\mathcal {D} = \{(X_i, {\text{g}}_i, y_i)\}_{i=1}^{N_\mathcal {D}}\) as a dataset of \(N_\mathcal {D}\) samples drawn i.i.d. from a data distribution. Each sample consists of an audio recording, denoted by \(X_i \in \mathbb {R}^D\) in its raw waveform representation or \(X_i \in \mathbb {R}^{T\times F}\) in its spectral form (e.g., a spectrogram, which we use in this work). Additionally, each sample includes a group indicator \({\text{g}}_i\in [\text{G}]\), where \(\text{G}\) denotes the number of data sources, and a class label \( y_i \in \mathbb {R} \).

Typically, a learning algorithm adapts a generic classifier \(h_{\theta }: \mathbb {R}^{T\times F} \rightarrow \mathbb {R}\) that maps from input variable \(X_i\) to class label \(h_\theta (X_i)\). After the training phase, \(h_\theta \) is able to perform a classification task with an empirical loss (or risk) on \(\mathcal {D}\) as:

$$\begin{aligned} L(h_\theta , \mathcal {D}) = \frac{1}{N_\mathcal {D}} \sum _{i=1}^{N_\mathcal {D}} l(h_\theta (X_i),y_i) \end{aligned}$$
(1)

where \(N_\mathcal {D}\) denotes the total number of samples in the dataset \(\mathcal {D}\).

Each individual who contributes to the final dataset expects the algorithm to be tailored to their group of samples and, in turn, to yield a gain in performance. In the best case, a tailored classifier \(h_\theta ^{\text{g}}: \mathbb {R}^{T\times F} \times [\text{G}] \rightarrow \mathbb {R}\) that uses group attributes will have:

$$\begin{aligned} L(h_\theta ^{\text{g}}, \mathcal {D}_{\text{g}}) =&\frac{1}{N_{\text{g}}} \sum _{i:{\text{g}}_i={\text{g}}} l(h_\theta ^{\text{g}}(X_i, {\text{g}}_i), y_i ) < L(h_\theta , \mathcal {D}_{\text{g}}), \nonumber \\&\forall {\text{g}} \in [\text{G}] \end{aligned}$$
(2)

where \(N_{\text{g}}\) denotes the number of samples belonging to group \({\text{g}}\).

This leads to:

$$\begin{aligned} L(h_\theta ^{\text{g}}, \mathcal {D}) < L(h_\theta , \mathcal {D}) \end{aligned}$$
(3)

This means that by incorporating additional group features into the learning process, the classifier achieves better overall performance on the dataset \(\mathcal {D}\).

In conventional practice, a separate set of model parameters for each group is trained on that group's data alone. This forces several copies of the same model architecture to be stored, and each copy is unable to make use of the larger dataset. Another drawback is that sensitive metadata (such as sex or health condition) must be used to capture each group's features successfully. As discussed in the following sections, we propose a neural profile learning framework that enjoys the advantages of profiling in one single unified model, based only on sound data and without sensitive metadata.

3.2 Neural profiling network

In our proposed framework for neural profiling, the goal is to learn unique features for each group of samples. This is achieved through the neural profile learning mechanism by analyzing each group's provided samples, which we call candidate samples with respect to the current sample of the downstream task. As mentioned in the previous section, an effective analysis of the candidate set can help a tailored classifier gain better performance.

Our proposed approach leverages an additional neural network dedicated to learning from groups of samples, in contrast to the backbone branch, where each input sample is processed separately. By running this branch concurrently with the backbone branch, the backbone network can remain intact, with its existing structure and optimal settings, while benefiting from the neural profile learning process. Additionally, our approach makes full use of the embeddings learned and shared by the backbone branch alongside its dedicated neural architecture. This flexibility allows any backbone network to be plugged into the system.

Moreover, a deeper analysis shows that the candidate set may contain samples from any class of the target task; if handled poorly, it contributes additional noise to the model, and an ineffective profile learning strategy would confuse the model and hurt performance. Previous works tend to require an external dataset as the set of candidate samples to avoid this confusion. Instead, we propose extracting these tailored features directly from the input samples rather than collecting them from another corpus. This reduces the need and cost for collecting additional data and allows new representations to be learned more efficiently from the same data. Intuitively, by combining the embeddings learned by the two branches of the framework, we can leverage the strengths of each representation and compensate for their weaknesses, as each captures certain aspects of the data.

The feature vector extracted by the backbone network in the traditional manner is:

$$\begin{aligned} f^\text{b}_i = h_\theta ^\text{b}(X_i) \end{aligned}$$
(4)

This is combined with the profile feature from the neural profiling branch, averaged over the sample's group:

$$\begin{aligned} f^\text{p}_i = \frac{1}{N_{{\text{g}}_i}} \sum _{j:{\text{g}}_j={\text{g}}_i} h_\theta ^\text{p}(X_j) \end{aligned}$$
(5)

Thereby, the final prediction for sample i is described below:

$$\begin{aligned} \hat{y_i}=\mathcal {W}(f^\text{b}_i \oplus f^\text{p}_i) \end{aligned}$$
(6)

where \(\mathcal {W}\) denotes a linear classifier head and \(\oplus \) denotes the concatenation operation. This allows the two branches to learn and update concurrently. Each branch learns from different input data: the backbone network learns signals from a single audio sequence, whereas NeuProNet learns from a group of multiple audio sequences. Deploying both branches concurrently, instead of stacking them, enables the backbone network to retain its original architecture while a separate network handles neural profiling. We can utilize the benefits of both branches for the downstream task, as the representations learned from the backbone branch and the profile learning branch differ and complement each other. While there are studies on more sophisticated methods for fusing representations between multiple branches [33], we employ a simple concatenation operator, as we aim for NeuProNet to be a new SoTA baseline that is as easy as possible to apply to any pre-existing network. Nevertheless, our experiments and analyses show strong signals from both branches, and we expect performance could be further improved with more advanced fusion techniques. In the worst-case scenario, the model can still perform adequately by relying on the backbone features through the fully connected linear head.
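A minimal PyTorch sketch of Eqs. (4)-(6) follows; the class and argument names are ours for illustration, and any encoders producing fixed-size embeddings can stand in for the two branches.

```python
import torch
import torch.nn as nn

class TwoBranchClassifier(nn.Module):
    """Sketch of Eqs. (4)-(6): backbone features concatenated with a
    mean-pooled profile embedding over each sample's group."""

    def __init__(self, backbone, profile_net, d_b, d_p, num_classes):
        super().__init__()
        self.backbone = backbone        # h^b: (B, T, F) -> (B, d_b)
        self.profile_net = profile_net  # h^p: (B, T, F) -> (B, d_p)
        self.head = nn.Linear(d_b + d_p, num_classes)  # W in Eq. (6)

    def forward(self, x, group_ids):
        f_b = self.backbone(x)          # Eq. (4)
        f_pre = self.profile_net(x)     # per-sample profile features

        # Eq. (5): average profile features over samples sharing a group id.
        same_group = (group_ids.unsqueeze(0) == group_ids.unsqueeze(1)).float()
        weights = same_group / same_group.sum(dim=1, keepdim=True)
        f_p = weights @ f_pre           # row i = mean over {j : g_j = g_i}

        # Eq. (6): concatenate branch outputs and classify.
        return self.head(torch.cat([f_b, f_p], dim=-1))
```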

Our framework enables both neural network branches to extract separate representations and to be trained concurrently on the classification task. To aid information sharing between the two branches, we use a softmax attention pooling mechanism, in which softmax scores of the dot products between a sample's backbone embedding and its candidate set's profiling features weight those profiling features (profile awareness). This allows the system to filter out noisy features and highlight valuable information from the neural profiling branch, improving not only the profile learning task but also the backbone branch's learning process:

$$\begin{aligned} f^\text{pre}_i&= h_\theta ^\text{p}(X_i) \nonumber , \\ f^\text{p}_{i}&= \sum _{j:{\text{g}}_j={\text{g}}_i} \frac{\text{exp}(f^\text{pre}_j \cdot f^\text{b}_i / \tau )}{ \sum _{k:{\text{g}}_k={\text{g}}_i} \text{exp} (f^\text{pre}_k \cdot f^\text{b}_i / \tau ) } f^\text{pre}_j \end{aligned}$$
(7)

where \(\text{exp}\) is the exponential function, and \(\tau \) is the temperature hyperparameter.
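A sketch of Eq. (7) under the assumption that the two branches share an embedding dimension (so the dot product is defined); the function name and temperature value are illustrative.

```python
import torch

def profile_attention_pool(f_pre, f_b, group_ids, tau=0.1):
    """Eq. (7): profile-aware softmax attention pooling over each
    sample's candidate set. f_pre, f_b: (B, d); group_ids: (B,)."""
    scores = (f_b @ f_pre.T) / tau                           # row i: f^b_i . f^pre_j
    same_group = group_ids.unsqueeze(0) == group_ids.unsqueeze(1)
    scores = scores.masked_fill(~same_group, float("-inf"))  # keep j with g_j = g_i
    attn = torch.softmax(scores, dim=1)                      # softmax over candidate set
    return attn @ f_pre                                      # (B, d): f^p_i
```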

Moreover, while our framework is designed to plug into any backbone architecture, NeuProNet leverages a Transformer-based network for the neural profiling branch, owing to the transformer's highly efficient attention mechanism and its relative freedom from inductive biases. Additionally, to learn meaningful profiles, we employ a supervised contrastive learning loss [34], where the labels used to form positive and negative pairs are obtained from group attributes, e.g., patient ID or recording environment: samples from the same group are positives and are pulled closer together, while negative samples are drawn from other groups.

$$\begin{aligned} L_\text{CL} = -\log \frac{\text{exp}(\text{sim}(f^\text{pre}_i, f^\text{pre}_{j, {\text{g}}_j = {\text{g}}_i}))}{ \sum _{k:{\text{g}}_k \sim [\text{G}] \backslash {\text{g}}_i} \text{exp} ( \text{sim} (f^\text{pre}_i, f^\text{pre}_k)) } \end{aligned}$$
(8)

where \(\text{sim}\) is a similarity function such as the cosine similarity.
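The following is a simplified sketch of Eq. (8), a one-anchor variant of the supervised contrastive loss [34] with group ids as labels; the temperature term and function name are our additions for illustration.

```python
import torch
import torch.nn.functional as F

def group_contrastive_loss(f_pre, group_ids, tau=0.1):
    """Eq. (8) sketch: same-group pairs are positives, other groups
    supply the negatives in the denominator."""
    z = F.normalize(f_pre, dim=1)                   # cosine similarity via dot product
    sim = (z @ z.T) / tau                           # (B, B) pairwise similarities
    same = group_ids.unsqueeze(0) == group_ids.unsqueeze(1)
    neg = ~same                                     # negatives: different groups
    pos = same.clone().fill_diagonal_(False)        # positives: same group, not self

    losses = []
    for i in range(z.size(0)):
        if pos[i].any() and neg[i].any():
            pos_term = sim[i][pos[i]].mean()        # average similarity to positives
            losses.append(torch.logsumexp(sim[i][neg[i]], dim=0) - pos_term)
    return torch.stack(losses).mean() if losses else f_pre.new_zeros(())
```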

The model is trained in an end-to-end manner through back-propagation via a cross-entropy loss function. Both branches can be updated through gradient descent.

$$\begin{aligned} L_\text{CE} = -\sum _{i} \left[ y_i\log (\sigma (\hat{y_i}))+(1-y_i)\log (1-\sigma (\hat{y_i})) \right] \end{aligned}$$
(9)

where \(\sigma (\cdot )\) denotes the sigmoid activation function.

The final loss function for the NeuProNet branch is:

$$\begin{aligned} L_{NeuProNet} = \lambda L_\text{CE} + (1-\lambda ) L_\text{CL} \end{aligned}$$
(10)

where \(\lambda \) is treated as a tuned hyperparameter.

Note that, naively, one would retrieve candidate samples from the entire dataset or another corpus, leading to slow training and inference. To alleviate this, we propose grouping samples using only those available in the current batch of input data. The candidate set may, on average, be reduced in size; however, our ablation study shows that with this mechanism our approach achieves the same level of performance while remaining fast in both stages.
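Putting the pieces together, a training step with in-batch grouping might look as follows; it reuses the hypothetical helpers sketched above, uses the standard multi-class cross-entropy in place of the binary form of Eq. (9), and assumes a group-aware sampler so each batch contains several samples per group.

```python
# One training step of Eq. (10) with candidate sets formed inside the batch.
for spec, group_ids, labels in loader:                    # spec: (B, T, F)
    f_b = backbone(spec)                                  # backbone branch
    f_pre = profile_net(spec)                             # profiling branch
    f_p = profile_attention_pool(f_pre, f_b, group_ids)   # groups resolved in-batch
    logits = head(torch.cat([f_b, f_p], dim=-1))          # Eq. (6)

    loss = lam * F.cross_entropy(logits, labels) \
         + (1 - lam) * group_contrastive_loss(f_pre, group_ids)   # Eq. (10)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```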

4 Experiments

This section details the experimental procedures, datasets, evaluation metrics, and implementation designs utilized to evaluate the proposed network’s performance and benchmark it against baselines and SoTA approaches.

4.1 Datasets and evaluation metrics

We validate the proposed framework on two different publicly available sound classification datasets, namely UrbanSound8K and VocalSound. A detailed description of these datasets is presented below.

4.1.1 UrbanSound8K

The UrbanSound8K dataset [35] includes 8732 mono and stereo samples divided into ten categories: air conditioner, car horn, children playing, dog barking, drilling, engine idling, gunshot, jackhammer, siren, and street music. The number of samples per class varies, making this an imbalanced dataset, and the average recording duration is not distributed evenly across classes. Each track is up to 4 s long with a native sampling rate ranging from 16 to 48 kHz. As a result, the collection offers a realistic representation of daily ambient noise, including both machine- and human-based sounds; other urban datasets are less suited since they are more predictable and less “naturalistic”.

We construct each group, or candidate set, from samples split from the same original recorded audio, as indicated in the metadata included with the dataset. While an original Freesound recording may contain signals from different classes at different timestamps, each split excerpt provided in the dataset consists of a signal for its class plus background noise tied to the recording environment, e.g., inside or outside a house. These noise signals can be common across a group's samples.

The authors of the dataset separated it into ten folds, which we employ to conduct our evaluation. As discussed in previous works [35, 36], following the official split is crucial to avoid data leakage; therefore, we only consider prior works that experiment on the official split of UrbanSound8K, and we do not compare with customized splits such as in [37], which are not reproducible. Moreover, while our framework is not limited to transfer learning, pre-training, or multimodal training paradigms, we focus on fair comparisons with other end-to-end methods on audio data.
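As a sketch, the grouping and fold split can be read from the dataset's metadata file; the column names below follow the distributed UrbanSound8K.csv, where fsID identifies the original Freesound recording.

```python
import pandas as pd

# Candidate sets group excerpts sliced from the same original recording (fsID);
# the 'fold' column gives the official ten-fold split.
meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")
candidate_sets = meta.groupby("fsID")["slice_file_name"].apply(list)

test_fold = 10                                   # evaluate each fold in turn
train_meta = meta[meta["fold"] != test_fold]
test_meta = meta[meta["fold"] == test_fold]
```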

As some classes in the dataset have fewer samples than others, we report both accuracy and AUC, along with other metrics, to account for the issue of data imbalance.

4.1.2 VocalSound

VocalSound [12] is a publicly accessible dataset containing 21,024 crowdsourced audio recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3365 unique speakers. Meta information about each speaker, namely an anonymized speaker label, age, gender, native language, nationality, and health status, is also included. The dataset supports research on removing bias across speaker groups, as its authors found that an EfficientNet [38]-based model exhibits biases for certain groups.

We use the group of samples provided by each person as a candidate set. In general, each speaker provided around six non-speech sounds, each belonging to a separate class. This setting also fits applications such as patient health monitoring, where anomaly signals like coughs and sneezes can be detected over a span of time and used as indicators of an individual's general well-being.

We follow the evaluation strategy of the original paper, where the dataset is split into training, validation, and testing sets. Additionally, we repeat each experiment three times and report the mean and standard deviation of the evaluation statistics, with accuracy as the main benchmarking metric.

4.2 Benchmarking methods

As we are interested in the efficacy of the neural profiling framework, we compare various backbone networks before and after NeuProNet is plugged in. These implementations are heavily inspired by SoTA audio classification models, given their popularity and ease of reproducibility.

  • Resnet: we deploy a vanilla Resnet18 [39] model, a CNN variant that uses residual blocks. While originally proposed for computer vision tasks, recent research has applied such residual networks effectively to audio signal processing through the spectrogram representation of sound.

  • EfficientNet: EfficientNet-B0 [38] is a convolutional neural network also built on efficient residual convolution blocks, with the ability to scale the network along all dimensions (width, depth, and input resolution). This model has been shown to obtain high performance in sound classification tasks.

  • Transformer: there has been a surge of research applying the transformer architecture to audio data. In this work, we use a model that stacks multiple transformer-encoder blocks; a minimal sketch of adapting such backbones to spectrogram input follows this list.
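As an implementation-level sketch (an assumption for illustration, not the paper's exact configuration), a stock Resnet18 can be adapted to one-channel spectrogram input by replacing its first convolution:

```python
import torch.nn as nn
import torchvision.models as models

# Spectrograms have one channel, so the stock three-channel stem is replaced.
backbone = models.resnet18(num_classes=128)      # 128-d feature head, illustrative
backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
```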

Moreover, we also compare results with previous SoTA methods benchmarked on the same (fair) data splits. Our framework not only achieves significant performance gains across all tasks and datasets compared to backbones without NeuProNet, but also allows these naive backbone architectures to obtain comparable or even better performance than previous works.

In order to assess the proposed approach and compare it with the existing works, accuracy is our primary evaluation metric. Other standard evaluation metrics such as precision, recall, F1, and AUC are also reported for future benchmarking purposes.

4.3 Training and implementation details

The proposed framework was developed in PyTorch, and experiments were carried out on an NVIDIA A100 GPU. Detailed information about our hyperparameter settings is provided in Table 1. These values were chosen based on previous studies and kept consistent across the training baselines, with or without NeuProNet. The \(\lambda \) value in Eq. (10) is set to 0.9999 for experiments on UrbanSound8K and to 0.99 on VocalSound, based on the scale of the contrastive learning loss on each dataset. The other values are set empirically through grid search. As mentioned in Sect. 4.1, we adhere to the official data splits and benchmarking strategy specified by the authors of each dataset. To ensure fair comparisons, we also adhered to the data augmentation techniques and class-imbalance weighting methods used in previous works. All model training runs completed in under 2 h, and there was no additional overhead in training or inference time when NeuProNet was integrated, because NeuProNet operates in parallel with the backbone branch.

Table 1 Hyperparameters used in the experimental setup

5 Results and discussion

5.1 Results on UrbanSound8K

In our study, we perform experiments both with and without augmentation to ensure reliability and reproducibility, since certain data augmentation techniques can greatly reduce the reproducibility of the process. The results are shown in Tables 2 and 3.

Table 2 Comparison of results against various baseline methods on UrbanSound8K for settings without augmentation
Table 3 Comparison of results against various baseline methods on UrbanSound8K for settings with augmentation

Popular backbones (e.g., EfficientNet, Resnet, and Transformer) are taken directly from the public domain of existing deep learning approaches, without any customization for the audio domain. Without NeuProNet, these backbones individually achieve lower performance than the current SoTA without augmentation. Nevertheless, our NeuProNet approach demonstrates great improvements with large gaps, regardless of the backbone's characteristics and architecture. This reliability and robustness stem from our neural profiling mechanism, which learns novel representations of the data in addition to the information learned by the backbone branch. Notably, we observe a massive boost in the accuracy of the EfficientNet model, up to 20.19% compared against the backbone without the neural profiling network; this improvement is statistically significant (two-tailed t-test; \(p < 0.01\)). Furthermore, when compared with SoTA solutions, NeuProNet allows simple backbone architectures to outperform all other approaches on the dataset without any specific modifications, with a gap of 5.92% between EfficientNet with the neural profiling network and the previous SoTA, AUCO Resnet [42]. The accuracy of 83.75% for NeuProNet+EfficientNet is also comparable with the SoTA method with augmentation. Our framework accomplishes these results while maintaining lower complexity in terms of the number of trainable parameters.

Fig. 3: Visualization of the weights of the linear layer immediately after the feature concatenation between the two branches of our framework, for experiments on UrbanSound8K

Fig. 4: t-SNE visualizations of learned embeddings on the UrbanSound8K dataset. Legends denote the class labels. Embeddings of the backbone branch are shown in the left panel and embeddings of the neural profiling branch in the middle panel; concatenated embeddings of the two branches are shown in the right panel

With data augmentation, the training process becomes even more costly. Following common approaches, we deploy SpecAugment [43] at training time, applying frequency and time masking to the spectrograms of training audio clips. Augmentation can obscure the patterns that identify each class, leading to degraded performance in 2 out of 3 backbone models (Resnet and Transformer). However, with the addition of the neural profiling network, both NeuProNet+Resnet and NeuProNet+Transformer achieve even higher performance than in the setting without augmentation. This indicates that our proposed mechanism is robust against randomly added noise and confusion where traditional backbones fail. Consistent with the results without augmentation, all three backbones gain substantially when NeuProNet is plugged in, by up to 14.07%. Moreover, the highest accuracy of 83.34%, obtained with the Transformer backbone and NeuProNet, is comparable with other SoTA results while using a much less powerful backbone architecture and a more straightforward augmentation strategy, allowing for better reproducibility.
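For reference, frequency and time masking in the style of SpecAugment are available in torchaudio; the mask widths below are illustrative values, not settings prescribed in this paper.

```python
import torchaudio

# SpecAugment-style masking applied to a log mel-spectrogram (..., n_mels, T).
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=16)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=32)

augmented = time_mask(freq_mask(log_mel))
```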

We visualize the weights of the first linear layer, which takes the concatenated feature vector of the backbone and NeuProNet branches, to analyze the contribution of each branch to the downstream task. Figure 3 shows the heatmap of the layer on the left, along with the frequency distribution of the weight values, split by the branch each weight is associated with. The heatmap shows that weights associated with NeuProNet are noticeably more activated, and the frequency distribution indicates that features learned by NeuProNet have a larger effect, with more of its associated weights lying near the tails of the distribution than those of the backbone branch. This suggests that NeuProNet has a greater impact on the framework's final decision than the backbone.

All the results indicate the efficacy of the neural profiling network in surpassing previous methods by learning new information and signals from the candidate set for environmental sound classification on UrbanSound8K. In addition to the benchmarking metric (accuracy), we also compute precision, recall, F1 score, and AUC ROC, reported in Tables 2 and 3 for the settings without and with augmentation, respectively. In general, our proposed neural profiling network yields better performance across all metrics with large gaps, demonstrating the generalizability of our method without overfitting to the benchmarking metric.

Table 4 Comparison of other metrics besides the benchmarking metric (accuracy) on VocalSound for settings without augmentation
Table 5 Comparison of results on VocalSound for settings with augmentation

To better understand the embeddings learned by each branch and the concatenated embeddings, we use t-distributed stochastic neighbor embedding (t-SNE) to obtain the two-dimensional visualization shown in Fig. 4. The embeddings of the neural profiling branch clearly provide different signals that are cleaner and stronger than those of the backbone branch. While the backbone embeddings show signs of clustering by class, each cluster is sparse and tends to overlap with others. The concatenated embeddings of the backbone and neural profiling branches, however, exhibit greater distances between clusters, and each cluster is denser than before, allowing for more reliable classification.

5.2 Results on VocalSound

As analyzed in the previous section, this dataset may introduce a level of confusion to typical personalization strategies, since the candidate set is drawn from classes different from the input sample's. We are interested in how our framework overcomes these challenges. Results are presented in Tables 4 and 5. To the best of our knowledge, this is one of the first works to perform benchmarking on this dataset; therefore, we focus on a comparison with the baseline method accompanying the dataset, which is based on the EfficientNet-B0 architecture for its efficiency, as indicated in Sect. 4.2. We deploy a re-implementation of their method for fairer comparison and follow the original training setup of the dataset's paper, using SpecAugment as the augmentation strategy. We also conduct training experiments without augmentation to cover more scenarios.

For training without augmentation, as shown in Table 4, the performance of backbone methods without the neural profiling mechanism varies compared to training with augmentation: the accuracies of the Effnet and Resnet models decreased, while the accuracy of the backbone Transformer model increased. However, NeuProNet still provides a positive gap when plugged in, up to 1.30% in the case of the Transformer-based model. The improvement earned by plugging in NeuProNet when training with augmentation is comparable to that without augmentation, indicating that our proposed approach is robust and not affected by augmentation in the way backbone networks are. These phenomena are also seen in Table 4. A potential explanation is that NeuProNet learns valuable and robust embeddings from the candidate set that support decisions in the downstream task.

In the setting with augmentation, we achieve a new SoTA result, with accuracy scores of 93.98% and 93.91% when neural profiling networks are combined with the Resnet and Effnet backbones, respectively. Our highest accuracy beats the dataset's previous baseline by 3.48%, indicating the effectiveness of the proposed profiling strategy. Table 5 also clearly shows that our solutions with NeuProNet consistently outperform their backbone versions across all metrics and backbone architectures. The most significant difference is seen when a Transformer is chosen as the backbone, where the neural profiling network improves accuracy by 2.36%, from 86.06% to 88.42%. This result is statistically significant (two-tailed t-test, \(p < 0.05\)).

One observation is that the improvement from NeuProNet is more significant on UrbanSound8K than on VocalSound. This suggests that the difficulty of VocalSound, where each candidate set is drawn from classes different from the input sample's, affects the features learned by the neural profiling network for supporting downstream tasks; in other words, it is harder to extract information from the candidate set relevant to the current sample's label when the candidates belong to different classes. Moreover, the performance of the baseline and backbones on each dataset indicates that VocalSound may be easier to solve than UrbanSound8K, so the embeddings learned by the backbone models already incorporate more task-relevant information, and the embeddings learned by the neural profiling branch can help in fewer cases. This is also verified by the visualization of the weights of the linear layer taking the concatenated embeddings of the two branches, illustrated in Fig. 5: the weight distributions dedicated to the two branches are more similar to each other, and the heatmap on the left is more evenly distributed than on UrbanSound8K, indicating that both branches contribute more equally to the final prediction.

Fig. 5: Visualization of the weights of the linear layer immediately after the feature concatenation between the two branches of our framework, for experiments on VocalSound

Furthermore, we investigate possible bias in our profiling networks, as there has been an ongoing concern about bias in classification results with respect to speaker groups [12]. Table 7 shows that the baseline method was biased towards younger audiences and towards female over male speakers. Our proposed neural profile learning approach, in contrast, takes each speaker's individual profile into account through the candidate set, allowing all speakers to benefit from the system regardless of age or gender, as shown in Tables 6 and 7. The implications of these experiments are twofold. First, for each pair sharing the same backbone architecture, the system with neural profiling networks obtains better performance across all speaker groups. Second, the variance between groups is much lower for solutions with neural profiling networks, with the most significant gap for the Transformer model in the setting with augmentation: from 1.94 to 0.16 variance between age groups, and from 2.69 to 1.60 between gender groups. This indicates that NeuProNet helps achieve more reliable and fair classification results.

The above insights show that, in general, NeuProNet yields better performance, even though confusion in the candidate set may limit the gains. Nevertheless, our proposed neural profiling network improves performance regardless of dataset and model.

Table 6 Comparison of accuracy on each sub-group of users that contributed samples to the test set on VocalSound for settings without augmentation
Table 7 Comparison of accuracy on each sub-group of users that contributed samples to the test set on VocalSound for settings with augmentation

We also use t-SNE to visualize the learned embeddings of each branch and the concatenated embeddings on the VocalSound dataset, as illustrated in Fig. 6. Similar to the results on UrbanSound8K, the class clusters in the backbone embeddings still overlap; with the support of the neural profiling framework, however, the concatenated embeddings form highly separated and denser clusters.

Fig. 6: t-SNE visualizations of learned embeddings on the VocalSound dataset. Legends denote the class labels. Embeddings of the backbone branch are shown in the left panel and embeddings of the neural profiling branch in the middle panel; concatenated embeddings of the two branches are shown in the right panel

5.3 Ablation study

In the ablation study, we carry out experiments to further analyze the effectiveness of the proposed NeuProNet. Components of the framework are removed one at a time, and the results are shown in Table 8. We report results with the Transformer backbone on both UrbanSound8K and VocalSound, but the findings generalize across backbone architectures. While all modules contribute to the performance gain of the proposed framework, the profile grouping mechanism, i.e., the method of ensembling features from the candidate set, is the most valuable component. On the other hand, the setting without the grouping mechanism achieves performance equal to the backbone network, meaning that when group labels are unavailable, the framework does not hurt the model's performance.

Table 8 Results of ablation study where we set out to understand contributions of each component of our proposed NeuProNet framework

6 Conclusion

Overall Conclusion. We propose a novel method for sound classification, NeuProNet, a neural profiling network capable of learning latent profile representations shared between audio samples from identical sources. Differing from prior works, we explore the capability of neural networks in a new learning regime, with an in-batch profile grouping mechanism and a contrastive loss on group attributes to learn high-level profile embeddings. NeuProNet can be plugged into any backbone architecture, e.g., EfficientNet, Resnet, or Transformer, and trained in an end-to-end manner. Through extensive experiments and analyses, we show that NeuProNet learns and extracts robust and reliable features, obtaining consistently high performance across different backbone networks and datasets. We evaluate our framework on multiple metrics, namely accuracy, precision, recall, F1 score, and ROC AUC. In particular, we achieve substantial improvements of up to 5.92% in accuracy over the previous SoTA approach and up to 20.19% over the baseline. Moreover, by producing profile representations, NeuProNet allows all sound sources, e.g., speakers, to benefit from the system regardless of meta attributes such as gender or age, yielding reliable and fair classification results that overcome the bias problems of previous approaches. Future work may further explore latent profile characteristics or more advanced fusion methods between backbone features and profiles, to be less dependent on backbone performance. Another interesting direction is to apply training paradigms such as pre-training and semi-supervised learning to learn profile features from large unlabeled datasets.

Practical Implications. The proposed NeuProNet and neural profiling framework have the potential to significantly improve the performance and efficiency of various sound-related applications, benefiting a wide range of industries and domains, such as acoustic system monitoring and surveillance, environmental sound classification, and sound anomaly detection. By extracting a high-level profile representation for each machine based on its ID (a specific unit) or machine class (a specific type), NeuProNet helps achieve more accurate device anomaly detection, reducing false alarms and maintenance costs. Furthermore, our framework can also help in sound surveillance systems: as more samples become available the longer a system runs, the profile representation extracted by NeuProNet becomes more accurate and robust to noise, leading to even higher performance in downstream tasks. Overall, the proposed framework has the potential to enhance the reliability and effectiveness of sound-related applications across various domains.

Broader Impact. In this work, we develop a neural profile learning framework that allows any backbone network to be combined with our proposed neural profiling network and gain significant improvements on various sound-related tasks. Moreover, we show that these improvements are consistent and scalable regardless of backbone architecture or dataset. In other words, pre-existing systems can be plugged into our framework and enjoy robust, high performance without adverse side effects.