1 Introduction

The field of assessing the emotional content of generalized sounds, including speech, music and sound events, is attracting the interest of an ever increasing number of researchers [12, 15, 16, 17, 21, 25]. However, there is still a gap regarding works addressing the specific case of soundscapes, i.e. the combination of sounds forming an immersive environment [20]. Soundscape emotion prediction (SEP) focuses on understanding the emotions perceived by a listener of a given soundscape. A soundscape may comprise the stimuli necessary for a listener to manifest different emotional states and/or actions; for example, one may feel joyful in a natural environment. Such contexts demonstrate the close relationship between soundscapes and the emotions they evoke. As such, SEP can have a significant impact on a series of application domains, such as sound design [18, 22], urban planning [3, 24], and acoustic ecology [4, 11], to name but a few.

Affective computing has received a lot of attention [9] in the last decades, with a special focus on the analysis of emotional speech, where a wide gamut of generative and discriminative classifiers has been employed [21, 28], and of music [7, 26], where most of the research concentrates on regression methods. The literature analyzing emotional responses to soundscape stimuli mainly comprises surveys asking listeners to characterize them. The work described in [1] details such a survey aiming to analyze soundscapes categorized as technological, natural or human. Davies et al. [3] provide a survey specifically designed to assess various emotional aspects of urban soundscapes. Another survey is described in [2], aiming to quantify the relationship between pleasantness and environmental conditions. Beyond surveys, the literature includes a limited number of methods for the automatic emotional labeling of soundscapes. Among those, Fan et al. [5] employed a support vector regression scheme fed with a wide range of handcrafted features to assess the emotional characteristics of six classes of soundscapes. In their follow-up work [6], the authors used both handcrafted features and deep nets, boosting the achieved performance. Another framework was developed in [10] based on the bag-of-frames approach, using handcrafted features and two support vector machines, one predicting the pleasantness and the other the eventfulness of 77 soundscapes.

The main limitations of the related literature are the extensive feature engineering, which heavily depends on domain knowledge, and the limited data availability. This work proposes a deep learning framework for the automatic assessment of the emotional content of soundscapes. Addressing the existing limitations, the framework's novel aspects are a) relaxing the handcrafted-features restriction, b) the introduction of two convolutional neural networks (ArNet and ValNet) carrying out prediction of arousal and valence of soundscapes, respectively, and c) the conceptualization and development of a between-sample learning scheme able to meaningfully augment the available feature space. The dataset includes soundscapes coming from six classes, i.e. a) natural, b) human, c) society, d) mechanical, e) quiet, and f) indicators, following Schafer's organization [20]. After a thorough experimental campaign, we analyze the performance boost offered by the between-sample learning scheme, while the reported results surpass the state of the art.

The rest of this paper is organized as follows: Section 2 analyzes the proposed between-samples learning paradigm including the entire pipeline. Section 3 presents the experimental set-up and results, while in Section 4, we draw our conclusions.

2 The between-samples learning paradigm

This section details the method used for predicting the emotion evoked by a soundscape. The proposed method, illustrated in Fig. 1, mixes sounds coming from multiple classes and complements the training set with the generated samples. Subsequently, the log-Mel spectrum is extracted and fed to a convolutional neural network carrying out modeling and emotional quantification.

Fig. 1 The proposed method starts with soundscape mixing, proceeds with feature extraction, and finally models the feature space using CNNs

In the following, we briefly analyze the feature set and the regression algorithm, while the emphasis is placed on the way learning is performed between the available samples.

2.1 Feature set

The present feature set is a simplification of the Mel-Frequency Cepstral Coefficients, where the final dimensionality-reduction step based on the discrete cosine transform is omitted [13, 14, 27]. To this end, we employed a triangular Mel-scale filterbank for extracting 23 log-energies. First, the audio signal is windowed and the short-time Fourier transform (STFT) is computed. The outcome of the STFT passes through the filterbank and the logarithm is computed to compress the dynamic range of the data. It is worth noting that the usage of such a standardized feature extraction mechanism removes the need to conceptualize and implement handcrafted features specifically designed to address the given problem.
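As an illustration, the following minimal sketch extracts such 23-band log-Mel energies with librosa. The sampling rate is an assumption for illustration, while the 30 ms window, 10 ms overlap, 512-point FFT and Hamming windowing follow the parameterization reported in Section 3.2.

```python
import numpy as np
import librosa


def log_mel_energies(path, sr=16000, n_mels=23, n_fft=512, win_ms=30, overlap_ms=10):
    """Compute log Mel-filterbank energies (the MFCC pipeline without the DCT step).

    The sampling rate is an illustrative assumption; window length, overlap,
    FFT size and number of Mel bands follow the values reported in the paper.
    """
    y, sr = librosa.load(path, sr=sr)
    win = int(sr * win_ms / 1000)                    # 30 ms analysis window
    hop = int(sr * (win_ms - overlap_ms) / 1000)     # consecutive windows overlap by 10 ms
    # STFT power spectrogram passed through a triangular Mel filterbank
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, win_length=win,
                                         hop_length=hop, window="hamming",
                                         n_mels=n_mels)
    # The logarithm compresses the dynamic range of the filterbank energies
    return np.log(mel + 1e-10)
```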

2.2 Convolutional neural network architecture

The structure of the proposed CNN was determined during early experimentation and is shown in Table 1. Starting from the standard multilayer perceptron model, a CNN includes simple but relevant modifications. Commonly, a CNN is composed of a number of stacked layers forming a deep topology. Here, we consider two convolutional layers, each one followed by a max-pooling operation. The convolutional layers organize the hidden units so that local structures in the 2-d plane are revealed and subsequently exploited. This is accomplished by connecting each hidden unit to only a small portion, the so-called receptive field, of the input space (e.g. 4 × 4 blocks). In essence, the weights of such units form filters (also called convolutional kernels) applied to the entire input plane, thus extracting a feature map. At this point we make the assumption that such locally extracted features are useful in other parts of the input plane, and so the same weights are applied across its entirety. This assumption is highly important since not only does it minimize the number of trainable parameters, but it also renders the network invariant to translational shifts of the input data [19]. The max-pooling layers carry out further dimensionality reduction by merging adjacent units and retaining their maximum value, a process which strengthens translation invariance. Rectified Linear Units (ReLUs) are employed, with the activation function being f(x) = max(0, x). ReLUs dominate the current literature as they tend to offer a) faster gradient propagation than conventional units (logistic sigmoid, hyperbolic tangent, etc.), b) biological plausibility, and c) an activation form characterized by high sparsity [8].

Table 1 The structure of ArNet and ValNet (# of parameters: 2,674,721)

The network is completed by a flattening layer, a dropout layer, and three densely connected layers responsible for the regression process. The dropout layer helps avoid overfitting caused by the large number of parameters that need to be estimated (2,674,721); it randomly removes 50% of the hidden units. In general, such an operation removes irrelevant relationships and ensures that the learned filters provide reliable modeling and, in the present study, reliable SEP.
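A minimal Keras sketch of such a topology is given below. The filter counts, kernel sizes, dense-layer widths, optimizer type and input dimensions are assumptions for illustration (the exact configuration, totaling 2,674,721 parameters, is given in Table 1), whereas the two convolution/max-pooling blocks, the 50% dropout, the three dense layers and the MSE objective with a 0.01 learning rate follow the description in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_net(input_shape=(23, 598, 1)):
    """Sketch of the ArNet/ValNet topology: two convolution + max-pooling blocks,
    flattening, 50% dropout and three dense layers ending in one regression output.

    Filter counts, kernel sizes and the time dimension of the input are
    illustrative assumptions; the exact values appear in Table 1 of the paper.
    """
    model = keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (4, 4), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (4, 4), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),                    # randomly drops 50% of the units
        layers.Dense(128, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="linear"),   # predicted arousal or valence
    ])
    # Optimizer type is an assumption; learning rate and loss follow the paper
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss="mse")
    return model
```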

2.3 Generating samples between the original ones

The dataset chosen in the current study follows the widely accepted Schafer soundscape taxonomy [5, 20], which is based on the referential meaning of environmental sounds. In Schafer's work, the grouping criterion is the identity of the sound source and the listening context, without taking audio features into account. Interestingly, the Emo-Soundscapes corpus described in [5] includes 600 clips equally distributed among the six classes proposed by Schafer, i.e. a) natural, b) human, c) society, d) mechanical, e) quiet, and f) sounds as indicators.

The second part of the Emo-Soundscapes corpus includes mixed soundscapes coming from these six classes. The initial idea was to study the emotional impact of sound design; however, this work shows that such mixed soundscapes are useful for predicting the emotions evoked by both single- and multi-class soundscapes. Each mix is designed to include content coming from either two or three audio clips. Interestingly, there is no restriction during class selection, meaning that a given mixture can include audio clips belonging to the same class. The duration of each clip is 6 seconds, which suffices for annotators to efficiently characterize its emotional content in terms of arousal and valence.
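For illustration, such a mixture can be thought of as a weighted sum of the constituent waveforms. The helper below is a hypothetical sketch in which the gain choices and peak normalization are assumptions; in practice, the mixed clips and their crowdsourced arousal/valence labels are taken directly from the corpus.

```python
import numpy as np


def mix_clips(clips, gains=None):
    """Hypothetical sketch of mixing two or three equal-length clips into a
    single between-samples example. Gains and normalization are assumptions;
    the Emo-Soundscapes mixtures and their labels are used as provided.
    """
    clips = [c / (np.max(np.abs(c)) + 1e-10) for c in clips]   # peak-normalize each clip
    if gains is None:
        gains = np.ones(len(clips)) / len(clips)               # equal gains by default
    mix = sum(g * c for g, c in zip(gains, clips))
    return mix / (np.max(np.abs(mix)) + 1e-10)                 # avoid clipping
```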

In mixed sounds, humans can quite effortlessly perceive the existence of more than one class, identify the one dominating the mixture, etc. Thus, a point placed within the limits of the entire feature distribution should have a meaningful semantic correspondence, while this is not necessarily true for points outside the distribution. Feature distributions characterizing mixed sounds are expected to be located between the distributions of the sounds composing the mixture. At the same time, the variance of the mixture is proportional to that of the original feature distributions, similarly to the classification setting described in [23].

Figure 2 demonstrates how the mixing affects the feature distributions. More specifically, two cases are shown: a) mixing of two classes (mechanical and human) and b) mixing of three classes (mechanical, human, and nature). In both cases, we observe that the vast majority of the principal components of the mixed feature vectors lie within those extracted from the single classes. Figure 3 demonstrates ArNet's intermediate activations, showing how single (top row), two-class mixture (middle row), and three-class mixture (bottom row) soundscapes are decomposed across the different filters learned by the network.
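The check underlying Fig. 2 can be sketched as follows: fit a PCA on the log-Mel feature vectors of the single-class clips and project the mixed-clip features onto the same components. The function and variable names below are assumptions for illustration.

```python
from sklearn.decomposition import PCA


def project_on_single_class_pcs(single_feats, mixed_feats, n_components=2):
    """Fit a 2-component PCA on single-class feature vectors and project the
    mixed-clip features onto the same axes, so that one can verify whether the
    mixtures fall within the single-class distributions (illustrative sketch)."""
    pca = PCA(n_components=n_components)
    single_2d = pca.fit_transform(single_feats)   # (n_single_vectors, 2)
    mixed_2d = pca.transform(mixed_feats)         # (n_mixed_vectors, 2)
    return single_2d, mixed_2d
```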

Fig. 2 The feature space demonstrating the cases of mixing samples coming from 2 (top) and 3 classes (bottom)

Fig. 3 ArNet's intermediate activations showing how single (top row), two-class mixture (middle row), and three-class mixture (bottom row) soundscapes are decomposed across the different filters learned by the network

3 Experimental set-up and results

This section includes details regarding the dataset, the parameterization of the proposed approach, and the performance analysis, as well as a comparison with the state of the art.

3.1 Dataset

Up until recently, there was a gap regarding datasets of emotionally-annotated soundscapes. The work presented in [5] covered this gap by designing and making available to the scientific community the Emo-Soundscapes dataset. It facilitates soundscape emotion recognition tasks through the study of single as well as mixed soundscapes. As mentioned in Section 2.3, the dataset follows Schafer's organization (human, nature, indicators, etc.) and includes 1213 6-second Creative Commons licensed audio clips. The annotation of the perceived emotion was carried out by means of a crowdsourcing listening experiment, in which both the valence and the arousal perceived by 1182 annotators from 74 different countries were recorded. Detailed information regarding the dataset and its annotation is available in [5].

3.2 Parameterization

The log-Mel spectrogram was extracted using 30 ms windows with 10 ms overlap. The sampled data are Hamming-windowed to smooth potential discontinuities, while the FFT size is 512. The CNN operates on the receptive field reported in Table 1 and uses ReLU activation functions, while two networks were trained to model valence and arousal respectively, i.e. ValNet and ArNet. The training process terminated after 100 epochs at a learning rate of 0.01. Each network is trained by minimizing the mean squared error.
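Putting the above together, a minimal training sketch could look as follows. The batch size and the variable names (x_logmel, y_arousal, y_valence) are assumptions, the 100 epochs and MSE objective are those stated above, and build_net refers to the hypothetical architecture sketch of Section 2.2.

```python
# Minimal training sketch (batch size and variable names are assumptions;
# build_net is the hypothetical factory from the architecture sketch above).
ar_net, val_net = build_net(), build_net()
ar_net.fit(x_logmel, y_arousal, epochs=100, batch_size=32, verbose=0)
val_net.fit(x_logmel, y_valence, epochs=100, batch_size=32, verbose=0)
```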

3.3 Results

We followed the experimental protocol described in [5], where the dataset is randomly divided n times into training and testing data with a ratio of 4:1 and n = 10. At each iteration there is no overlap between the training and testing sets, while the reported mean squared errors are averages over the n iterations.
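This protocol can be sketched as below, assuming feature and label arrays and a build_fn factory returning a freshly initialized network; these names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit


def evaluate(features, labels, build_fn, n_splits=10):
    """Average test MSE over n = 10 random 4:1 train/test splits, following the
    protocol of [5]. build_fn is a hypothetical factory returning a freshly
    initialized network (e.g. ArNet or ValNet)."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0)
    mses = []
    for train_idx, test_idx in splitter.split(features):
        model = build_fn()
        model.fit(features[train_idx], labels[train_idx], epochs=100, verbose=0)
        preds = model.predict(features[test_idx]).ravel()
        mses.append(float(np.mean((preds - labels[test_idx]) ** 2)))
    return float(np.mean(mses))
```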

The achieved results are summarized in Fig. 4. As we can see, the CNN trained only on the original single-class sounds surpasses the state-of-the-art MSE for valence prediction, while its arousal MSE lies at similar levels. However, the proposed method employing sound mixtures significantly outperforms the other methods at both valence and arousal prediction. The final MSEs are 0.0168 and 0.0107 for valence and arousal, respectively. The bottom part of Fig. 4 demonstrates how these results vary per Schafer's categorization. A similar behavior is observed for the majority of the classes. The best valence prediction is achieved for the quiet class, while the indicators class is the hardest to predict. The best arousal prediction is achieved for the nature class and the worst for the society class. In general, sounds coming from the society, mechanical, and indicators classes yield the highest MSEs, i.e. the worst performance, which may be due to their higher intra-class variability, as can also be assessed by a human listener. Following the analysis provided in Section 2.3, we see how the variance offered by the mixed samples boosts the network's prediction capabilities. Overall, the method based on learning between samples provides excellent results and surpasses the state of the art in the emotional quantification of soundscapes.

Fig. 4 The results achieved by the proposed approach and how they compare with the state of the art (top), as well as how they vary per Schafer's categorization (bottom)

4 Conclusions

This work presented a deep learning framework for SEP able to surpass the state of the art, which is based on handcrafted features and traditional machine learning algorithms. Interestingly, the accompanying module carrying out between-sample learning manages to significantly boost the prediction performance.

In the future, we wish to evaluate the usefulness and practicality of an emotional space formed not only by soundscapes but also incorporating generalized sound events, music, and speech. Such a jointly created space may offer improved prediction in multiple application domains. To this end, we intend to exploit transfer learning technologies [12], forming a synergistic framework able to incorporate and transfer knowledge coming from multiple domains, favoring diverse applications such as music information retrieval, bioacoustic signal processing, etc.