Behavior Research Methods, Volume 51, Issue 2, pp 532–555

Investigating the perception of soundscapes through acoustic scene simulation

  • G. Lafay
  • M. Rossignol
  • N. Misdariis
  • M. Lagrange
  • J.-F. Petiot


This paper introduces a new experimental protocol for studying mental representations of urban soundscapes through a simulation process. Subjects are asked to create a full soundscape by means of a dedicated software tool, coupled with a structured sound data set. This paradigm is used to characterize urban sound environment representations by analyzing the sound classes that were used to simulate the auditory scenes. A rating experiment of the soundscape pleasantness using a seven-point bipolar semantic scale is conducted to further refine the analysis of the simulated urban acoustic scenes. Results show that (1) a semantic characterization in terms of presence/absence of sound sources is an effective way to characterize urban soundscape pleasantness, and (2) acoustic pressure levels computed for specific sound sources better characterize the appraisal than the acoustic pressure level computed over the overall soundscape.


Keywords: Cognitive psychology, Soundscape perception, Soundscape simulator


One of the main goals of soundscape studies is to identify which components of the soundscape influence human perception, see Aletta et al., (2016). Establishing a link between an environment (urban sound scene) and the induced human sensation (calmness) is equivalent to instantiating the mental representation of a specific sound scene (a calm urban sound scene). Several means have been considered in the literature to perform this task.

First, a subject can be asked to assess a given sound following a given perceptual scale (e.g., calmness, pleasantness), see Axelsson et al., (2005), Davies et al., (2013), and Cain et al., (2013). The amount of information that can be gathered strongly depends on the nature of the available stimuli, whether they are recorded sound scenes presented in a laboratory experiment or sound environments experienced in situ. Working with actual stimuli makes it possible to analyze them, thus allowing the experimenter to gather a coarse-grained physical description of them.

Second, a subject can be asked to describe a given sound environment, see Guastavino (2006) and Dubois et al., (2006). A large amount of quantitative and semantic information is then collected about the subject’s representation of this type of sound environment. Unfortunately, without any reference to sound data, this representation can hardly be characterized physically.

We propose in this paper to consider the use of a soundscape simulator that the subject can employ to objectify his/her representation of a given sound environment. We believe that the use of such a device allows us to gain the benefits of the two above-mentioned approaches. As the subject is asked to produce audio data (the signal of the simulated scene), it allows the experimenter to study a precise modal version of the subject’s mental representation that is characterized both semantically (the nature of sound sources) and physically (e.g., the levels of sound sources).

We believe that the availability of a fine-grained description of the sound stimuli is of great interest, because recent studies demonstrate that not all sound sources contribute equally to the perception of the sound scene, see Defréville et al., (2004), Lavandier and Defréville (2006), Guastavino (2006), Nilsson (2007), and Szeremeta and Zannin (2009). Thus, much attention is given to the specific contributions of the different sources to the notion of emotional quality of the scene, see Gozalo et al., (2015) and Ricciardi et al., (2015).

For several reasons that are detailed in this paper, the use of a soundscape simulator such as the one proposed here can lead to interesting outcomes, as it allows the experimenter to separately study the influence of the sound sources that compose a sound environment. With such material, not only is the type of sound sources available for study, but also the exact sound level and audio waveform for each source together with the structural properties of the scene, that is, the temporal distribution of the sound events.

To demonstrate the potential of the proposed approach in its ability to question the human perception of sound environments, we study in this paper the notion of perceptual pleasantness of urban environmental scenes. Results and outcomes of a series of experiments that build upon the use of the simulator are studied in order to better comprehend how different sound sources typically present in an urban scene impact pleasantness:
  1. experiment 1.a, simulation: the subjects use the soundscape simulator to produce ideal/non-ideal soundscapes that are considered as material for the following experiments (the resulting waveforms are available for download);

  2. experiment 1.b, pleasantness evaluation: the subjects judge the pleasantness of the simulated scenes on a semantic scale;

  3. experiment 2, pleasantness evaluation after modification of the scenes: the subjects judge the pleasantness of the simulated scenes on a semantic scale as in 1.b, but some scenes are modified beforehand, i.e., some specific sound classes that are identified as having a significant impact on perceived pleasantness are removed.


To the best of our knowledge, only Bruce et al., (2009) and Bruce and Davies (2014) considered the use of a simulator to investigate soundscape perception. They propose a tool that allows the user to modify a given soundscape by adding or removing specific sound sources and by changing the acoustic level of the sources as well as their spatial location. The authors show that the addition or removal of the sources globally follows social or semantic considerations more than their acoustical characteristics. A lack of diversity in terms of sound sources is nonetheless mentioned by the authors as a factor limiting the strength of the outcomes of this study.

In our approach, the simulator developed for this study only yields a monophonic representation of the scene, but that simplification comes with the benefit of a wider range of available sound sources and scheduling parameters in order to provide outputs that are both expressive and useful for analysis.

The remainder of this paper is organized as follows: the soundscape simulator simScene is introduced in Section “The simulator”. Experiments 1.a simulation and 1.b pleasantness evaluation are presented in “Experiment 1”. Experiment 2 pleasantness evaluation after modification of the scenes is presented in “Experiment 2: Modification of the semantic content”. Conclusions and discussion about future work follow in “Outcomes for soundscape perception”.

The simulator

Simscene is an online digital audio tool whose first version has been developed as part of the HOULE project. It has been designed to run on the popular Web browsers Chrome and Firefox. It is fully written in JavaScript using the angular.js library and the Web Audio standard that allows the manipulation of digital audio data within the browser. The interface for selecting sound sources (cf. “Selection interface”) uses the popular D3.js visualization library proposed by Bostock et al., (2011).

Simscene is designed as a simplified audio sequencer, with sequencing parameters specifically chosen for the generation of realistic soundscapes. To do so, the user first selects a sound source using a non-verbal selection interface presented in “Selection interface”. A track is then created for this sound source within the simulator interface. The user can then manipulate some parameters detailed in “Parameters” to control the time and magnitude distribution of the occurrences of the sound sources. Text fields are also available for the user to (1) name each track, (2) name the entire scene, (3) provide free comments about the simulated scene.

Sound database

In order to provide the user with a sound database that is well organized and covers as much as possible the variety of sound sources that are present in urban areas, a typology of urban sounds is first defined.


The chosen typology is established based on the categories and classes of sounds found while reviewing several articles or thesis manuscripts that study how humans discriminate different kinds of urban soundscapes, see Leobon (1986), Maffiolo (1999), Raimbault (2002), Guastavino (2003), Defréville et al., (2004), Beaumont et al., (2004), Raimbault and Dubois (2005), Dubois et al., (2006), Devergie (2006), Guastavino (2006), Polack et al., (2008), Niessen et al., (2010), and Brown et al., (2011).

We choose not to include any musical content in the sound database, as the study of the pleasantness of a given style or genre of music is beyond the scope of this study.

Events and Textures

A commonly accepted distinction consists in separating:
  • sound events: isolated sounds, precisely located in time, whose acoustical characteristics may change with respect to time;

  • sound textures: isolated sounds of long duration whose acoustical characteristics are stable with respect to time, see Saint-Arnaud (1995).

It indeed appears that the processing of auditory information comprises some sort of decision concerning the nature of the stimuli.

Maffiolo (1999) distinguishes two separate categorization processes, either of which is triggered depending on the listener’s ability to identify sound events. She shows that those processes lead to two abstract cognitive categories, respectively termed “event sequences” and “amorphous sequences”. Event sequences are composed of salient events that can easily be recognized, such as a car starting or male speech. They arise from a descriptive analysis based on the identification of the sound sources. In contrast, amorphous sequences are sound environments where distinct events cannot be readily identified, such as traffic hubbub. They result from a holistic analysis based on global acoustic features.

Concerning sound textures, i.e., sounds that have stable characteristics over time, McDermott and Simoncelli (2011) and McDermott et al., (2013) demonstrate that the human brain can opt for an abstract, statistical representation of the perceived information, discarding precise physical properties of the sound.

In order to account for the possibility that the morphological differences between sound events and sound textures may have some important consequences on the perception of the scenes, the simulation tool Simscene follows this distinction and considers two distinct sound databases: one with classes of events only and the other with classes of textures only. These two types of sound classes also have specific scheduling procedures during the simulation process.

We consider in this study the soundscape as a “skeleton of events on a bed of textures” as coined by Nelken (2013).


We call “sample” a recording of an isolated sound, be it an event or a texture. Each sound class is implemented as a collection of samples judged to be perceptually equivalent.

The sound classes are organized hierarchically (cf. Fig. 1) according to a structure similar to the vertical axis of the categorical organization proposed by Rosch and Lloyd (1978). The lower the level of abstraction, the more precise the description of the class and the more perceptually similar the sound sources. For classes with a high level of abstraction that have sub classes, their collection of samples is the union of the collections of the sub classes.
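The union rule for abstract classes can be illustrated with a small recursive structure. This is only a sketch of the idea: the `SoundClass` type and the class and file names below are hypothetical and do not come from the simulator's code base.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SoundClass:
    """A taxonomy node: leaves hold samples, abstract classes hold sub-classes."""
    name: str
    samples: Set[str] = field(default_factory=set)
    children: List["SoundClass"] = field(default_factory=list)

    def collection(self) -> Set[str]:
        """Return this class's own samples plus the union of all sub-class collections."""
        result = set(self.samples)
        for child in self.children:
            result |= child.collection()
        return result


# Hypothetical class and file names, for illustration only.
car = SoundClass("car", samples={"car_01.wav", "car_02.wav"})
bus = SoundClass("bus", samples={"bus_01.wav"})
traffic = SoundClass("traffic", children=[car, bus])
```

Calling `traffic.collection()` gathers the three samples of its two sub-classes, whereas `car.collection()` only returns the two car samples.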
Fig. 1. Hierarchical organization of the isolated sounds used in the simulation

Accounting for the previously detailed perceptual matters, two taxonomies are built, one for sound events and one for sound textures. Up to four levels of abstraction are considered from the most generic classes (level 0) to the most specific classes (level 3), leading to a taxonomy close to the one used in Salamon et al., (2014), see Appendix A. Only three levels of abstraction are considered for the texture sounds, see Appendix Fig. 11.

Sound samples collection

A total of 483 isolated sounds were collected and organized with the two typologies discussed above, 381 events and 102 textures. Among those samples, 332 have been recorded and 151 have been taken from two sound libraries: SoundIdeas and Universal SoundBank.

Original sounds have been recorded using an AT8035 shotgun microphone plugged into a ZOOM H4n recorder. The use of such a microphone allowed us to isolate as much as possible sound events of interest from the urban background. It also allowed us to avoid dominant sound sources while recording texture sounds by targeting distant areas with no dominant sound sources.

All samples are normalized to the same RMS level of \(-12\) dB FS, i.e., relative to full scale. In our case, the full-scale level is set arbitrarily to 1 V.
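As an illustration, RMS normalization to a target level in dB FS can be sketched as follows. This is a minimal example under the assumption that a sample is a plain list of floating-point values with full scale equal to 1.0; the function names are hypothetical, and clipping of peaks after the gain is applied is not handled.

```python
import math


def rms(samples):
    """Root mean square of a sequence of sample values."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def rms_dbfs(samples):
    """RMS level in dB relative to full scale (full scale = 1.0)."""
    return 20.0 * math.log10(rms(samples))


def normalize_rms(samples, target_dbfs=-12.0):
    """Scale the samples so that their RMS level equals target_dbfs."""
    current = rms(samples)
    if current == 0.0:
        raise ValueError("cannot normalize a silent signal")
    gain = 10.0 ** (target_dbfs / 20.0) / current
    return [x * gain for x in samples]
```

A target of −12 dB FS corresponds to a linear RMS value of about 0.25 when full scale is 1.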


Parameters

By a deliberate design choice, the simulation tool does not allow the user to interact with or directly control a specific sample. Interaction is done at the track level, a track being a sequence of samples. Several parameters are available to the subject to control the track:
  • sound level (dB): for each sample, the sound levels are drawn randomly following a normal distribution parameterized by the subject in terms of mean value and variance;

  • inter-onset spacing (seconds): for event tracks only, and for each sound event sample, the inter-onset spacings are drawn randomly following a normal distribution parameterized by the subject in terms of mean value and variance;

  • start and end time (seconds): the subject sets the start and end times between which the texture or sequence of repeated events occurs.

To improve simulation quality, two parameters are also proposed:
  • event fades (seconds): for the event tracks only, the subject can set a fade in/fade out duration applied to each sample;

  • global fades (seconds): the subject can set global fade in and fade out durations applied to the entire track.

Texture samples are sequenced without time spacing; therefore, the event fade and inter-onset spacing parameters are not available for this kind of track.
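The way these parameters drive an event track can be sketched as below. This is not the simulator's implementation (which is written in JavaScript): the function name, the clamping of non-positive spacings, and the placement of the first event at the start time are assumptions made for illustration. Fades and sample selection are left out.

```python
import random


def schedule_event_track(start, end, spacing_mean, spacing_var,
                         level_mean, level_var, seed=None):
    """Draw (onset_time_s, level_db) pairs for one event track.

    Levels and inter-onset spacings follow normal distributions
    parameterized by their mean and variance.
    """
    rng = random.Random(seed)
    events = []
    onset = start  # assumption: the first event starts at the track start time
    while onset < end:
        level = rng.gauss(level_mean, level_var ** 0.5)
        events.append((onset, level))
        spacing = rng.gauss(spacing_mean, spacing_var ** 0.5)
        # Assumption: clamp very small or negative draws so time always advances.
        onset += max(spacing, 0.01)
    return events
```

For example, `schedule_event_track(0.0, 60.0, 5.0, 1.0, -20.0, 4.0, seed=0)` draws one event roughly every 5 s over a 1-min scene, with levels scattered around −20 dB.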

Selection interface

Once the typology and the set of sounds are available, an important design issue is the need for a suitable way to display the sound dataset to the user. Most browsing tools are based on keyword indexing; however, for sensory experiments that study the objectivation of a subject’s mental representations, this may be problematic as the availability of a verbal description of the sound can influence the subject’s choice, and potentially induce biases in the analysis conclusions. For example, a subject may automatically select sounds referenced as belonging to a park environment to build a calm soundscape, rather than focusing on their perception.

Therefore, the selection interface considered in this study is text-free and designed so as to force the user to rely on listening.

Figure 2a shows the interface used for the selection of events. Each circle corresponds to a sound class, with the lowest level of abstraction (leaves) colored in grey. The spatial location of those circles is chosen according to the hierarchical organization of the sound database: sub-classes belonging to the same class are close to each other, and so on until the user reaches the leaf classes, which are directly linked to a collection of samples.
Fig. 2. SimScene graphical interfaces for the selection of sound classes (a) and their sequencing (b)

Each of those classes has a representative sound chosen arbitrarily by the authors in order to provide the same sound each time the user clicks on the circle. The subject can browse the database by listening to those prototype sounds. The efficiency of this interface compared to several other designs has been evaluated and the outcomes are discussed by Lafay et al., (2016).

Simulation interface

As shown in Fig. 2b, the simulation interface displays a schematic of the scene under creation. Each track is represented as a horizontal strip with a temporal axis. Each sample of this track is displayed as a rectangle whose height is proportional to the amplitude of the sample. For event tracks, the horizontal spacing between those rectangles is a function of the time delay between their onsets. For texture tracks, a single rectangle is displayed, as this kind of sound does not allow spacing with silence. As the actual amplitude and spacing values are drawn from random variables, each time the subject changes the value of a parameter, the location and height of the rectangles are updated to reflect the changes in the sequencing of the samples. The subject can listen to the resulting sound scene at any time.

As such, the underlying model of the scene is a sum of sound sources. The simulation interface is more thoroughly described by Rossignol et al., (2015).

Experiment 1


Experiment 1 aims at using a simulation paradigm to investigate the specific influences of the various sound sources constituting urban soundscapes on the perceived pleasantness. For that, the first two experiments are planned as follows (cf. Figure 3):
  • experiment 1.a (simulation): during this experiment, subjects are asked to create simulated urban sound environments using Simscene (see “The simulator”). Each of them has to create two sound environments: one ideal/pleasant, and the other non-ideal/unpleasant.

  • experiment 1.b (evaluation): after the simulation phase, only binary information about the pleasantness of each scene is available (ideal or non-ideal), and this information is given by the creator of the scene. The second experimental step aims at refining and broadening our knowledge of the pleasantness of the simulated scenes. For that, a second group of subjects is asked to evaluate the pleasantness of each scene produced during experiment 1.a on a semantic scale. This experiment has two goals:
    1. to evaluate more precisely the respective influence of the various sources composing the scenes on the pleasantness (ideal or non-ideal) thanks to a finer quantification of the pleasantness of the scene;

    2. to detect the presence of outliers or ambiguous scenes. Indeed, throughout our analyses, the predefined hedonic properties of the scenes (ideal or non-ideal) are used as reference. We thus need to ensure beforehand that no ambiguity exists between extreme cases of ideal and non-ideal scenes, i.e., that the least pleasant ideal scene remains more highly rated on average than the most pleasant non-ideal scene.

Fig. 3. Experimental protocol of the simulation experiment (1.a) and pleasantness evaluation experiment (1.b)

The data collected by these two experiments (1.a and 1.b) are analyzed jointly.

Experiment 1.a: Methods

The design of this experiment has been validated with a pilot study described by Lafay et al., (2014).


The subjects are asked to simulate two urban sound environments of one minute each, following these instructions:
  • first simulation: create a plausible urban soundscape which is ideal, according to you (where you would like to live);

  • second simulation: create a plausible urban soundscape which is non-ideal, according to you (where you would not like to live);

All the subjects start by designing the ideal environment; they read the second set of instructions at the end of the first experiment. Subjects are completely free in their choice of sounds and synthesis parameters. The created sound environments must nevertheless fulfill the two following constraints:
  • the listening point of view is that of a fixed listener;

  • the soundscape must be realistic, i.e., physically plausible. For instance, subjects are free to insert ten dogs in the soundscape but they cannot insert one dog barking every 10 ms.

These constraints are stated in the instructions; the simulation interface does not enforce them a priori.

Each simulation process is decomposed into several steps:
  1. Simulation, where the user is asked to:
    • select sound classes,

    • give each of them a name,

    • set the parameters of the tracks related to the selected sound classes, see “Parameters”.

  2. Feedback: writing a free-form comment about the composed soundscape.

In addition, once the two sound scenes are completed, the subjects are invited to:
  • point out the sound sources that were missing for the composition;

  • give a comment about the ergonomics of the simulation environment;

  • give a comment about the ergonomics of the selection tool.

Before starting the first simulation, a 20-min tutorial is given in order to familiarize the subjects with the simulation interface and the sound database. The experiment is planned to last about 2.5 h, including breaks that the subjects are allowed to take.


All the subjects performed the experiment on standard desktop computers with the same hardware and software configurations. The audio files were played in diotic conditions using headphones. During the tutorial, subjects were asked to adjust the sound volume to a comfortable level. Once set, they were not allowed to modify it during the remainder of the experiment.

All the subjects performed the experiment at the same time. They were equally distributed in three identical quiet rooms, and were not allowed to talk to each other during the experiment.

Three experimenters (one in each room) were available during the whole duration of the experiment in order to assist subjects with potential hardware and software issues, and to answer questions.


Forty-four students (30 male, 14 female; averaging 21.6 years of age, SD of 2.0 years) from Ecole Centrale de Nantes (a French engineering school) performed the experiment. All the subjects had been living in Nantes, France, for at least 2 years at the time of the experiment and reported normal hearing.

Among the 44 subjects, 40 succeeded, producing in the end 80 simulated sound scenes (40 ideal scenes, 40 non-ideal scenes). The four other subjects were excluded from the process due to a lack of understanding of the instructions or failure to respect them, or for exceeding the maximum duration allowed to perform the experiment. The software platform used for the experiment, the parametrization of the software platform for each generated scene, as well as a two-dimensional projection of the resulting scenes are available online.

Experiment 1.b: Methods


The subjects evaluate the 80 simulated scenes generated in experiment 1.a. Due to temporal constraints, subjects only assess 30 s of the initial 1-min simulated scenes (from 15 s to 45 s).

The assessment is done with a seven-point bipolar semantic scale going from −3 (non-ideal/unpleasant) to +3 (ideal/pleasant). Before evaluating a scene, the subjects must listen to at least the first 20 s of the stimulus. After the evaluation, they are free to continue to the next scene.

For each participant, sound scenes are played in a quasi-random order. Five ideal scenes and five non-ideal scenes are first sequenced to allow the subjects to calibrate their scores. These first ten scenes are played back again at the end of the experiment, and only these last evaluations are taken into account for them. Each participant evaluates all the sound scenes.
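A quasi-random presentation order of this kind could be generated as sketched below. The text does not specify how the ten calibration scenes are chosen or in which order they are replayed, so the random choice and the identical replay order used here are assumptions, as are the function and parameter names.

```python
import random


def presentation_order(ideal_ids, non_ideal_ids, n_calibration=5, seed=None):
    """Build one participant's playlist: calibration scenes, the rest, calibration again."""
    rng = random.Random(seed)
    ideal = list(ideal_ids)
    non_ideal = list(non_ideal_ids)
    rng.shuffle(ideal)
    rng.shuffle(non_ideal)
    # Assumption: calibration scenes are picked at random, five per scene type.
    calibration = ideal[:n_calibration] + non_ideal[:n_calibration]
    rng.shuffle(calibration)
    remaining = ideal[n_calibration:] + non_ideal[n_calibration:]
    rng.shuffle(remaining)
    # Calibration scenes are heard twice; only their final rating would be kept.
    return calibration + remaining + calibration
```

With 40 ideal and 40 non-ideal scenes, each participant hears 90 excerpts, of which 80 ratings are retained.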

The experiment is planned to last 30 min. The subjects do not know anything beforehand about the nature of the sound scenes.


All the subjects performed the experiment on standard desktop computers with the same hardware and software configurations. The audio files were played in diotic conditions through Beyerdynamic DT 990 Pro semi-open headphones. The stimuli were the scenes obtained in experiment 1.a. The output sound level was the same for all the subjects.

All the subjects performed the experiment simultaneously in a quiet environment. They were not allowed to talk to each other during the experiment.

An experimenter was available during the whole duration of the experiment in order to assist subjects and to answer questions.


Ten students (eight male, two female; averaging 23.1 years of age, SD of 1.8 years) from Ecole Centrale de Nantes performed the experiment. All the subjects had been living in Nantes, France, for at least 2 years at the time of the experiment and reported normal hearing. None of them took part in the previous simulation experiment (experiment 1.a).

All of the subjects succeeded in doing the experiment.

Data and statistical analysis

Table 1. Abbreviations of features used in the analysis of the experiments

Feature | Abbreviation
Sound level | \(L\)
Sound level (events) | \(L(E)\)
Sound level (textures) | \(L(T)\)
Average pleasantness (per scene) | \(\mathcal {A}_{scene}\)
Average pleasantness (per subject) | \(\mathcal {A}_{subject}\)

A set of features, upon which the analysis is conducted, is attached to each sound scene. A summary of those features (and the corresponding abbreviations) is presented in Table 1. In order to be consistent with the evaluation of experiment 1.b, features are not computed on the whole duration of the sequences but only on their 30-s reduced version used as stimuli for experiment 1.b (“Experiment 1.b: Methods”).

For each sound scene, three types of features are considered:
  • perceptual features: the perceived pleasantness of the composed scene, assessed on a seven-point bipolar semantic scale. \(\mathcal {A}_{scene}\) is the average pleasantness for each scene, computed as the average of all the scores given by all the subjects to a specific scene. \(\mathcal {A}_{subject}\) is the average pleasantness for each subject, computed as the average of all the scores given by a specific subject to all the scenes. \(\mathcal {A}_{subject}\) is computed for ideal scenes and non-ideal scenes separately. Given the low number of subjects in experiment 1.b, we choose not to normalize the pleasantness scores.

  • semantic features: we use a Boolean vector \(S = (x_{1}, x_{2}, \ldots , x_{n})\) that indicates the classes of sounds involved in the scene, i.e., the sound classes that are present in or absent from the scene. Each Boolean x of this vector corresponds to a specific class of sounds: \(x = 1\) if the class is present in the scene, and \(x = 0\) otherwise. The vector dimension (n) depends on the level of abstraction that is considered for the analysis. For instance, abstraction level 1 includes 44 classes of sounds, so the dimension is 44 (n = 44).

  • structural features: while SimScene allows us to access a variety of information about the scene structure (such as the density of events), we only focus in this first study on the sound levels. To compute them, we draw inspiration from the \(L_{Aeq}\) measure. In our case, the level is a scalar computed from the signal (in volts rather than pascals) and converted to decibels, with a reference of 1 V (full scale). The signal is first A-weighted; the quadratic mean of the filtered signal is then computed every second, and the results are averaged over the total duration of the scene. We denote by L, \(L(E)\) and \(L(T)\) the levels computed by considering, respectively, the whole set of samples, only the set of event samples, and only the set of texture samples.
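The level computation described for the structural features can be sketched as follows. Two assumptions are made here: the A-weighting stage is omitted (the input is taken to be already filtered), and the per-second values that are averaged are mean-square energies, as in a standard equivalent level. The function name and arguments are hypothetical.

```python
import math


def equivalent_level_db(signal, sample_rate, reference=1.0):
    """Average the per-second mean-square values of the signal and return dB re `reference`.

    `signal` is assumed to be already A-weighted. Trailing samples that do not
    fill a whole second are ignored; an all-zero signal raises ValueError.
    """
    frame = int(sample_rate)
    mean_squares = []
    for start in range(0, len(signal) - frame + 1, frame):
        chunk = signal[start:start + frame]
        mean_squares.append(sum(x * x for x in chunk) / frame)
    if not mean_squares:
        raise ValueError("signal shorter than one second")
    energy = sum(mean_squares) / len(mean_squares)
    return 10.0 * math.log10(energy / reference ** 2)
```

Applying the same function to the mix of all tracks, to the event tracks only, or to the texture tracks only would give counterparts of \(L\), \(L(E)\) and \(L(T)\).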

In order to evaluate the specific impact of the various sound sources on the perceived pleasantness, we run the data through the five following significance tests:
  • Analysis of perceived pleasantness: the goal is to evaluate whether the perceived pleasantness is in accordance with the pleasantness label given by the creators of the ideal and non-ideal scenes during experiment 1.a. To do so, we test whether there are significant differences between the ideal and non-ideal scenes for \(\mathcal {A}_{scene}\) and \(\mathcal {A}_{subject}\). The significance is evaluated by a two-sample Student’s t-test for \(\mathcal {A}_{scene}\) and by a paired-sample Student’s t-test for \(\mathcal {A}_{subject}\).

  • Analysis of sound levels: the goal is to evaluate whether the sound levels (L, \(L(E)\) and \(L(T)\)) differ between the ideal and non-ideal scenes. The significance is measured with a two-sample Student’s t-test.

  • Influence of sound levels on perceived pleasantness: the goal is to evaluate whether the sound levels (L, \(L(E)\) and \(L(T)\)) affect the perceived pleasantness. To do so, we consider linear correlations between those features and \(\mathcal {A}_{scene}\). The Pearson correlation coefficient is used for that purpose.

  • Analysis of semantic features: the goal is to evaluate whether specific classes are more frequently used in a given type of environment (ideal or non-ideal). To do so, a V-test is considered, see Rakotomalala and Morineau (2008). With c being the total number of classes used to simulate all the scenes, \(c_{k}\) the number of classes used to simulate the scenes of a given type of environment k (ideal or non-ideal), \(c_{j}\) the number of times a class j has been used to simulate all the scenes, and \(c_{jk}\) the number of times a class j has been used to simulate the scenes of a given type of environment k, the V-test evaluates the null hypothesis that the ratio \(\frac {c_{j}}{c}\) is not significantly different from the ratio \(\frac {c_{jk}}{c_{k}}\), i.e., that class j is used as often in environment k as in the whole corpus. For each class j, and each environment type k, an approximation of the statistical value \(V_{jk}\) is computed as follows:
    $$ V_{jk}=\frac{c_{jk}-c_{k}\frac{c_{j}}{c}}{\sqrt{c_{k}\frac{c-c_{k}}{c-1}\frac{c_{j}}{c}(1-\frac{c_{j}}{c})}} $$

    If the null hypothesis is rejected, the class j is said to be typical with respect to the type of environment k. Such typical classes are called sound markers, in reference to the work of Schafer (1993). Testing is done for each class, at each level of abstraction, and separately for texture and event classes.

  • Representation space induced by the semantic features: the goal is to determine if a representation space of the scenes solely based on the presence/absence of sound sources allows us to distinguish between the two types of scenes. Denoting as \(S_{i}\) the semantic features of scene i, we compute the distances between all \(S_{i}\) vectors. A Hamming distance is used: considering two n-dimension vectors \(S_{1}=(x_{1,1},x_{1,2},\ldots ,x_{1,n})\) and \(S_{2}=(x_{2,1},x_{2,2},\ldots ,x_{2,n})\), with \(x \in \lbrace 0,1\rbrace \), the Hamming distance \(d_{ham}\) measures the proportion of coordinates that differ between the two vectors. It is defined as follows:
    $$ d_{ham}(S_{1},S_{2})=\frac{1}{n}\sum\limits_{i = 1}^{n} (x_{1,i} \bigoplus x_{2,i}) $$
    where \(\bigoplus \) is the exclusive-or operator. Two scenes having similar source compositions will be close in such a space. Using the Hamming distance allows us to take into account equally the presence and absence of classes. In order to measure the intrinsic ability of the space to discriminate between ideal and non-ideal scenes, we use a ranking metric named the precision at rank k (P@k). The \(P@k\) computes the precision obtained after the k closest items to a given seed item have been found. Formally, for each scene \(s_{i}\) (considered as seed), we compute the proportion of \(s_{j}\) scenes in the k nearest neighbors of \(s_{i}\) that share the same label as \(s_{i}\). The \(P@k\) is then the average of this ratio for all the items considered as search seeds.
  • Influence of the sound markers on the perceived pleasantness: in order to assess the specific contributions of some sound sources, we again estimate the impact of the sound levels on the perceived pleasantness by taking into account only the sound markers for the computation of those features.
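The Hamming distance and the precision at rank k defined in the list above can be sketched in a few lines. The function names are hypothetical, and ties between equally distant scenes are broken by index order, which is an arbitrary choice not specified in the text.

```python
def hamming(s1, s2):
    """Proportion of coordinates that differ between two equal-length Boolean vectors."""
    if len(s1) != len(s2):
        raise ValueError("vectors must have the same dimension")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def precision_at_k(vectors, labels, k):
    """Mean, over all seed scenes, of the share of the k nearest scenes sharing the seed's label.

    Requires 1 <= k <= len(vectors) - 1.
    """
    total = 0.0
    for i, seed in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        # Stable sort: ties keep index order (arbitrary tie-breaking rule).
        others.sort(key=lambda j: hamming(seed, vectors[j]))
        neighbours = others[:k]
        total += sum(labels[j] == labels[i] for j in neighbours) / k
    return total / len(vectors)
```

A \(P@k\) close to 1 means that scenes of the same type cluster together in the presence/absence space; with two balanced types, a value near 0.5 would indicate no separation.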

All statistical significance tests are conducted with a critical threshold of \(\alpha = 0.05\). For the V-test, considering that a large number of classes is tested, a Bonferroni correction is applied. For the p value, if \(p\geq 0.05\), the value is reported; if \(0.01\leq p<0.05\), we only report \(p<0.05\), otherwise we report \(p<0.01\).
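For concreteness, the V-test statistic and its Bonferroni-corrected decision can be sketched as below. The statistic follows the formula for \(V_{jk}\) given above; comparing it to a standard normal distribution with a two-sided p value is an assumption of this sketch, as are the function names. The statistic is undefined when \(c_{j} = c\) or \(c_{k} = c\).

```python
import math


def v_test(c_jk, c_j, c_k, c):
    """V-test statistic for class j in environment type k."""
    expected = c_k * c_j / c
    variance = c_k * (c - c_k) / (c - 1) * (c_j / c) * (1 - c_j / c)
    return (c_jk - expected) / math.sqrt(variance)


def two_sided_p(v):
    """Two-sided p value, assuming V follows a standard normal law under the null hypothesis."""
    return math.erfc(abs(v) / math.sqrt(2.0))


def is_sound_marker(c_jk, c_j, c_k, c, n_tests, alpha=0.05):
    """Reject the null hypothesis at level alpha after Bonferroni correction over n_tests classes."""
    return two_sided_p(v_test(c_jk, c_j, c_k, c)) < alpha / n_tests
```

With 44 classes tested at \(\alpha = 0.05\), the corrected per-class threshold is about 0.0011.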
Table 2

Linear correlation coefficients computed between mean perceived pleasantness \(\mathcal {A}_{scene}\) of experiment 1.b and sound levels. Statistically significant results shown in bold

| Sound level | All scenes | Ideal scenes | Non-ideal scenes |
| --- | --- | --- | --- |
| \(L\) | **−0.77 (p < 0.01)** | −0.32 (p = 0.06) | **−0.78 (p < 0.01)** |
| \(L(E)\) | **−0.75 (p < 0.01)** | −0.20 (p = 0.24) | **−0.75 (p < 0.01)** |
| \(L(T)\) | **−0.53 (p < 0.01)** | −0.33 (p = 0.05) | −0.00 (p = 0.99) |


Analysis of perceived pleasantness

First, in order to ensure the coherence of the data, we check that none of the non-ideal scenes gets a \(\mathcal {A}_{scene}\) higher than that of an ideal scene. Four non-ideal scenes do not fulfill that constraint: they are thus removed, together with their corresponding ideal scenes. As a consequence, 36 ideal scenes and 36 non-ideal scenes remain for analysis. Second, we verify that subjects really perceived a difference in terms of pleasantness between ideal and non-ideal scenes. To that end, we examine the mean pleasantness score of each participant, \(\mathcal {A}_{subject}\), computed separately for each type of environment. The ideal scenes were indeed perceived as significantly more pleasant (p < 0.01) than the non-ideal scenes.

Analysis of sound levels

First, our analysis focuses on the sound levels. Figure 4a, b, and c respectively depict the distributions of levels L, \(L(E)\) and \(L(T)\). There is a significant difference in terms of sound levels between ideal and non-ideal scenes (L: \(p<0.01\), mean deviation: −7 dB). This difference is significant for events (L(E): \(p<0.01\), mean deviation: −7 dB) and for textures (L(T): \(p<0.01\), mean deviation: −6 dB).
Fig. 4

Distributions of the sound levels L (a, d), L(E) (b, e) and L(T) (c, f), with respect to scene type (i: ideal, ni: non-ideal) (a, b, c) and perceived pleasantness \(\mathcal {A}_{scene}\) of experiment 1.b (d, e, f)

As expected, the sound level of the sources is indeed a pleasantness indicator, as the non-ideal scenes tend to be louder. This result is also an outcome of a large number of related studies. We also notice that this difference of sound levels is significant for both events and textures.

It appears that the biggest influence on the global sound levels comes from the events, the difference between L and \(L(E)\) being only 1 dB between ideal and non-ideal scenes. This observation is in agreement with the results obtained by Kuwano et al., (2003). During their experiment, the authors ask their subjects to assess a set of soundscapes at a global level and then to make the same judgment at the moment they detect a sound source. The study shows that there is no significant difference between global and averaged instantaneous judgments. In our case, the result can be interpreted as if the subjects had unconsciously integrated this perceptual reality when composing the scenes, by allocating most of the global sound level to well-identified and relatively short sounds, i.e., the events.

However, sound level alone is not sufficient to fully differentiate between ideal and non-ideal scenes. In fact, \(20\%\) of the ideal scenes have sound levels higher than the lowest level of the non-ideal scenes, while there is no overlap when considering the perceived pleasantness \(\mathcal {A}_{scene}\).

Influence of sound levels on the perceived pleasantness

In this section, the relationships between sound levels and perceived pleasantness are investigated in more detail. Contrary to the previous test, we do not limit ourselves to a binary distinction between ideal and non-ideal scenes: we consider here the mean pleasantness \(\mathcal {A}_{scene}\) as the perceptual feature. The aim is to investigate the level of correlation between sound levels and \(\mathcal {A}_{scene}\). The linear correlation coefficients computed between \(\mathcal {A}_{scene}\) and L, \(L(E)\), \(L(T)\) are shown in Table 2. Relationships between \(\mathcal {A}_{scene}\) and the sound levels are depicted in Fig. 4d, e, and f.

Concerning L, a strong negative correlation with \(\mathcal {A}_{scene}\) is measured (r = −0.77, \(p<0.01\)), indicating that the higher the sound level is, the more unpleasant the scene is perceived to be. However, Fig. 4d suggests that this relationship does not occur in the same way for ideal and non-ideal scenes. In fact, the correlation between L and \(\mathcal {A}_{scene}\) remains high for non-ideal scenes (r = −0.78, \(p<0.01\)), but is not significant (r = −0.32, \(p = 0.06\)) for ideal scenes.

When considering the whole set of scenes, the fact that the level is indeed a good indicator of pleasantness can be explained by the fact that the ideal scenes tend to be softer than the non-ideal scenes, thus allowing us to extend artificially to the ideal scenes the negative correlation observed for the non-ideal scenes.

We thus conclude that L:
  • allows us to differentiate between ideal and non-ideal scenes,

  • characterizes precisely the perceived pleasantness for non-ideal scenes (unsurprisingly, an unpleasant scene gets all the more unpleasant as it gets louder),

  • is not a relevant feature for modeling the perceived pleasantness of an a priori pleasant soundscape (ideal scene).

The same conclusions can be drawn for \(L(E)\), see Fig. 4e. For \(L(T)\), as shown in Fig. 4f, the moderate correlation observed for the whole set of scenes (r = −0.53, \(p<0.01\)) disappears when the two types of scenes are considered separately (ideal scenes: \(r=-0.33\), \(p = 0.05\), non-ideal scenes: \(r=-0.00\), \(p = 0.99\)). Again, we believe that the negative correlation observed over the whole set is an artifact due to the level difference between the two types of scenes (ideal scenes tend to be softer than non-ideal scenes). Thus, while sound event levels retain some ability to predict the pleasantness of the non-ideal scenes, texture levels do not provide much information, regardless of the environment.
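The artifact invoked here, where a correlation over the pooled set reflects only the offset between two groups, can be illustrated with a toy computation. The sketch below uses invented levels and pleasantness scores (not the study data) and a hand-written Pearson coefficient; all names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented data: within each group, pleasantness is unrelated to level,
# but the "non-ideal" group is both louder and less pleasant on average.
ideal_levels, ideal_scores = [50, 52, 54, 56], [3, 4, 4, 3]
nonideal_levels, nonideal_scores = [60, 62, 64, 66], [-3, -2, -2, -3]

r_ideal = pearson_r(ideal_levels, ideal_scores)
r_nonideal = pearson_r(nonideal_levels, nonideal_scores)
r_pooled = pearson_r(ideal_levels + nonideal_levels,
                     ideal_scores + nonideal_scores)
```

In this constructed example both within-group coefficients are zero while the pooled coefficient is strongly negative, which is the pattern that motivates analyzing the two types of scenes separately.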

To sum up, for an unpleasant environment, sound levels, especially those of events, negatively influence the perceived pleasantness. On the contrary, for a pleasant environment, none of the sound levels considered in the study seem to influence the perceived pleasantness.

Those first outcomes tend to show that (1) two modes of perception exist depending on the nature of the environment (ideal or non-ideal), and (2) each involves distinct, independent features. The fact that L is not sufficient to characterize the pleasantness of the ideal scenes suggests that not all sound sources contribute equally to the perception of pleasantness. We thus put forward the hypothesis that only the level of some of them influences this perception. To investigate further in that direction, the next section analyzes the scenes from a semantic point of view, i.e., we look at the nature of the sources they are composed of.

Analysis of the semantic features

We study the composition of the scenes by counting the number of subjects who used a given class of sounds to simulate a given type of environment (ideal or non-ideal). For the 36 ideal and 36 non-ideal scenes considered, results are shown in Fig. 5a for events and in Fig. 5b for textures. For the sake of clarity, a transitional level of abstraction between level 0 and 1, named \(0+\), is used to depict classes, see Appendix Figs. 9, 10 and 11.
Fig. 5

Proportion of simulated scenes (i: ideal, ni: non-ideal) involving a specific class of sounds: (a) event classes at abstraction level 0+, (b) texture classes at abstraction level 0, (c) event sub-classes at abstraction level 1 belonging to the traffic and public transportation classes of abstraction level 0

We observe a noticeable difference in terms of class choices between the ideal and non-ideal scenes. The distribution of the classes is very similar to the one obtained in a related work on ideal urban soundscapes by Guastavino (2006), i.e., on one hand, classes involving human presence and nature prevail in the ideal scenes, and on the other hand, classes involving mechanical sounds and/or public works prevail in the non-ideal scenes. These results confirm a fact previously observed by Raimbault and Dubois (2005) and Dubois et al., (2006): the semantic nature of the sound sources plays an important role in the assessment of the environment.

Nevertheless, we notice some differences with the results of Guastavino (2006), which show that sounds of public transportation are specific to ideal urban soundscapes. The author interprets this by the fact that the perception of pleasantness is, among other things, due to socio-cultural factors. Thus, in our representation of the world, sounds of public transportation would be positively connoted and tend to be more accepted than sounds of personal vehicles.

To a certain extent, our results contradict this result. In fact, Fig. 5a shows that public transportation classes (bus and train, cf. Figure 5c) have been used by the subjects for \(28\%\) of the ideal scenes, and for \(42\%\) of the non-ideal scenes. Those results do not question the fact that sounds of public transportation are well accepted: \(25\%\) of the subjects used bus for the ideal scenes, a level that is comparable to the bike class, and much higher than all the personal vehicles classes. However, public transport classes are also strongly present in the non-ideal scenes, for instance more than light vehicle or truck classes. On the basis of our results, the public transportation class cannot be considered as typical of an ideal urban soundscape.

This difference may be explained by the nature of the experimental protocol used in the two studies. As in our study, Guastavino asks the subjects to describe an environment, but she asks them to perform this task starting only from their memories, whereas in our case subjects perform the same task using actual sound samples that they can listen to. The fact that subjects in our experiment are confronted with the acoustic reality of the sounds when composing the environment may have decreased the socio-cultural impact. Other studies that considered sounds as stimuli have shown that the bus class can have a negative influence on the assessment of the environment, see Lavandier and Defréville (2006).

Sound markers

Table 3

Event and texture classes identified as sound markers

Event sound markers

| Ideal scenes | Non-ideal scenes |
| --- | --- |
| | construction work (3.78) |
| church bell (4.5), bicycle bell (4.3), animal (4.2) | car horn (3.9), siren (3.9) |
| bird (4.8), church bell (4.4), bicycle bell (4.2) | car horn (4.0), siren (4.0) |
| bird song (4.8), church bell (4.3), bicycle bell (4.2), foot steps (3.6) | klaxon (4.1), siren (4.0) |

Texture sound markers

| Ideal scenes | Non-ideal scenes |
| --- | --- |
| courtyard / park (4.1) | construction (3.9) |
| park (3.65) | crossroad (3.6), construction vehicle (3.3) |
| park (3.64) | crossroad (3.56) |

In each cell, markers are ranked in decreasing order of V-test value, shown in parentheses. \(p\leq 0.01\) for all sound markers

We have shown that, from a qualitative point of view, the composition of the scenes in terms of sound sources differs between ideal and non-ideal scenes. We now investigate whether some of the sound classes are specific to a given environment. For that purpose, the V-test detailed in “Data and statistical analysis” is considered separately for each abstraction level. Results are presented in Table 3.

Regarding sound events, nine markers are identified across all abstraction levels. As shown in Fig. 5, classes related to human presence (male footsteps on concrete, bicycle bell) and to nature (animals, bird, and bird song) are ideal scene markers, as is the church bell class. This latter result may be due to the socio-cultural background of the subjects, who are mostly European citizens. In fact, according to Schafer, a sound that is identified by a person as being an important element of his/her environment is well accepted. Sound markers of non-ideal scenes are classes related to construction sites (construction works), or suggesting intense traffic (horn, siren).

Regarding textures, five markers are identified. For the ideal scenes, those are classes related to subdued or quiet ambiances (courtyard, park). The marker classes for the non-ideal scenes are, as for the events, related to construction sites (construction, construction vehicle), together with a class related to traffic (crossroads).

Although the identified markers are, on the whole, rather intuitive, none of the event classes related to motor vehicle noise is identified as a marker; the only such marker is the texture class crossroads. To convey unpleasant traffic, subjects chose the horn or siren classes instead. We thus conclude that isolated motor vehicle sounds are understood as being part of the urban environment, and thus their nature is not necessarily linked to an unpleasant soundscape.

Representation space induced by semantic features

In this section, we evaluate to what extent a semantic representation of the scenes allows us to discriminate between the two types of environments. For this purpose, a rank 5-precision is computed on the space induced by the semantic features S, and for each abstraction level (see “Data and statistical analysis”). The vectors S are built by using all the classes (ET), only the event classes (E), only the texture classes (T), only the event classes corresponding to sound markers (\(E_{m}\)), or only the event classes excluding sound markers (\(E_{w/o,m}\)). Texture classes corresponding to sound markers are not numerous enough to reliably compute the metric, and are thus not considered. For the same reason, event classes corresponding to sound markers at abstraction level 0 are also discarded. Results are shown in Fig. 6.
Fig. 6

Rank 5-precision (P@5) obtained by considering the dissimilarity matrix computed from the pairwise Hamming distances of the semantic feature vectors, as a function of the abstraction level. The vectors are built by using all the classes (ET), only the event classes (E), only the texture classes (T), only the event classes corresponding to sound markers (\(E_{m}\)), or only the event classes excluding sound markers (\(E_{w/o,m}\)). Baseline results are achieved by considering random vectors as input

Concerning ET, the rank 5-precision is \(76\%\) at abstraction level 0 (the most abstract), and remains above \(86\%\) for subsequent abstraction levels. Considering only the presence/absence of sound classes thus allows us to properly discriminate between the two types of environments. We also notice that the less abstract (and therefore more precise) the description is, the more effectively it predicts pleasantness.

Considering E and T separately, it appears that: (1) the rank 5-precision obtained with E is similar to the one obtained with ET; (2) the rank 5-precision obtained with T is always lower than the one obtained with E, by \(10\%\) to \(15\%\). Those results indicate that the discriminative semantic information is mostly carried by the events. They are in line with the results of Maffiolo (1999). As discussed in “Sound database”, it seems that humans analyze event scenes, which are composed of several sound events, in a descriptive manner, i.e., by identifying the sources.

The dimension of the vectors S for \(E_{m}\) is lower than that for E, itself lower than that obtained when all the classes are considered (ET). S being a Boolean vector, the smaller the dimension, the lower the amount of information it can carry. Despite this, the rank 5-precision obtained with \(E_{m}\) is equal (or superior) to the ones obtained with E or ET, although only partial information is used in that case to describe the scenes. Conversely, if the sound markers are not taken into account for the description (\(E_{w/o,m}\)), the performance falls below that achieved when considering only the textures as features. Thus, most if not all of the semantic information needed to differentiate between ideal and non-ideal scenes is carried by the markers.

To sum up, the outcomes of this analysis are:
  1. unlike what we outlined with the sound levels, a semantic description of the scenes composition in terms of presence/absence of sound sources allows us to reliably differentiate between the two types of environments (ideal or non-ideal);

  2. the semantic information is mainly contained in the event sound classes;

  3. only a part of the event classes, i.e., the sound markers, are useful to differentiate between the ideal and non-ideal scenes.

Since we have extracted the typical classes of the ideal and non-ideal scenes and verified that the distinction between those two types of scenes was largely dependent on the presence of these classes, we shall now investigate whether a description of the scenes only based on the sound pressure level of these sound markers could characterize the perceived pleasantness, perhaps better than a globally computed sound level.
Table 4

Linear correlation coefficients computed between mean perceived pleasantness \(\mathcal {A}_{scene}\) (Exp. 1.b) and sound levels related to the presence of sound markers. Statistically significant results shown in bold

| Sound level | Ideal scenes | Non-ideal scenes |
| --- | --- | --- |
| \(L_{m}\) | 0.03 (p = 0.88) | **−0.75 (p < 0.01)** |
| \(L(E)_{m}\) | 0.08 (p = 0.66) | **−0.71 (p < 0.01)** |
| \(L(T)_{m}\) | −0.11 (p = 0.66) | −0.17 (p = 0.37) |
| \(L_{b}\) | **−0.52 (p < 0.01)** | −0.32 (p = 0.06) |
| \(L(E)_{b}\) | **−0.51 (p < 0.01)** | −0.30 (p = 0.07) |
| \(L(T)_{b}\) | −0.32 (p = 0.05) | **−0.73 (p < 0.01)** |
| \(L_{m}-L_{b}\) | **0.67 (p < 0.01)** | −0.31 (p = 0.07) |
| \(L(E)_{m}-L(E)_{b}\) | **0.66 (p < 0.01)** | −0.28 (p = 0.10) |
| \(L(T)_{m}-L(T)_{b}\) | 0.16 (p = 0.54) | 0.21 (p = 0.28) |

Influence of sound marker levels on the perceived pleasantness

To do so, the correlations between \(\mathcal {A}_{scene}\) and the sound levels are evaluated. In this section, the sound levels are computed by taking into account only the previously identified sound markers. We define \(L_{m}\) (resp. \(L(E)_{m}\) and \(L(T)_{m}\)) as the sound level computed by taking into account only the sound markers, and \(L_{b}\) (resp. \(L(E)_{b}\) and \(L(T)_{b}\)) as the sound level computed by taking into account all the sound classes except the sound markers. When the feature characterizes an ideal scene (resp. non-ideal scene), only the markers identified for the ideal scenes (resp. non-ideal scenes) are considered. We henceforth call these two types of markers ideal markers and non-ideal markers. Results are shown in Table 4.
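To make the three families of features concrete, the sketch below splits the sources of one scene into markers and non-markers and derives \(L_{m}\), \(L_{b}\) and \(L_{m}-L_{b}\). It assumes that per-class levels in dB are available and that they combine energetically (the usual rule for incoherent sources); the level computation actually used in the study is the one described in its methods, and the class names, levels and function names here are hypothetical.

```python
import math


def energetic_sum_db(levels_db):
    """Combine levels in dB by summing energies; None if the list is empty."""
    if not levels_db:
        return None
    return 10 * math.log10(sum(10 ** (level / 10) for level in levels_db))


def marker_features(source_levels, markers):
    """Return (L_m, L_b, L_m - L_b) for one scene.

    source_levels: dict mapping a sound class to its level in dB (assumed input)
    markers: set of classes identified as markers for this type of scene
    """
    l_m = energetic_sum_db([v for c, v in source_levels.items() if c in markers])
    l_b = energetic_sum_db([v for c, v in source_levels.items() if c not in markers])
    diff = l_m - l_b if l_m is not None and l_b is not None else None
    return l_m, l_b, diff


# Invented scene: two ideal markers and one non-marker class.
toy_scene = {"bird": 60.0, "church bell": 60.0, "car": 55.0}
l_m, l_b, saliency = marker_features(toy_scene, {"bird", "church bell"})
```

Two sources at the same level add about 3 dB, so the difference feature in this toy scene is close to 8 dB.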

Concerning the sound levels, the same trends are measured between \(L_{m}\), \(L(E)_{m}\) and \(L(T)_{m}\) on one side, and L, \(L(E)\) and \(L(T)\) on the other side, see Fig. 7a and d. No matter whether all the classes or only the markers are considered, it appears that:
  1. a significant difference between levels of ideal and non-ideal scenes exists (\(L_{m}\), \(L(E)_{m}\) and \(L(T)_{m}\): \(p<0.01\));

  2. the sound level of scenes is mainly related to the sound events, compared to the textures;

  3. the sound level of events has an influence on the perception of pleasantness for non-ideal scenes, but not for ideal scenes;

  4. the sound level of textures does not play any role in the perception of the pleasantness.


To conclude, the level of non-ideal markers has a negative influence on the pleasantness for the non-ideal scenes. On the other hand, the level of ideal markers does not influence the perceived pleasantness for the ideal scenes.

Considering the non-marker classes, we observe for the ideal scenes a weak negative correlation for \(L_{b}\) (r = −0.52, \(p<0.01\)) and \(L(E)_{b}\) (r = −0.51, \(p<0.01\)), see Fig. 7b and e. It is the first time that an objective feature allows us to characterize the pleasantness of ideal scenes. This leads us to conclude that the level of the sound classes that are not typical of a pleasant environment has a negative influence on the pleasantness.
Fig. 7

Distribution of the relative sound levels related to the presence of markers \(L_{m}\) (a, d), \(L_{b}\) (b, e) and \(L_{m}-L_{b}\) (c, f), versus scene type (i: ideal, ni: non-ideal) (a, b, c) and perceived pleasantness \(\mathcal {A}_{scene}\) of experiment 1.b (d, e, f)

Moreover, whereas \(L(T)\) did not show any significant correlation for non-ideal scenes, a strong negative correlation is observed for \(L(T)_{b}\) (r = − 0.73, \(p < 0.01\)). This indicates that the level of non-marker texture classes does not influence the perceived pleasantness in the same way for ideal and non-ideal scenes. Sound level of textures seems to have a negative effect for the non-ideal scenes, and no significant effect for the ideal scenes.

A last group of features is now considered, namely \(L_{m}-L_{b}\), \(L(E)_{m}-L(E)_{b}\) and \(L(T)_{m}-L(T)_{b}\). These features describe the difference between the markers level and those of the other sound classes, see Fig. 7c and f. They express the saliency of the markers with respect to the sound mixture.

For the ideal scenes, a moderate positive correlation is observed for \(L_{m}-L_{b}\) (r = 0.67, \(p<0.01\)) and \(L(E)_{m}-L(E)_{b}\) (r = 0.66, \(p<0.01\)). For the non-ideal scenes, no correlation is observed. Thus, for the ideal scenes, it is not the absolute markers level that is important, but their relative level with respect to the other sounds composing the scene. A double perceptual mechanism for the ideal environments can thus be observed:
  • the higher the absolute level of sounds not being ideal markers, the weaker the pleasantness,

  • the higher the relative level of ideal markers compared to the remaining sounds, the higher the pleasantness.

On the contrary, the fact that we observe for the non-ideal scenes both significant correlations for \(L_{m}\) and \(L(E)_{m}\) and no correlation for \(L_{m}-L_{b}\) and \(L(E)_{m}-L(E)_{b}\) shows that it is indeed the absolute level that matters for non-ideal environments.


From this analysis, the following points can be outlined:
  • differentiating ideal and non-ideal scenes: the semantic features, and the global sound levels (L, \(L(E)\) and \(L(T)\)), allow us to differentiate reliably between ideal and non-ideal scenes. The semantic description seems to be more powerful;

  • events or textures: whatever the feature type, be it semantic or related to the sound pressure level, events are the most useful components of the scene to differentiate the two types of scenes; textures bring a limited amount of information;

  • pleasantness prediction: considering the correlation between sound levels and pleasantness, it seems that the way subjects perceived the quality of a given environment depends on its very nature (ideal or non-ideal). From the data gathered in those experiments, the same set of features cannot be considered to predict the pleasantness of ideal and non-ideal scenes:
    • for non-ideal scenes, the global levels (L and \(L(E)\)), or the level of sound markers (\(L_{m}\) and \(L(E)_{m}\)), have a negative influence on pleasantness. Taking into account the contribution of each of the different sources of the scene does not improve the prediction performance compared to a global analysis of the environment.

    • for ideal scenes, on the contrary, the sound markers characteristics and those of the other sounds have to be considered separately to predict the pleasantness. The markers level relative to the background noise (\(L(E)_{m}-L(E)_{b}\) and \(L_{m}-L_{b}\)) is positively correlated to the pleasantness, whereas the noise level (\(L_{b}\) and \(L(E)_{b}\)) is negatively correlated.

The fact that the pleasantness of the ideal scenes is not correlated to global physical features, contrary to the pleasantness of non-ideal scenes, has also been studied recently by Gozalo et al., (2015).

We can assume two perceptual modes of operation that involve different types of features and rely on the hedonic nature of the stimuli. It thus appears that the features considered for the pleasantness judgment also depend on a preliminary judgment of the global hedonic nature of the environment (ideal or non-ideal).

A similar phenomenon is observed for the perception of textures, see “Sound database”. It seems that the brain selectively adapts the way it encodes the information (statistic summary for textures, finer description for events) following a previous decision-making process based on the nature of the stimuli, i.e., is it an event or a texture?

Another hypothesis would be that the volume somehow acts as a hedonic “gain” factor. If the volume of a negative marker is high, it lowers the overall pleasantness. If the volume of a positive marker is high, it increases the overall pleasantness. Evidently, the positive gain is expected to saturate at a given level and to decrease quickly as the level rises above a given threshold.

Experiment 2: Modification of the semantic content


The previous experiment demonstrated that, among the classes of sounds occurring in urban soundscapes, those gathering markers are specific to some environments. Those sound markers seem to have a great impact on perception. This impact is studied here in more detail using an added benefit of the simulation paradigm proposed in this paper, the ability to manipulate and modify the generated scenes.

In order to investigate deeper into the relation between the pleasantness of ideal and non-ideal scenes and the sound markers, the audio waveforms of the simulated scenes are regenerated without the classes identified as markers. To do so, ideal markers are removed from the ideal scenes, and non-ideal markers are removed from the non-ideal scenes. To evaluate the impact on the perception of pleasantness caused by those modifications, a perceptual test is conducted with a protocol close to the one considered in experiment 1.b.

The objective of this experiment is to study whether the removal of the previously identified markers has an impact on the perceived pleasantness. Two hypotheses are thus formulated:
  • for the non-ideal scenes, we hypothesize that the absence of non-ideal markers will increase the pleasantness score;

  • for the ideal scenes, we hypothesize that the absence of the ideal markers will decrease the pleasantness score.

While the first hypothesis is rather intuitive, the second is less so. Indeed, it is not obvious that the removal of the ideal markers, though they carry a positive perceptual connotation, will decrease the global quality of a soundscape, since this removal also decreases the global sound level of the scene. However, as discussed before, the global amplitude level can only be considered as a partial indicator of pleasantness for the ideal soundscapes. Furthermore, the level of ideal markers positively impacts the pleasantness. For those reasons, the validation of the second hypothesis is of high interest.

Experiment 2: Methods


There are 144 stimuli of 30-s duration. More precisely:
  • 72 scenes with markers: the 72 scenes originally simulated by the subjects of experiment 1.a where the sound classes identified as markers are kept.

  • 72 scenes without markers: 72 scenes where the sound classes identified as markers are removed.

Apart from the presence or absence of markers, the scenes with and without markers are exactly the same.

In order to create the marker-less scenes with still some sound diversity and no absence of sound activity for long periods within the scene, only the sound classes of events of the first level of abstraction are removed, see Table 3. Those classes are:
  • church bell, bicycle bell, and animals for the ideal scenes without markers;

  • siren, car horn for the non-ideal scenes without markers.

Thus, only part of the ideal and non-ideal markers are removed from the scenes without markers.


The subjects evaluate the 144 scenes. The evaluation is done on an 11-point bipolar semantic scale ranging from −5 (non-ideal/very unpleasant) to +5 (ideal/very pleasant). Before rating a scene, subjects must listen to the first 20 s. After scoring, they are free to move on to the next scene.

For each subject, the scenes are presented in a random order. The first ten scenes allow the subject to calibrate their scores. Those calibration scenes are five ideal scenes with markers and five non-ideal scenes with markers. These first ten scenes are replayed at the end of the experiment, and only the scores given at the second occurrence are taken into account.

The experiment is scheduled to last 1 h. The subjects do not know the nature of the scenes.


All the subjects performed the experiment on standard desktop computers with the same hardware and software configurations. The audio files were played in diotic conditions through semi-open Beyerdynamic DT 990 Pro headphones. The output sound level was the same for all the subjects.

All the subjects performed the experiment simultaneously in a quiet environment. They were not allowed to talk to each other during the experiment.

An experimenter was available during the whole duration of the experiment in order to assist subjects and to answer questions.


Twelve subjects performed the experiment (eight male, four female; averaging 29.5 years of age, SD of 14 years). All the subjects had been living in Nantes, France, for at least 2 years at the time of the experiment and reported normal hearing. None of them took part in the previous experiments 1.a and 1.b.

All the subjects succeeded in doing the experiment.

Data and statistical analysis

The type of data analyzed in this experiment has already been considered for experiment 1.a, see “Data and statistical analysis” for details.

The aim here is to validate the hypothesis that the removal of ideal and non-ideal markers has a significant effect on the perceived pleasantness. To do so, we perform an analysis of variance (ANOVA). We consider \(\mathcal {A}_{subject}\) as a dependent variable, and the type of environment (ideal or non-ideal) as well as the presence/absence of markers as independent variables. As each subject evaluated the whole set of scenes, a two-factors repeated measures ANOVA with interaction is used to evaluate whether there exists a significant difference of perceived pleasantness between the scenes with and without markers. The two independent variables are considered as within-subject factors. The factors being of only two levels each (type: ideal or non-ideal, marker: with or without), the sphericity hypothesis does not need to be checked. Post hoc analyses are done using the Tukey–Kramer procedure.

All statistical significance tests are conducted with a critical threshold of \(\alpha = 0.05\).
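As a rough illustration of the two-factor repeated-measures design described above, the sketch below relies on the fact that, when every within-subject factor has exactly two levels, each effect reduces to a one-sample test on a per-subject contrast, with \(F=t^{2}\) and \((1, n-1)\) degrees of freedom for n subjects. The function names, the dictionary layout and the toy scores are illustrative assumptions, not the authors' code or data; p values and the Tukey–Kramer post hoc step are not covered.

```python
from statistics import mean, variance


def contrast_f(scores):
    """F statistic and degrees of freedom for H0: mean contrast = 0."""
    n = len(scores)
    # F = t^2, with t = mean / (sd / sqrt(n)); variance() uses n - 1.
    return n * mean(scores) ** 2 / variance(scores), (1, n - 1)


def rm_anova_2x2(subjects):
    """2x2 within-subject ANOVA from per-subject cell means.

    subjects: list of dicts keyed by (type, marker), with type in
    {"i", "ni"} and marker in {"km", "rm"} (with / without markers).
    """
    type_c = [(s["i", "km"] + s["i", "rm"] - s["ni", "km"] - s["ni", "rm"]) / 2
              for s in subjects]
    marker_c = [(s["i", "km"] + s["ni", "km"] - s["i", "rm"] - s["ni", "rm"]) / 2
                for s in subjects]
    inter_c = [(s["i", "km"] - s["i", "rm"]) - (s["ni", "km"] - s["ni", "rm"])
               for s in subjects]
    return {"type": contrast_f(type_c),
            "marker": contrast_f(marker_c),
            "interaction": contrast_f(inter_c)}


# Invented mean scores for three subjects (not study data).
toy_subjects = [
    {("i", "km"): 3.0, ("i", "rm"): 2.0, ("ni", "km"): -3.0, ("ni", "rm"): -1.0},
    {("i", "km"): 4.0, ("i", "rm"): 3.0, ("ni", "km"): -2.0, ("ni", "rm"): -2.0},
    {("i", "km"): 2.0, ("i", "rm"): 2.0, ("ni", "km"): -4.0, ("ni", "rm"): -2.0},
]
toy_result = rm_anova_2x2(toy_subjects)
```

Because each effect has a single degree of freedom, no sphericity correction is involved, which matches the remark made in the text about two-level factors.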


Outliers detection

Let us consider \(\mathcal {A}_{subject}\) for the scenes with markers, see Fig. 8a. Close inspection of the answers shows that one subject’s judgments strongly differ from the others’. This subject evaluated positively more than half of the non-ideal scenes with markers (see Appendix B, Fig. 12, subject 7), giving a score above 0 for 58% of them, compared with an average of 11% for the other subjects.
Fig. 8

Distribution of the mean perceived pleasantness per subject \(\mathcal {A}_{subject}\) (a) and the mean perceived pleasantness per scene \(\mathcal {A}_{scene}\) (b) versus scene type (i: ideal, ni: non-ideal, km: with markers, rm: without markers). Black stars in subfigure (a) stand for the detected outlier, i.e., subject 7

Furthermore, this subject used the whole scale (from −5 to 5) to score both the ideal scenes and the non-ideal scenes. Those behaviors strongly differ from those of the other subjects, be they from this experiment or the previous ones. Subject 7 is thus discarded from the analysis.
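The first criterion used in this inspection, the share of non-ideal scenes with markers that a subject scores above 0 compared with the average share of the other subjects, can be computed as sketched below. The study reports a visual inspection rather than an automated rule, so the function names and the toy scores are assumptions made for illustration only.

```python
def share_above_zero(scores):
    """Fraction of scores strictly greater than 0."""
    return sum(1 for s in scores if s > 0) / len(scores)


def compare_to_others(scores_by_subject):
    """Map each subject to (own share, mean share of the other subjects)."""
    shares = {subj: share_above_zero(sc)
              for subj, sc in scores_by_subject.items()}
    report = {}
    for subj, share in shares.items():
        others = [v for k, v in shares.items() if k != subj]
        report[subj] = (share, sum(others) / len(others))
    return report


# Invented scores given to non-ideal scenes with markers (not study data).
toy_scores = {
    "s1": [-3, -2, 1, -4],
    "s2": [-1, -5, -2, -3],
    "s3": [2, 3, -1, 1],
}
toy_report = compare_to_others(toy_scores)
```

A subject whose own share is far above the mean share of the others, as for the hypothetical "s3" here, would be a candidate for closer inspection.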

Influence of the markers on perceived pleasantness

In this section, we study the scores given by the subjects while listening to the different types of scenes, namely the ideal scenes with markers, non-ideal scenes with markers, ideal scenes without markers and non-ideal scenes without markers, see Fig. 8b. The repeated measures ANOVA applied to \(\mathcal {A}_{subject}\) shows a significant effect of the environment type (F[1,10] = 175, \(p<0.01\)), of the presence/absence of markers (F[1,10] = 7, \(p<0.05\)), and of the interaction between those two factors (F[1,10] = 67, \(p<0.01\)).

The post hoc analysis exhibits significant differences between all groups of observations, notably between the ideal scenes with markers and the ideal scenes without markers (p < 0.05) as well as between the non-ideal scenes with markers and the non-ideal scenes without markers (p < 0.01).

These results indicate that the removal of the markers indeed modifies the perception of the scenes by the subjects, see Fig. 8. Our two hypotheses are thus verified:
  • the removal of non-ideal markers improved the pleasantness of the non-ideal scenes;

  • the removal of ideal markers reduced the pleasantness of the ideal scenes.

The significant interaction shows that the type of environment influences the effect of the presence/absence of the markers. Indeed, the average difference of \(\mathcal {A}_{subject}\) between the scenes with markers and the scenes without markers is larger for the non-ideal scenes (1.1) than for the ideal scenes (0.5).


This experiment shows that the presence of the markers identified during the analysis of experiment 1 does have an impact on the perceived pleasantness. The removal of the non-ideal markers has a positive effect on the perception of the non-ideal scenes. Perhaps more surprisingly, the removal of the ideal markers slightly decreases the perceived pleasantness of the ideal scenes. This observation is all the more striking since the acoustic pressure level of the ideal scenes with markers is higher than that of the ideal scenes without markers: removing the markers lowers the level, yet it also lowers the pleasantness.

This strongly confirms that ideal markers do have a positive impact on the perception of an environment. The fact that their removal decreases \(\mathcal {A}_{scene}\) indicates that it should be possible to improve the perceived pleasantness of a given urban area by the addition of sounds commonly considered as pleasant such as bird calls. Those conclusions are in line with the positive approach introduced by Schafer (1977).
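
The level argument made above can be illustrated with a short sketch: if sources are assumed to add incoherently (energetic summation), removing any source can only lower the overall level. The summation rule is a standard acoustics approximation taken as an assumption here; it is not necessarily the level computation used in this study, and the function name and the dB values are made up for the example.

```python
from math import log10


def total_level_db(source_levels_db):
    """Overall level, in dB, of incoherently summed sources given their levels in dB."""
    return 10 * log10(sum(10 ** (level / 10) for level in source_levels_db))


# Hypothetical ideal scene: a background texture plus a bird marker.
with_marker = total_level_db([60.0, 55.0])   # background + birds
without_marker = total_level_db([60.0])      # background only
```

With these made-up values the scene with the marker is about 1.2 dB louder than the scene without it, so a descriptor based only on the overall level would predict the opposite of what the subjects reported for ideal scenes.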

Outcomes for soundscape perception

This series of experiments showed that most of the descriptors used in this study, be they of a semantic or acoustic nature, allow us to distinguish between an ideal scene and a non-ideal one.

That being said, we observe that the physical characteristics correlated with the perceived pleasantness clearly differ depending on the type of scenes. In the case of ideal scenes, it is above all the emergence of sound markers that determines the perceived quality, whereas in the case of non-ideal scenes, it is the overall sound level that prominently influences it.

These results show that the perception of the qualities of a scene indeed depends primarily on its identifiable sound sources. The characteristics that are taken into account during the perceptual process appear to vary from one source to another and from one type of environment to another. This leads the authors to believe that there is little hope of finding a holistic physical descriptor that can adequately account for the affective qualities of all types of sound environment.

Those results may have an impact on the relevant strategies to adopt while trying to improve the quality of a sound environment:
  • in the case of non-ideal scenes, one should focus on reducing the acoustic pressure level, whether globally, or by discarding specific sources such as sirens or car horns

  • in the case of ideal scenes, one should first identify which sources are pleasant to the targeted community, then lower the volume of the other sound sources, and, if possible, raise the contribution of positive sound markers or add new ones.

The present results allow us to conjecture as to the nature of the mental representations of the concepts “pleasant urban sound environment” and “unpleasant urban sound environment”.

First, the fact that the semantic information (which sound source is present) and structural information (at which level) are different for ideal and non-ideal scenes leads us to believe that these two types of information characterize the above cited concepts.

Second, the fact that the removal of sound markers changes the perceived pleasantness leads us to believe that the abstract concept related to pleasantness depends on the activation of a network of concepts strongly linked to the sources, which in this study are bird, church bell, and bicycle bell.

Related computational approaches

Considering sound scene synthesis or composition as part of an experimental protocol requires software resources that are simple to manipulate for subjects with little training, and that can possibly run on many types of hardware. While most processing frameworks that have been considered are developed in native code for efficiency, such as Marsyas developed by Tzanetakis and Cook (2000) or Clam developed by Amatriain et al., (2006), many advanced sound manipulations can now be performed in modern web browsers using the webaudio library.

Closer to our sound model is the Tapestrea framework proposed by Misra et al., (2007) that allows the manipulation of sounds through wavelet and sinusoidal modeling for acoustic composition or sound design. The main issue with considering this framework for soundscape perception studies is that the sounds are processed using sound models that would strongly decrease the ecological validity of the study. That being said, the sequencing model is quite close to the one considered in this study.

The experiments described here do not introduce yet another sound manipulation framework, and the tool is thus not designed to be extensible. Rather, they are intended to demonstrate the following:
  • the Webaudio JavaScript library is useful to design advanced experimental protocols that involve sound manipulation over the Internet

  • the considered soundscape model is rich enough to address important research questions within the soundscape community, but simple enough to be grasped by novice participants.

Extensions to other sound parameters beyond acoustic levels and sound events distribution could be integrated as the software is open source. However, the control of those new parameters would need a specific user experience study to maintain a sufficient degree of usability of the controlling interface.


The outcomes of the experiments described in this paper demonstrate the usefulness of a dedicated simulation tool such as simScene for scientifically investigating the perception of soundscapes in an innovative way. We also believe that its wider usage could enable urban planning decision-makers to question an entire community about its own representations of the sound environment to which it is exposed, and about the representations of the sound environments to which it would like to be exposed.

Future work should consider several other simulation experiments by changing the emotional qualities (quiet, comfortable, troublesome, etc.), but also by targeting specific urban locations (park, square, street, etc.), in order to provide the scientific community with an entire corpus of cognitively informed soundscapes.

There are also many more avenues of research to fully explore the capabilities of the proposed paradigm: first, by taking into account a wider range of structural features (for example the density or regularity of events), and second, by studying further the effects caused by the deliberate modification of scene composition, such as the suppression of the sound markers carried out in experiment 2.

One interesting avenue in this direction would be to study the impact of adding positively appreciated sound markers to a non-ideal scene in order to study the hypothesis that this type of addition would improve the perceived quality of the scene.

Finally, one should study the influence of socio-cultural contexts on perception. Indeed, if the sound of the church bell is most often a quality environment marker for a Westerner, this does not necessarily hold true for subjects of Eastern, Middle Eastern, or other cultures.

Once again, besides the interesting possibilities already mentioned, the simulation protocol presented here as well as its implementation brings in this case three advantages:
  • The simulator can be deployed on a large scale via the Internet thanks to the Web-based software architecture, provided that the sound databases are adapted to the cultural background of the test subject population;

  • Simulated scenes can be analyzed without the need to take into account the different mother tongues of the subjects, since the semantic natures of the used sound classes are known a priori by the experimenter;

  • Simulated scenes can be analyzed without the need to annotate the sound scene to identify the sources, since their occurrences in the simulated scene are directly available.
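
The last advantage can be illustrated with a minimal sketch in which a simulated scene is stored as a list of labeled events, so that the presence or absence of a sound class, or the removal of markers as in experiment 2, is a direct lookup rather than an annotation task. The `Event` structure, its field names, and the toy scene are hypothetical and do not reflect the actual simScene data format; only the marker names (bird, church bell, bicycle bell) come from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sound_class: str  # semantic label of the sound chosen by the subject
    onset_s: float    # onset time in seconds (hypothetical field)
    level_db: float   # playback level in dB (hypothetical field)


def present_classes(scene):
    """Set of sound classes that occur at least once in the scene."""
    return {event.sound_class for event in scene}


def remove_markers(scene, markers):
    """Copy of the scene without the events whose class is a marker."""
    return [event for event in scene if event.sound_class not in markers]


# Toy scene; the classes other than the three markers are made up.
scene = [
    Event("bird", 0.0, 55.0),
    Event("car", 2.5, 62.0),
    Event("church bell", 10.0, 58.0),
    Event("footsteps", 12.0, 50.0),
]
ideal_markers = {"bird", "church bell", "bicycle bell"}
scene_without_markers = remove_markers(scene, ideal_markers)
```

Because the labels are attached when the subject builds the scene, `present_classes(scene)` gives the semantic description directly, independently of the language in which the subject would have described it.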




Research project partly funded by ANR-11-JS03-005-01. The authors would like to thank the students of the Ecole Centrale de Nantes for their willing participation.


  1. Aletta, F., Kang, J., & Axelsson, O. (2016). Soundscape descriptors and a conceptual framework for developing predictive soundscape models. Landscape and Urban Planning, 149, 65–74.
  2. Amatriain, X., Arumi, P., & Garcia, D. (2006). Clam: A framework for efficient and rapid development of cross-platform audio applications. In Proceedings of the 14th ACM international conference on Multimedia (pp. 951–954): ACM.
  3. Axelsson, Ö., Berglund, B., & Nilsson, M.E. (2005). Soundscape assessment. The Journal of the Acoustical Society of America, 117, 2591–2592.
  4. Beaumont, J., Lesaux, S., Robin, B., Polack, J.-D., Pronello, C., Arras, C., & Droin, L. (2004). Pertinence des descripteurs d’ambiance sonore urbaine. Acoustique et techniques.
  5. Bostock, M., Ogievetsky, V., & Heer, J. (2011). D3 data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17, 2301–2309.
  6. Brown, A., Kang, J., & Gjestland, T. (2011). Towards standardization in soundscape preference assessment. Applied Acoustics, 72, 387–392.
  7. Bruce, N.S., Davies, W.J., & Adams, M.D. (2009). Development of a soundscape simulator tool. In: Proceedings of the 38th International Congress and Exposition on Noise Control Engineering (InterNoise). Ottawa, Canada.
  8. Bruce, N.S., & Davies, W.J. (2014). The effects of expectation on the perception of soundscapes. Applied Acoustics, 85, 1–11.
  9. Cain, R., Jennings, P., & Poxon, J.E. (2013). The development and application of the emotional dimensions of a soundscape. Applied Acoustics, 74, 232–239.
  10. Davies, W.J., Adams, M.D., Bruce, N.S., et al. (2013). Perception of soundscapes: an interdisciplinary approach. Applied Acoustics, 74, 224–231.
  11. Defréville, B., Lavandier, C., & Laniray, M. (2004). Activity of urban sound sources. In: Proceedings of the 18th International Congress in Acoustics (ICA). Kyoto, Japan.
  12. Devergie, A. (2006). Relations entre Perception Globale et Composition de séquences Sonores. Master’s thesis IRCAM, Paris VI UPMC.
  13. Dubois, D., Guastavino, C., & Raimbault, M. (2006). A cognitive approach to urban soundscapes: Using verbal data to access everyday life auditory categories. Acta Acustica united with Acustica, 92, 865–874.
  14. Gozalo, G.R., Carmona, J.T., Morillas, J.B., Vílchez-Gómez, R., & Escobar, V.G. (2015). Relationship between objective acoustic indices and subjective assessments for the quality of soundscapes. Applied Acoustics, 97, 1–10.
  15. Guastavino, C. (2003). Etude sémantique et acoustique de la perception des basses fréquences dans l’environnement sonore urbain, (Semantic and acoustic study of low-frequency noises perception in urban sound environment). France: Ph.D. thesis Université Paris VI UPMC Paris.
  16. Guastavino, C. (2006). The ideal urban soundscape: Investigating the sound quality of French cities. Acta Acustica united with Acustica, 92, 945–951.
  17. Kuwano, S., Namba, S., Kato, T., & Hellbrück, J. (2003). Memory of the loudness of sounds in relation to overall impression. Acoustical Science and Technology, 24(4), 194–196.
  18. Lafay, G., Rossignol, M., Misdariis, N., Lagrange, M., & Petiot, J.-F. (2014). A new experimental approach for urban soundscape characterization based on sound manipulation: A pilot study. In Proceedings of the International Symposium on Musical Acoustics (ISMA) Le Mans. France: SFA.
  19. Lafay, G., Misdariis, N., Lagrange, M., & Rossignol, M. (2016). Semantic browsing of sound databases without keywords. Journal of the Audio Engineering Society, 64, 628–635.
  20. Lavandier, C., & Defréville, B. (2006). The contribution of sound source characteristics in the assessment of urban soundscapes. Acta Acustica united with Acustica, 92, 912–921.
  21. Leobon, A. (1986). Analyse psycho-acoustique du paysage sonore urbain (Psychoacoustic analysis of urban soundscape). Ph.D. thesis Université Louis Pasteur Strasbourg, France.
  22. Maffiolo, V. (1999). De la caractérisation sémantique et acoustique de la qualité sonore de l’environnement urbain, (Semantic and acoustical characterisation of the sound quality of urban environment). Ph.D. thesis Université du Mans Le Mans, France.
  23. McDermott, J.H., & Simoncelli, E.P. (2011). Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71, 926–940.
  24. McDermott, J.H., Schemitsch, M., & Simoncelli, E.P. (2013). Summary statistics in auditory perception. Nature Neuroscience, 16, 493–498.
  25. Misra, A., Wang, G., & Cook, P. (2007). Musical tapestry: Re-composing natural sounds. Journal of New Music Research, 36, 241–250.
  26. Nelken, I., & de Cheveigné, A. (2013). An ear for statistics. Nature Neuroscience, 16, 381–382.
  27. Niessen, M.E., Cance, C., & Dubois, D. (2010). Categories for soundscape: Toward a hybrid classification. In: Proceedings of the 39th International Congress and Exposition on Noise Control Engineering (InterNoise), (pp. 5816–5829). Lisbon, Portugal.
  28. Nilsson, M.E. (2007). Soundscape quality in urban open spaces. In: Proceedings of the 36th International Congress and Exposition on Noise Control Engineering (InterNoise). Istanbul, Turkey.
  29. Polack, J.-D., Beaumont, J., Arras, C., Zekri, M., & Robin, B. (2008). Perceptive relevance of soundscape descriptors: a morpho-typological approach. Journal of the Acoustical Society of America, 123, 3810.
  30. Raimbault, M. (2002). Simulation des ambiances sonores urbaines: intégration des aspects qualitatifs (Urban soundscape simulation: focusing on qualitative aspect). Ph.D. thesis Université de Nantes - Ecole polytechnique de Nantes, Nantes, France.
  31. Raimbault, M., & Dubois, D. (2005). Urban soundscapes: Experiences and knowledge. Cities, 22, 339–350.
  32. Rakotomalala, R., & Morineau, A. (2008). The TVpercent principle for the counterexamples statistic. In Statistical Implicative Analysis (pp. 449–462): Springer.
  33. Ricciardi, P., Delaitre, P., Lavandier, C., Torchia, F., & Aumond, P. (2015). Sound quality indicators for urban places in Paris cross-validated by Milan data. The Journal of the Acoustical Society of America, 138, 2337–2348.
  34. Rosch, E., & Lloyd, B.B. (1978). Cognition and categorization. New Jersey: Hillsdale.
  35. Rossignol, M., Lafay, G., Lagrange, M., & Misdariis, N. (2015). Simscene: a Web-based acoustic scenes simulator. In: Proceedings of the Web Audio Conference (WAC) Paris. France: IRCAM.
  36. Saint-Arnaud, N. (1995). Classification of sound textures. Master’s thesis Massachusetts Institute of Technology.
  37. Salamon, J., Jacoby, C., & Bello, J.P. (2014). A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia Orlando. FL, USA.
  38. Schafer, R. (1977). The Tuning of the World. Borzoi book. New York: Knopf. (Reprinted as Our Sonic Environment and the Soundscape: The Tuning of the World. Destiny Books 1994).
  39. Schafer, R.M. (1993). The soundscape: Our sonic environment and the tuning of the world. Simon and Schuster.
  40. Szeremeta, B., & Zannin, P.H.T. (2009). Analysis and evaluation of soundscapes in public parks through interviews and measurement of noise. Science of the Total Environment, 407, 6143–6149.
  41. Tzanetakis, G., & Cook, P. (2000). Marsyas: a framework for audio analysis. Organised Sound, 4, 169–175.

Copyright information

© Psychonomic Society, Inc. 2018

Authors and Affiliations

  1. Laboratoire des Sciences du Numérique de Nantes-CNRS-École Centrale de Nantes, Nantes, France
  2. STMS Ircam-CNRS-SU Institut de Recherche et Coordination Acoustique/Musique, Paris, France
