1 Introduction

Virtual Reality (VR) represents one of the most innovative and revolutionary tools of recent years, owing to the significant interest from various sectors (e.g. tourism, education, medicine, military, industry) in its applications and innovative techniques (Lawson et al. 2015; Maffei et al. 2015; Schofield et al. 2018; Teofilo et al. 2018). One of its main features is to help designers create virtual environments with specific rules, providing potential users with holistic experiences and ecological interactions. In the beginning, it was merely used as a real-time visual rendering platform; however, VR does not reproduce just visuals. It can now include several additional sensory modalities, i.e. sound (Maffei et al. 2015; Erkut et al. 2018; Schoeffler et al. 2015), touch (Richard et al. 2006; Ammi and Katz 2015) and smell (Jiang et al. 2016; Cheok and Karunayaka 2018; Kerruish 2019).

Nowadays, game engines are the software platforms most widely used by VR developers. They are preferred because they provide reusable components to VR designers, are interfaced with several hardware devices (e.g. HMDs, gloves, motion tracking), and can be upgraded and extended with new, specialized software and plugins. Even though these engines, such as Blender, Unity3D, CryEngine or Unreal Engine 4 (UE4), have reached very high levels of realism, especially in their visual aspects, their approach is mainly based on the parametric control of the scene inputs. Although this approach is widely accepted by the gaming and entertainment industries, it needs to be validated for architecture and engineering applications, where VR is used for design and assessment purposes. This new and ambitious role for game engines would require reproducing the surrounding environment through phenomenological relationships, not just a parametric approach; efforts are being directed at a plausible reproduction of real environments.

The physically based rendering (PBR) revolution was one of the most important steps towards the idea of physically based modelling (Pharr et al. 2017). From this perspective, owing to the predominant role of the visual aspect in game engines, some improvements have already been introduced in the lighting area, with the control of luminance and colour temperatures as well as the importing of IES profiles, a lighting industry-standard method of distributing physical photometric data (Scorpio et al. 2020). Less attention, however, has been given to sound. Most of the recent studies are descriptive and introductory (Garner 2018; Llorca 2018; Beig et al. 2019), with only a few having carried out systematic research on 3D sound reproduction in game engines (Schissler and Manocha 2011; Schröder and Vorländer 2011; Poirier-Quinot et al. 2017; Cuevas-Rodríguez et al. 2019; Amengual et al. 2019).

This paper focuses on the physically based sound rendering performance of one of the leading game engines, UE4, and on the possible improvements brought by a middleware (Wwise) specialized in 3D interactive audio design. The study's goal is to assess how successfully one of the most popular game engines renders 3D spatial audio, and what advantages come with the use of additional plug-ins. Despite the existence of better choices for acoustic simulation, most 3D spatial audio production in recent years has been done with these and similar tools, so the results of this study can benefit both the VR community and architectural acousticians. Throughout the study, the modes of operation of the game engine and of the middleware are explained comparatively. The game engine's performance alone and its use together with the middleware were analysed in case studies regarding direct sound, early reflections and late reverberations.

1.1 Auralization

Several methods and techniques have been developed in acoustics to model, simulate and measure indoor and outdoor sound fields (Krokstad et al. 2015). Since the pioneering study by Krokstad, Strøm and Sørsdal (1968), progress in information technology and acoustics has led to the development of building acoustics software capable of calculating impulse responses and simulating sound propagation in complex-shaped rooms using Geometrical Acoustics (GA) algorithms. These algorithms can provide accurate results "as long as the room's dimensions are large compared to the wavelengths and if broadband signals are considered" (Vorländer 2008). To overcome this limitation, hybrid techniques (Siltanen et al. 2010) have recently been used (Schröder 2011), combining wave-based methods (WBM), such as Finite Difference Methods (FDM) (Botteldooren 1995), the Finite Element Method (FEM) (Craggs 1998) and the Boundary Element Method (BEM) (Hamilton 2016; Hargreaves and Cox 2009), for the frequencies below the critical frequency, with geometrical acoustics methods, such as the Image Source Method (Allen and Berkley 1979), Ray and Particle Tracing (Schroeder 1969), Beam and Cone Tracing (Vian and van Maercke 1986) and Acoustic Radiosity (Lewers 1993), for the frequencies above it. However, commercial tools such as CATT-Acoustic, Odeon and EASE (Ahnert and Feistel 1993; Naylor 1993; Savioja et al. 1999; Dalenback 2014) mostly use hybrid methods combining two or more GA methods, because WBM hybrids are challenging to implement and require more computational power. An overview of these techniques and the general trends in room acoustics research can be found in Savioja and Svensson (2015). These hybrid simulation methods allow accurate auralizations to be obtained, even though long calculation times are still necessary for complex and large rooms (Naylor 1993; Schröder and Vorländer 2011). The term auralization was introduced by Kleiner et al. (1993) "to be used in analogy with visualization to describe rendering audible (imaginary) sound fields". This static acoustic simulation approach still represents a baseline for both commercial and academic users in architectural acoustics. One of its restrictive aspects is that it requires pre-calculation: the results hold only for specific source-receiver locations at fixed head orientations. More recently, a new field of research called Virtual Acoustics has emerged. While it is difficult to draw a line between the two approaches, the latter represents the dynamic auralization concept: it focuses on reproducing the 3D sonic environment and its characteristics to give listeners a sensation of being immersed in a simulated or recorded acoustic environment (Vorländer et al. 2015; Dodds et al. 2019).

The increased calculation power of computers has led software designers to use real-time ray tracing methods in their applications. Several systems dealing with different aspects of virtual acoustics have already been studied by various scholars and research groups. Pioneering work has been carried out by Savioja and Lokki (Savioja et al. 2002; Siltanen et al. 2014; Torres et al. 2014) and by Schröder and Vorländer (Schröder and Vorländer 2011; Vorländer et al. 2015; Vorländer 2016, 2020). The research group of the Helsinki University of Technology implemented an auralization engine called DIVA (Digital Interactive Virtual Acoustics) (Savioja 1999; Lokki 2002), which was later used in the EVE Virtual Reality project (e.g. Grübel et al. 2017; Hackman et al. 2019). Another real-time simulation framework, RAVEN (Room Acoustics for Virtual Environments), was developed at RWTH Aachen University by the research group of Michael Vorländer and Dirk Schröder (Schröder 2011). RESound is another algorithm designed for interactive sound rendering in virtual environments, developed by a research group at North Carolina University (Taylor et al. 2009); its framework is the most similar to that of Wwise, which is assessed in this study. An additional real-time auralization engine, EVERTims, was developed by Katz et al. (Noisternig et al. 2008) and integrated with Blender. For a period, it was the only free and open-source kit on the market, until the Virtual Acoustics (VA) framework, the open-source version of RAVEN, was released by the Institute of Technical Acoustics, RWTH Aachen University (2018). REVES, developed by INRIA, focuses mainly on handling large numbers of sources in VR scenes (Moeck et al. 2011). Other applications, such as the CATT-Walker module, are in-between solutions that calculate IRs at each receiver position of an audience grid and interpolate between these IRs to present a continuous walk-through, immersive simulation (Dalenbäck and Strömberg 2006).

On the other hand, the interest in 3D audio has created a new point of convergence between signal processing approaches in audio design and physical rendering methods in building acoustics. In recent years, game audio has become a field of intersection for both. Today, game engines are the most common solution for creating virtual scenarios, and the requirements of creating 3D virtual environments have driven their tools for 3D audio spatialization and the development of 3D audio design techniques.

In addition to each game engine's individual endeavours to improve its audio engine and sound reproduction quality, there is a series of tools that focus directly on VR audio with specific aims. As Garner (2018) outlined, the three main options are Audio Source Development Kits (SDKs) produced by VR firms (e.g. Oculus Audio SDK, VR Works Audio SDK, 3D Tune-In Toolkit), third-party plugins (e.g. Steam Audio), and middleware tools that can create sound objects and events separately (e.g. FMOD, Wwise, the Wave Works Interactive Sound Engine by Audiokinetic). It is also worth mentioning other plug-ins produced by the VR audio community, such as Real Space 3D Audio, Csound, Slab3d, hrir, earplug, Anaglyph, HOBA-VR, OpenAL Soft, SoLoud, Microsoft Spatial Audio, Google Resonance Audio, VA and EVERTims, but they all have different targets. The aim of Google Resonance is to create binaural audio by providing HRTF modules for mobile devices. In the words of Dan Reynolds, Oculus Audio is "focused on simulating spatialisation and not as heavy in creating physically simulated acoustic environments but rather algorithmic approximations of spaces" (Dan Reynolds 2018). The Microsoft Spatial Audio kit follows a similar philosophy, likewise targeting spatialization. A comparison of popular open-source audio spatialization tools is provided by Cuevas-Rodríguez et al. (2019, pp. 26–27).

Among these, Steam Audio, EVERTims, VA by RWTH Aachen University, and Wwise are the prominent options for physically based modelling. While Steam Audio and Wwise target game audio design, EVERTims and VA focus on architectural acoustics. Both EVERTims and VA require a certain level of coding (C++) knowledge, which excludes a large part of architectural acousticians and users without coding experience. They are also insufficient for many game audio design tasks, including creating imaginary worlds and musical or soundscape compositions.

On the other hand, Steam Audio and Wwise are easier to use and may be considered in-between solutions for game audio design and physically based modelling. Steam Audio uses a geometrical acoustics method to render spatial audio, an idea very similar to that of static acoustics software. While it offers a real-time spatialization option, it also has a pre-calculated, so-called baked, option to decrease the CPU usage of the audio thread. It also supports HRTFs, including custom HRTF data (Valve Corp. 2017). Audiokinetic's Wwise, on the other hand, is a compilation of several plug-ins and tools, assessed from the point of view of architectural acoustics in the following sections. After the first two analyses, which focused on the attenuation (Masullo et al. 2018a) and reverberation (Masullo et al. 2018b) functions of UE4, Wwise was chosen in this study for further analysis together with UE4.

2 Sound propagation

In theory, the analysis of sound propagation is divided into three main stages: source, transmission path and receiver. This article dwells on the transmission path, where the sound is reflected, scattered, diffracted, absorbed or occluded. Although most recent research has focused on localization and HRTF functions for 3D interactive audio (Hu et al. 2008; Gan et al. 2017; Berger et al. 2018; Geronazzo et al. 2018; Amengual Gari et al. 2020), all the phases of dynamic acoustic simulation need to be investigated in depth. Another common practice in sound propagation modelling is to divide the room response into three parts: (1) Direct Sound, (2) Early Reflections, (3) Late Reverberations.

Direct Sound (DS) is the energy transmitted directly from the source to a receiver without any reflection, theoretically under free-field conditions. It is mainly used for source localization. The energy decrease of this sound is directly linked to the path along which it travels: the speed of sound within the medium, the medium's viscosity and its motion are the factors that affect propagation.

Early Reflections (ER) are the energy that arrives at the receiver within the first 50 or 80 ms after the direct sound. They consist of low-order reflections and carry important information for source localization. They also affect the perceived width, timbre and spaciousness of the sound.

Late Reverberation (LR) is the remaining reflected energy, which arrives at the receiver after the ER. It lacks directional information and forms the diffuse-like part of the sound field.

Figure 1 shows the temporal and spatial evolution of a room's response to an impulsive sound, the impulse response (IR). The green ray shows the direct sound, the red rays show the early reflections, and the remaining rays, shown in yellow, are the late reverberation.

Fig. 1 Direct sound, early reflections and late reverberations

Most room acoustic parameters, such as reverberation time, clarity, definition and lateral energy, are directly derivable from the IR (Schröder 2011). Short descriptions of the parameters used in this study are provided below.

2.1 Room acoustic parameters

The Reverberation Time (T30) and Clarity (C50) were used to analyse the reflection simulation with UE4-Wwise further. Reverberation is the most characteristic acoustic phenomenon of enclosed spaces: when an impulsive sound is played in a room, it does not disappear immediately; the time it takes for the sound level to decay by 60 dB is called the reverberation time (RT). Where a full 60 dB decay cannot be observed, RT must be estimated by linear interpolation of the decay curve. According to the Sabine equation, the RT in a diffuse field can be calculated as follows:

$$RT=0.16\frac{V}{S\alpha }$$
(1)

where V is the room's volume, S is the total surface area and α is the average absorption coefficient.
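As a quick numerical illustration, Eq. (1) can be evaluated directly from a room's geometry. The following minimal C++ sketch applies it to a 4 m cube like the "simple box" case used later in Sect. 5 (pairing that room with α = 0.10 here is only an illustrative assumption):

```cpp
#include <cstdio>

// Sabine reverberation time, Eq. (1): RT = 0.16 * V / (S * alpha),
// with V in m^3, S in m^2 and alpha the average absorption coefficient.
double SabineRT(double volumeM3, double surfaceM2, double avgAlpha)
{
    return 0.16 * volumeM3 / (surfaceM2 * avgAlpha);
}

int main()
{
    // Illustrative check: a 4 m cube with alpha = 0.10 on all six surfaces.
    const double side = 4.0;
    const double V = side * side * side;   // 64 m^3
    const double S = 6.0 * side * side;    // 96 m^2
    std::printf("RT = %.2f s\n", SabineRT(V, S, 0.10)); // prints RT = 1.07 s
    return 0;
}
```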

Clarity is defined as the ratio of early reflections to late reverberation (Reichardt et al. 1974). It is calculated as the ratio between the sound energy that reaches the receiver within the first 50 ms and the remaining energy, as in Eqs. (2) and (3); a computational sketch follows the equations. The time threshold of 50 ms is mostly used for speech (C50), while 80 ms is used for music (C80) (Vorländer 2008). Clarity gives an image of the balance between early reflections and late reverberation. C50 is mainly used for intelligibility analysis because the early reflections support speech intelligibility.

$$C_{50} = 10\log \frac{{\int\nolimits_{0}^{50\,ms} {p^{2} } \left( t \right)dt}}{{\int\nolimits_{50\,ms}^{\infty } {p^{2} } \left( t \right)dt}}$$
(2)
$$C_{80} = 10\log \frac{{\int\nolimits_{0}^{80\,ms} {p^{2} } \left( t \right)dt}}{{\int\nolimits_{80\,ms}^{\infty } {p^{2} } \left( t \right)dt}}$$
(3)
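In discrete form, Eqs. (2) and (3) reduce to splitting the squared-pressure sum of a sampled IR at the 50 ms (or 80 ms) mark. A minimal C++ sketch, assuming a mono IR held in a std::vector:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Clarity (dB) from a sampled impulse response, following Eqs. (2)-(3):
// the ratio of the energy arriving within the first splitSec seconds to
// the energy arriving afterwards. splitSec = 0.050 gives C50, 0.080 gives C80.
double Clarity(const std::vector<double>& ir, double sampleRate,
               double splitSec = 0.050)
{
    const std::size_t split = static_cast<std::size_t>(splitSec * sampleRate);
    double early = 0.0, late = 0.0;
    for (std::size_t n = 0; n < ir.size(); ++n)
        (n < split ? early : late) += ir[n] * ir[n];
    return 10.0 * std::log10(early / late);
}
```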

3 Objectives

This study is part of a broader project on developing augmented audio reality applications for (historical) soundscape design, and it aims to find the appropriate tools to support spatial audio rendering in historical reconstructions. The main aim is to assess the spatial sound rendering performance of UE4 and Wwise. The assessment, focused on free-field propagation and reverberation, is based on comparisons with a static acoustics simulation software, Odeon (Naylor 1993). All the results and discussions presented here apply only to UE4 version 4.18 and Wwise version 2018.1.10.6964. First, the primary acoustic phenomena, such as geometrical divergence attenuation, air absorption and reverberation, were taken into account; then, more complex room acoustics phenomena, such as early reflections and late reverberations, were also evaluated. Other important acoustic phenomena, such as localization, diffraction, occlusion, transmission, scattering and directivity, were not considered at this stage. A summary of the tests conducted in the study can be seen in Fig. 2.

Fig. 2 A summary of the conducted tests in the study

4 Sound propagation in UE4-Wwise

The details and results of the analysis of UE4 and Wwise's performance in simulating sound propagation are presented in the following sections. The analysis is divided into two parts: (1) direct sound simulation, covering geometrical divergence and air absorption attenuation, and (2) reverberation simulation, devoted to Wwise's convolution and real-time ray tracing methods.

4.1 Direct sound simulation

In free-field conditions, two important factors affect sound propagation: the distance between the source and receiver, and the absorption of the transmission medium through which the sound propagates. The calculation of these effects is standardized in ISO 9613-2, Acoustics: Attenuation of sound during propagation outdoors (ISO 9613-2 1993). These two main variants of attenuation, described in the standard as Geometrical Divergence Attenuation (Adiv) and Atmospheric Absorption Attenuation (Aatm), are used by UE4, Wwise and Steam Audio to simulate sound propagation under different physical conditions.

Considering free-field sound propagation, UE4 allows simulating Geometrical Divergence Attenuation (Adiv) through six different attenuation functions: Linear, Logarithmic, Natural Sound, Log Reverse, Inverse and Custom. These functions are intended to meet the various needs of game audio design. The functions, obtained from the source code of UE4's audio engine, are listed below (Eqs. 4–8); a condensed C++ sketch of these curves is given after the definitions. All of them apply at distances greater than the inner radius and smaller than the falloff distance (see the definitions below). It should also be noted that UE4 uses a linear scale for sound volume instead of a dB scale: the functions return the new sound level at distance d as a linear multiplier.

$$Linear=1.0-\left(\frac{d}{fod}\right)$$
(4)
$$Logarithmic=0.5\left(-\mathrm{log}\left(\frac{d}{fod}\right)\right)$$
(5)
$$Inverse=0.02/\left(\frac{d}{fod}\right)$$
(6)
$$LogReverse=1.0+0.5\,\mathrm{log}\left(1-\frac{d}{fod}\right)$$
(7)
$$NaturalSound={10.0}^{\left(\left(\frac{d}{fod}\right)\cdot \left(\frac{dBMax}{20}\right)\right)}$$
(8)

where:

f: attenuation scale factor, \(f=\frac{Falloff\,Distance}{Inner\,Radius}\), used here to characterize the two test configurations below;

Inner Radius: minimum distance from the sound source at which attenuation is applied;

fod: Falloff Distance, the maximum distance from the sound source at which attenuation is applied;

d: current distance of the receiver from the sound source;

dBMax: maximum attenuation, in dB, at the Falloff Distance.
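For reference, the curves of Eqs. (4)–(8) can be condensed into a few lines of C++. This is a sketch reconstructed from the formulas as reported above, not verbatim engine source:

```cpp
#include <cmath>

// Sketch of the UE4 attenuation curves of Eqs. (4)-(8), reconstructed from
// the formulas above (not verbatim engine source). Each function returns a
// linear volume multiplier for the current distance d, valid between the
// inner radius and the falloff distance fod.
struct Ue4Attenuation
{
    double fod;    // falloff distance [m]
    double dBMax;  // attenuation at the falloff distance [dB]; assumed
                   // negative (e.g. -60) so the multiplier decays with distance

    double Linear(double d) const      { return 1.0 - d / fod; }
    double Logarithmic(double d) const { return 0.5 * -std::log(d / fod); }
    double Inverse(double d) const     { return 0.02 / (d / fod); }
    double LogReverse(double d) const  { return 1.0 + 0.5 * std::log(1.0 - d / fod); }
    double NaturalSound(double d) const
    {
        return std::pow(10.0, (d / fod) * (dBMax / 20.0));
    }
};
```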

The free-field condition was modelled in UE4 to measure its compliance with the physical law of geometrical divergence attenuation. The measurement layout consists of one sound source (S) and 14 receivers (R) positioned at distances from 1 to 8192 m, doubling the distance at each step (Fig. 3).

Fig. 3 A symbolic representation of UE4's attenuation test setup (distances in metres, on a logarithmic scale)

The attenuation measurements were carried out in the anechoic chamber of the Department of Architecture and Industrial Design using a workstation (Intel Core i7 2.9 GHz CPU; Radeon Pro 560 4 GB and Intel HD Graphics 630 1536 MB graphics cards). The sound stimuli were played back via headphones (Beats Solo 3 Wireless Headphone A1795) and recorded with a calibrated Mk1 Cortex Manikin connected to a Symphonie 01 dB soundcard and the dBTrig software. The sound source's wave file (44.1 kHz, 16 bit) used for the test was white noise.

Two different attenuation scale factors, f1 = 100/1 and f2 = 1000/1, were considered to understand their effect. In both cases, the inner radius was kept at 1 m while the falloff distance was changed from 100 to 1000 m. The average (left/right) sound level produced by the UE4 functions Linear, Logarithmic, Inverse, Natural (60 dB), Natural (40 dB) and Log Reverse was considered at each receiver point. For the Natural function, the maximum attenuation applied at the falloff distance must be set; 60 dB and 40 dB were used for f1 and f2, respectively. The inner radius was set at 1 m because of the near-field effect (Spagnol et al. 2015). The measured sound pressure levels (SPL) of the five attenuation functions at the different distances were compared with the physical spherical free-field decay and are plotted in Fig. 4 (for scale factors f1 and f2).

Fig. 4 Comparison of the measured SPLs for six UE4 attenuation functions and the spherical free-field decay, as a function of distance (f1 on the left, f2 on the right)

For the Logarithmic, Log Reverse and Linear functions, the sound levels beyond the falloff distances were due only to the background noise level in the chamber (Leq = 25.2 dB). At first glance, the graphs show that none of these attenuation functions is in line with physical sound propagation. Only the Inverse function shows a section parallel to the physical decay, after being stable at the beginning. Comparing the two scale factors, all UE4 attenuation functions except Inverse show their effect close to the falloff distance. The sudden drops just before the falloff distance in the Log Reverse, Linear and Logarithmic functions make the falloff distance all the more critical.

Wwise, on the other hand, provides an easily editable curve configuration interface, though there is no ready-made choice to apply a physical curve. However, any curve can be shaped in the provided Attenuation Editor and applied to sound sources as required. It is also possible to add a low- or high-pass filter in the same editor, which can mimic the air absorption effect.

The Steam Audio occlusion plugin allows modelling occlusion, air absorption, physically based attenuation and directivity. To use it, an occlusion settings asset must be created in UE4. Default functions are provided by Steam Audio, and the physics-based attenuation and air absorption options can each be activated with a single checkmark. According to the Steam Audio user guide, "When checked, the physics-based distance attenuation (inverse distance falloff) is applied to the audio and frequency-dependent, distance-based air absorption is applied to the audio. Higher frequencies are attenuated more quickly than lower frequencies over distance" (Valve Corp. 2017). Both of these options were assessed in turn. The occlusion mode, which serves to simulate sound absorption by solid objects, was not assessed in this study, but Steam Audio provides two occlusion methods, Raycast and Partial, which can be frequency dependent or independent.

The same setup as in the UE4 geometrical divergence attenuation test was used to evaluate the Wwise Attenuation asset, the Steam Audio physical attenuation feature, and a blueprint (BP) script written by the authors to imitate distance-based sound attenuation (Fırat 2021). The blueprint scripting feature of UE4 was used as a third approach: a BP function was written to adjust the output volume based on the receiver-source distance (a C++ equivalent of this logic is sketched below). The performance of the distance-based attenuations applied by Wwise, Steam Audio and the BP script is presented in Fig. 5.
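The physical law such a script imitates is the inverse distance law for a point source, a 6 dB drop in SPL per doubling of distance. A minimal C++ sketch of that logic (the authors' actual blueprint is not reproduced here; the 1 m reference distance is an assumption matching the inner radius of the tests):

```cpp
#include <cmath>

// Geometrical divergence for a point source: the SPL drops by
// 20*log10(d/dRef) dB, i.e. 6 dB per doubling of distance.
// Returns the linear volume multiplier relative to the level at dRef.
double SphericalSpreadingGain(double d, double dRef = 1.0)
{
    if (d <= dRef) return 1.0;  // inside the reference/inner radius: no attenuation
    const double dropDb = 20.0 * std::log10(d / dRef);
    return std::pow(10.0, -dropDb / 20.0);  // algebraically equal to dRef / d
}
```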

Fig. 5 Distance-based attenuation curves of different methods

The results show that, except for the near-field effect observed within the first 2 m, the three methods deviate only slightly from the physical distance attenuation curve. Contrary to UE4's built-in functions, all three options yield results consistent with the physical laws.

4.1.1 Air absorption attenuation (Aatm)

In UE4, Air Absorption Attenuation (Aatm) is simulated from the source-receiver distance and an unspecified Absorption Method (Linear or Custom). The max-min range values and the low-pass cut-off frequency must be defined to use it. Whilst the physical calculation of air absorption is based on temperature, humidity and pressure, it is possible to say, even without quantitative examination, that this feature is not based on any physical rules (Masullo et al. 2018a). It might work similarly to a low-pass filter, but, as previously mentioned, this is not specified. Experiments carried out with several setups showed that the UE4 Air Absorption Attenuation does not produce any change in the sound pressure level, even at long distances.

As for Wwise, instead of including air absorption or providing full control of the frequency domain, it similarly provides a low-pass filter (LPF) to mimic the air absorption effect at high frequencies. This LPF is driven by the distance and by a percentage of the output volume. It was assessed together with the default option of Steam Audio. The same setup as in the geometrical divergence attenuation test was used for the air absorption test; in this case, the sound sources were pure tones at the centre frequencies of the octave bands. Measurements taken for each frequency band from 63 Hz to 8 kHz show that, like Wwise, Steam Audio implements filters that are not coherent with the curves defined in ISO 9613 for different air conditions. The graphs in Fig. 6 show how both cases differ from the physical air absorption.

Fig. 6 Performance of the air absorption effects of Wwise and Steam Audio, and the physical air absorption at 20 °C and 70% humidity (left: Wwise LPF; middle: Steam Audio air absorption effect; right: physical curve)

In the case of Wwise, the tested LPF was the default curve provided in the attenuation asset. The results for Wwise did not reveal any real differences between the frequency bands up to a critical distance. Steam Audio follows curves more similar to the physical air absorption attenuation, which is presented in the last graph, but the behaviour across frequencies is not homogeneous. It also behaves unexpectedly up to 150–200 m, where there is an immediate increase at the high frequencies, except at 8 kHz, where there is a decrease. In addition, the drop in sound pressure level is not coherent with the physical air absorption attenuation in either Wwise or Steam Audio.
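For comparison, the physical curve in Fig. 6 follows the pure-tone atmospheric attenuation of ISO 9613-1, which depends on temperature, humidity and pressure through the relaxation frequencies of oxygen and nitrogen. A C++ sketch of that calculation, written from the standard's published formulas (values should be verified against the standard before serious use):

```cpp
#include <cmath>

// Pure-tone atmospheric absorption coefficient [dB/m] after ISO 9613-1.
// f: frequency [Hz], tempC: temperature [C], relHum: relative humidity [%],
// pressureKPa: ambient pressure [kPa] (reference: 101.325 kPa).
double AirAbsorptionDbPerMetre(double f, double tempC, double relHum,
                               double pressureKPa = 101.325)
{
    const double T   = tempC + 273.15;            // absolute temperature [K]
    const double T0  = 293.15;                    // reference temperature [K]
    const double T01 = 273.16;                    // triple point of water [K]
    const double pr  = pressureKPa / 101.325;     // relative pressure

    // Molar concentration of water vapour [%]
    const double psat = std::pow(10.0, -6.8346 * std::pow(T01 / T, 1.261) + 4.6151);
    const double h    = relHum * psat / pr;

    // Relaxation frequencies of oxygen and nitrogen [Hz]
    const double frO = pr * (24.0 + 4.04e4 * h * (0.02 + h) / (0.391 + h));
    const double frN = pr * std::pow(T / T0, -0.5) *
        (9.0 + 280.0 * h * std::exp(-4.17 * (std::pow(T / T0, -1.0 / 3.0) - 1.0)));

    return 8.686 * f * f *
           (1.84e-11 / pr * std::sqrt(T / T0) +
            std::pow(T / T0, -2.5) *
                (0.01275 * std::exp(-2239.1 / T) * frO / (frO * frO + f * f) +
                 0.1068  * std::exp(-3352.0 / T) * frN / (frN * frN + f * f)));
}
// Example: at 20 C and 70 % humidity, an 8 kHz tone loses roughly
// 0.08 dB/m, i.e. about 15 dB over 200 m.
```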

4.2 Reverberation simulation

UE4 uses a parametric approach to the modelling of reflections rather than a physical one. It is based on a set of reverb asset parameters that allow users to define their own reverb configuration. However, almost none of the parameters in the UE4 reverb asset is used in room acoustic studies, which, together with the insufficient documentation available, makes it relatively cumbersome to create or mimic a measured or calculated impulse response (in the latest version, UE 4.25, a convolution reverb is provided with the Synthesis and DSP Effects plugin). The Decay Time was identified as the main factor governing reverberation time in UE4, and its effect on the duration of the reverberation was investigated. The IR measurements were carried out for decay time values from 1 to 20 while keeping the other parameters at their default values. A shoebox-shaped reverb volume, which is necessary to define the region where the reverb is applied, was used. Neither the volume of the room nor the receiver's position affects the UE4 reverberation; the same reverberation is produced everywhere inside the reverb volume. The IR measurements taken to understand the effect of the decay time parameter are presented in Fig. 7. The results show that, up to 5.0 s, the decay time is in line with the reverberation time and gives a flat frequency response (Fig. 7).

Fig. 7 The reverberation measurement results for the UE4 reverb asset's decay time parameter (Masullo et al. 2018b)

At values greater than 5.0 s, the decay time curves start to diverge across frequencies: the lower frequencies continue to rise in response to changes in the decay time value.

On the other hand, following conventional room acoustic studies, Wwise implements reflection modelling in three parts that are considered separately: DS, ER and LR (Alary 2017a). Three different audio buses have to be composed to simulate sound propagation in Wwise, in line with Fig. 8. The audio buses and auxiliary buses help organize the delivery of the sound mix; the auxiliary buses are used to adjust volume, channel configuration and positioning, and to apply Effects, States or mixer plug-ins (Audiokinetic 2018a, p. 200). Each auxiliary bus carries different effects and plug-ins to imitate DS, ER and LR. As seen in the audio bus flow in Fig. 8, the master audio bus on the anechoic path carries only the Direct Sound's attenuation effect, the Late Reflection auxiliary bus hosts the Wwise Convolution Reverb plug-in, and the Early Reflection auxiliary bus hosts the Wwise Reflect plug-in. Before the final output, the mixed sound is binauralized in the Master Audio Bus with the help of the Aura headphone plug-in.

Fig. 8 Wwise audio bus flow

In contrast to UE4, Wwise has four different options for configuring the reverberation in these auxiliary buses. Two of them, Wwise Matrix Reverb and Wwise RoomVerb, are still parametric, just like the UE4 Reverb Asset; Audiokinetic suggests using these two for long reverberations (Audiokinetic 2018a). RoomVerb is an application based on the feedback delay network (FDN) algorithm developed by Jean-Marc Jot in the 1990s (Jot and Chaigne 1991). While Matrix Reverb is the simplest and lightest of the reverb effects, RoomVerb introduces some early reflection patterns to give more control over the early part of the reverberation (Alary 2017b). Additionally, Wwise provides two more plug-ins to improve reverberation: Wwise Reflect, which is dedicated to early reflections, and Wwise Convolution Reverb, which targets late or entire reverberations. These two plug-ins are described in detail and analysed in this paper; the two parametric plug-ins were not considered because their approach is incompatible with PBR.
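To make the FDN idea concrete, the sketch below implements a minimal four-line network in the spirit of Jot's design: mutually prime delay lengths, an energy-preserving Householder feedback matrix, and a single feedback gain controlling the decay. It is an illustration of the algorithm family, not Wwise's RoomVerb implementation:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Minimal 4-line feedback delay network (FDN), illustrating the algorithm
// family behind RoomVerb (Jot and Chaigne 1991); not Wwise's implementation.
class FeedbackDelayNetwork
{
public:
    explicit FeedbackDelayNetwork(double feedbackGain) : g(feedbackGain)
    {
        for (std::size_t i = 0; i < N; ++i)
            lines[i].assign(lengths[i], 0.0);   // mutually prime delay lengths
    }

    double Process(double in)                   // one sample in, one sample out
    {
        std::array<double, N> out;
        double sum = 0.0;
        for (std::size_t i = 0; i < N; ++i) {
            out[i] = lines[i][pos[i]];          // read the delayed samples
            sum += out[i];
        }
        for (std::size_t i = 0; i < N; ++i) {
            // Householder feedback matrix A = I - (2/N) * ones(N,N):
            // orthogonal, so the loop energy is controlled by g alone.
            const double fb = out[i] - (2.0 / N) * sum;
            lines[i][pos[i]] = in + g * fb;
            pos[i] = (pos[i] + 1) % lines[i].size();
        }
        return sum / N;
    }

private:
    static constexpr std::size_t N = 4;
    static constexpr std::size_t lengths[N] = {1499, 1889, 2269, 2707}; // samples
    std::array<std::vector<double>, N> lines;
    std::array<std::size_t, N> pos{};
    double g;   // feedback gain < 1; larger g gives a longer decay time
};
```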

4.2.1 Wwise convolution reverb

The Wwise Convolution Reverb plug-in provides the only appropriate way to create reverberation consistent with PBR, using a convolution-based reverberation approach. With Wwise Convolution Reverb, it is possible to import calculated or measured Impulse Responses (IRs) into a project; the imported IR can have any channel configuration (mono, stereo, ambisonics, etc.). It should be noted that the three audio buses must be combined in Wwise, as shown in Fig. 8, to obtain a physically based dynamic sonic environment. Due to the static nature of the convolution process, the Wwise Convolution Reverb plug-in does not produce any acoustical change in response to the user's location or rotation; only distance-based attenuation or panning manipulations can be applied to the sound to render the effect of head rotation or bodily movement (Audiokinetic 2018b).
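The core of any convolution reverb is the discrete convolution of the dry signal with the imported IR. Production engines use partitioned FFT convolution for real-time performance, but the operation they compute is simply:

```cpp
#include <cstddef>
#include <vector>

// Direct-form convolution of a dry signal with an impulse response.
// O(N*M), so offline-only; real-time convolution reverbs use partitioned
// FFT convolution but compute exactly the same result.
std::vector<double> Convolve(const std::vector<double>& dry,
                             const std::vector<double>& ir)
{
    std::vector<double> wet(dry.size() + ir.size() - 1, 0.0);
    for (std::size_t n = 0; n < dry.size(); ++n)
        for (std::size_t k = 0; k < ir.size(); ++k)
            wet[n + k] += dry[n] * ir[k];
    return wet;
}
```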

4.2.2 Wwise reflect

This plug-in is used for the simulation of early reflections. As mentioned in the previous section, because of the static nature of using calculated or measured IRs, Wwise developed another plug-in, Wwise Reflect, which uses the "image source technique" for a shoebox room to calculate early reflections. Since version 2019.2.0 of Wwise, it is possible to create rooms of complicated shapes based on meshes (Audiokinetic 2020a). It is not a complete geometrical acoustics calculation tool, because CPU usage must be limited; for this reason, the calculation is limited to a maximum reflection order of four. Wwise Reflect uses an image-source method and implements a multi-tap, time-varying delay line with filters to spatialize the early reflections (Audiokinetic 2018c). Since version 2019.2.0, a stochastic ray-tracing method is used in Wwise to compute reflections and diffractions (Audiokinetic 2020a; Buffoni 2020). To use Wwise Reflect, the architectural shape of the environment has to be defined with Audio Volumes, and Acoustic Textures have to be assigned to the surfaces of these volumes (Keklikian 2017a, b). In Wwise, the absorption level of an Acoustic Texture is set for the four absorption bands shown below:

  • Low: < 250 Hz

  • Mid Low: > 250 Hz and < 1000 Hz

  • Mid High: > 1000 Hz and < 4000 Hz

  • High: > 4000 Hz

Wwise Reflect has a maximum reflection order of four, which makes it focus on just the early part of the IR; a sketch of the underlying image-source idea is given below. It is not a complete acoustic calculation tool on its own and has to be used together with other reverberation plug-ins. It is designed to improve the quality of the produced spatial audio with only a limited extra load on the CPU, and it aims to give an impression of immersion by providing dynamic changes in the early reflections, which are the more critical and valuable part because of their directional information (Audiokinetic 2018c; Simmonds 2019).
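To make the shoebox image-source technique concrete, the sketch below enumerates image sources for an axis-aligned room, in the manner of Allen and Berkley (1979), and returns the delay, gain and order of each reflection path. It is a simplified, frequency-independent illustration with a single wall reflection coefficient, not Wwise Reflect's implementation:

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

struct Reflection { double delaySec; double gain; int order; };

// Image-source enumeration for an axis-aligned shoebox room with one corner
// at the origin (after Allen and Berkley 1979). Frequency-independent sketch:
// beta is a single pressure reflection coefficient shared by all six walls.
std::vector<Reflection> ImageSources(const double L[3], const double src[3],
                                     const double rcv[3], double beta,
                                     int maxOrder, double c = 343.0)
{
    std::vector<Reflection> out;
    for (int nx = -maxOrder; nx <= maxOrder; ++nx)
    for (int ny = -maxOrder; ny <= maxOrder; ++ny)
    for (int nz = -maxOrder; nz <= maxOrder; ++nz)
    for (int q = 0; q < 8; ++q) {                     // 8 mirrorings per cell
        const int cell[3] = { nx, ny, nz };
        double d2 = 0.0;
        int order = 0;
        for (int a = 0; a < 3; ++a) {
            const bool mirrored = (q >> a) & 1;
            const double img = (mirrored ? -src[a] : src[a]) + 2.0 * cell[a] * L[a];
            d2 += (img - rcv[a]) * (img - rcv[a]);
            // Wall hits contributed along this axis by this image:
            order += mirrored ? std::abs(2 * cell[a] - 1) : 2 * std::abs(cell[a]);
        }
        if (order == 0 || order > maxOrder) continue; // skip direct / too high
        const double dist = std::sqrt(d2);
        out.push_back({ dist / c,                       // arrival delay [s]
                        std::pow(beta, order) / dist,   // wall loss + 1/r spreading
                        order });
    }
    return out;
}
```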

5 Case studies

Further computations dealt with the simulation of early reflections and late reverberation. Four 3D models, shown in Table 1, were used as case studies for the experimental sessions on reflections: a simple box, a large box, a classroom and a conference hall. All the case studies were first modelled in Odeon and then in UE4. Recent studies have shown that the algorithms used in static room acoustic simulation software generate mostly plausible, though not authentic, auralizations in perceptual assessments (Brinkmann et al. 2019). Nevertheless, Odeon and its competitors remain the best choices for reliable acoustic simulations (Krokstad et al. 2015); Odeon was therefore used as the benchmark in this study. The source-receiver plans for each model can be seen in the Appendix. The simple box and the large box are cubes with sides of 4 m and 40 m, with the same absorption coefficient on all surfaces; two absorption levels, 0.10 and 0.20, were tested for both cubes. The classroom, which is planned to be constructed on the new campus of ITU in Northern Cyprus, was modelled with an optimal reverberation time of 0.45 s on average, as specified by DIN 18041:2004-05 for small to medium-sized rooms (DIN 2005). The conference hall is an actual room of the Department of Architecture and Industrial Design.

Table 1 Case studies

All the case studies listed in Table 1 were used to analyse the performance of Wwise Convolution Reverb (UE4-Wwise). The case studies were chosen to cover the geometrical complexity of the environment from simple to complex. The number of receivers and their positions were determined so as to represent the rooms' general acoustical characteristics with the smallest number of receiver points; all receiver points lie beyond the critical distance. After calculating the IRs in Odeon for each source-receiver configuration, the same IRs were imported into Wwise Convolution Reverb for the auralization of the virtual environment. A Dirac delta function was used for the auralizations. Finally, the auralizations carried out with Odeon and with UE4-Wwise (through Wwise Convolution Reverb) were analysed and compared with the dBTrait software. The auralized wave files from Wwise were obtained through the Wwise Record plug-in.

The analysis of the Wwise Convolution Reverb plug-in was carried out on the Reverberation Time (T30) (see Fig. 9) and Clarity (C50) (see Fig. 10) parameters. The results were compared with the Odeon auralization for the frequencies from 63 to 8000 Hz. In the graphs, the x-axis shows the Odeon values and the y-axis the Wwise values, each point representing a single receiver. The identity line (y = x) on each graph represents the study's objective: Wwise values close to the Odeon values, which are taken as the theoretically correct values. For each case, the correlation coefficient between the two variables was computed, together with the errors, calculated as the deviations of the observed Wwise Convolution Reverb values, and the maximum residuals, which express the largest deviation from the Odeon values among the receivers.

Fig. 9 Odeon versus UE4-Wwise comparisons according to T30 values

Fig. 10 Odeon versus UE4-Wwise comparisons according to C50 values

The results show that Wwise Convolution Reverb performs static reverberation successfully enough, especially in the mid-frequency range. The correlation coefficients were 0.99 for all frequencies except 63 Hz and 8 kHz, which resulted in 0.96 and 0.97, respectively. The decay curves in Odeon and Wwise Convolution Reverb are relatively close when the same impulse responses are used for the auralization. On the other hand, the results are not exactly the same, as might be expected from two static convolution processes. Even if this is a good starting point for this type of VR audio tool, it is still necessary to analyse each IR measurement's echogram and the differences between the early and late reflections. This was done by analysing the Clarity (C50) values; the results obtained with Wwise are compared in Fig. 10 with those of the Odeon auralizations.

Although the results in Fig. 10 show that the Wwise Convolution Reverb values are very close to the Odeon values, they are not identical. The correlation coefficients (0.95, 0.94, 0.96, 0.99, 0.96, 0.98, 0.95, 0.86) attest to the same evaluation. The Wwise and Odeon values are closer at the mid frequencies, where the points cluster, while deviations are more salient at the low (63–125 Hz) and high (8 kHz) frequencies. Two sample IRs were analysed to examine this difference in detail; the IRs of two receiver points inside the conference hall case are plotted in Fig. 11.

Fig. 11 Left channel of the BRIR of the conference hall case with both Odeon and the Wwise Convolution Reverb plug-in

Figure 11 shows that, despite being very similar, the Odeon and Wwise convolutions are not identical, contrary to what would be expected of a static convolution process. It is hard to comment on this mismatch: according to Audiokinetic's documentation, "certain settings are applied offline to the original displayed impulse response file", and every time one of the parameters changes, a new IR is created. It can be said that Wwise applies some type of black-box post-processing even with the default settings of the Wwise Convolution Reverb plug-in. Even though Wwise Convolution Reverb may be considered successful in view of these results, it is not sufficient on its own to create immersive dynamic virtual environments, because of the static nature of convolution. While it is possible to steer one IR towards different directions and locations by interpolating it through HRTFs or other spatializing effects, real-time calculations are necessary to obtain more reliable results in ecologically valid virtual sonic environments.

The Wwise Reflect plug-in aims to simulate spatialized early reflections with real-time calculations. However, as previously mentioned, because Wwise Reflect is limited to a maximum reflection order of four, it can only calculate the very early portion of the IR; it focuses on the early reflections, which are the more critical part with regard to localization information. That is why it is impossible to compare Wwise Reflect with Odeon without changing Odeon's maximum reflection order calculation parameter. Even though it is possible to combine the two previous functions, Wwise Reflect for the early reflections and Wwise Convolution Reverb for the late reflections, with the aim of obtaining dynamic spatialization, it is not an easy task: using these plug-ins together causes an overlap between the two. As Wwise does not provide any automated way to solve this issue, the only correct way is to apply a filter that cuts off the early part of the IR in Wwise Convolution Reverb. This filter method does not rely on any physical phenomenon and requires meticulous work on each IR, which must be carried out for exact dynamic 3D sound reproduction through Wwise.
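A minimal sketch of that preprocessing step is shown below; the 50 ms split point and the 5 ms fade are illustrative choices, not values prescribed by Wwise:

```cpp
#include <cstddef>
#include <vector>

// Zero out the early part of an IR so that a convolution reverb renders
// only the late field, leaving the early reflections to a real-time
// plug-in. A short linear fade avoids a discontinuity at the splice.
void RemoveEarlyPart(std::vector<float>& ir, double sampleRate,
                     double splitSec = 0.050, double fadeSec = 0.005)
{
    const std::size_t split = static_cast<std::size_t>(splitSec * sampleRate);
    const std::size_t fade  = static_cast<std::size_t>(fadeSec * sampleRate);
    for (std::size_t n = 0; n < ir.size() && n < split + fade; ++n) {
        if (n < split)
            ir[n] = 0.0f;                                    // drop early part
        else
            ir[n] *= static_cast<float>(n - split) / fade;   // fade the late part in
    }
}
```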

To test Wwise Reflect's performance more reliably, the same calculation parameters as in Wwise Reflect were used in Odeon, keeping the reflection order at four and re-calculating the conference hall model. The Dirac impulse was used in this experiment. The left channel of the simulated BRIR of receivers 1 and 4, in both Odeon and Wwise, can be seen in Fig. 12; more detailed graphs of the first 100 ms are presented next to the echogram (energy versus time) of each receiver.

Fig. 12 Left channel of the simulated BRIR of the conference hall case with both Odeon and Wwise (A) and a zoom into the first 100 ms (B)

The graphs show that Odeon is denser in terms of reflections. Although most of the peaks are close in time, the general level of the reflections in Wwise Reflect is lower than in Odeon. Moreover, many isolated peaks and dips can be seen in Wwise Reflect, especially the valleys between the peaks, which produce a non-diffuse IR that could not belong to a natural environment. Considering that the reflection orders were the same, the difference between Wwise and Odeon is probably due to the number of transmitted rays and to the performance in simulating scattering. Wwise does not provide any control over the number of rays; it only offers a setting for the maximum source-receiver path length. On the other hand, as the distance between source and receiver increases (R1 to R4), i.e. as the receiver gets closer to the room's boundaries, the concentrated rays become more visible in Wwise Reflect. This analysis shows that, besides several other acoustic phenomena that must be considered, one of the primary drawbacks of Wwise in computing real-time reflections is the number of transmitted rays and, correspondingly, the computational power reserved for real-time auralizations. One of the latest versions of Wwise, 2021.1, is considerably improved in these respects and provides control over the number of transmitted rays for the early reflection calculation (Audiokinetic 2020b).

6 Discussion

The analysis of the audio design capabilities of UE4 and Wwise showed the limits of some of the basic audio functions of one of the most popular game engines, UE4. It is not able to comprehensively describe all the acoustic phenomena that occur during sound propagation; it may be deduced that the designers of this platform have disregarded the PBR idea for sound, prioritizing game design needs. Yet it is still possible to reproduce some of the physical laws by writing specific functions through C++ coding and blueprint scripting. The summarized measurements showed that the UE4 game engine and its provided sound assets are insufficient to simulate real-world cases without other plug-ins; this makes it impossible to use UE4 as a platform for physically based acoustic scenarios without preparing particular workarounds or working in C++ or BP scripting. On the other hand, Audiokinetic's Wwise provides several enhancements over the UE4 audio features: it allows adequate control over the attenuation function and can import impulse responses from simulations and measurements into a virtual environment. The measurement-based comparison between Odeon and Wwise (Convolution and Reflect) showed that Wwise generally obtains good results. However, there is still much to be improved, especially the performance at low (< 125 Hz) and high (> 8000 Hz) frequencies. Wwise Reflect is a useful tool to compensate for Wwise Convolution Reverb's inability to create dynamic changes, but it should be reconsidered: the possibility of using a larger number of rays, a higher reflection order and scattering features should be provided. Moreover, its use is still based on experiential rules, such as filtering out the early part of the IR in Wwise Convolution Reverb.

7 Conclusion

Wwise may not be a solution for acoustic calculations in the near future, but it can be used to present 3D virtual sonic environments to subjects when supported by other methods, such as measurements or static acoustic calculation software. This can be done by objectively and automatically filtering out the first part of an imported IR. As this study suggests, control over the number of transmitted rays can increase the success of Wwise Reflect, something Audiokinetic has already reconsidered in the latest versions of Wwise (Audiokinetic 2020b). Increasing the CPU capacity reserved for acoustic calculations in game engines would lead towards a future of virtual acoustics built on similar audio middleware. Despite some disadvantages, the future applications of this new audio technology are promising. The improved spatialization capabilities of these tools open up a discussion on the future of procedural audio design methods. As the study suggests, new virtual reality design tools should consider including default functions that imitate physical rules based on scientific parameters, instead of merely providing randomization parameters to increase the level of interactivity. There is also a need for more data on user experience of spatial audio; as Beig et al. (2019) put it, "we may be processing sound in a physically realistic but perceptually vacuous way". Subjective experiments should be carried out to compare the perceived differences among the different setups of UE4, UE4-Wwise and Odeon, and to identify the Just Noticeable Differences for virtual acoustics parameters through listening and audio-visual modalities. This study was limited to the attenuation and reverberation functions of UE4 and Wwise; other fundamental acoustic phenomena, such as diffraction, transmission, scattering, localization and directivity, must be investigated too. As mentioned, not just objective acoustic parameters but also subjective tests must be considered.