1 Introduction

Artificial intelligence (AI) has gained widespread attention in different areas of research including computer vision, natural language processing (NLP), robotics, healthcare, and especially in audio signal processing. Audio processing covers many diverse fields including speech, music and environmental sound processing. In all these areas, AI techniques are playing crucial roles in designing audio-based intelligent systems (Purwins et al. 2019). One of the prime goals of AI is to create fully autonomous audio-based intelligent agents that can listen or interact with their environments to improve their behaviour over time through trial and error. Designing such autonomous systems has been a long-standing problem, ranging from robots that can react to the changes in their environment, to purely software-based agents that can interact with humans using natural language and multimedia. Reinforcement learning (RL) (Sutton et al. 1998) represents a principled mathematical framework of such experience-driven learning. Although RL had some successes in the past (Kohl and Stone 2004; Ng et al. 2006; Singh et al. 2002), previous methods were inherently limited to low-dimensional problems due to lack of scalability. Moreover, RL also has issues of memory, computational and sample complexity—in the case of learning algorithms (Strehl et al. 2006). Recently, deep learning (DL) models have risen as new tools with powerful function approximation and representation learning properties to solve these issues.

The advent of DL has dramatically improved state-of-the-art performance and significantly impacted many areas, from transportation to health and from social science to biology. Deep models such as deep neural networks (DNNs) (Hinton et al. 2012; Mohamed et al. 2009), convolutional neural networks (CNNs) (LeCun et al. 1989), and long short-term memory (LSTM) networks (Hochreiter and Schmidhuber 1997) have also enabled many practical applications by outperforming traditional methods in audio signal processing. The use of DL algorithms within RL has accelerated the progress of RL and given rise to the field of deep reinforcement learning (DRL). DRL embraces the advancements in DL to improve the learning processes, performance and speed of RL algorithms. This enables RL to operate in high-dimensional state and action spaces and to solve complex problems that were previously unsolvable. Inspired by previous works such as Lange et al. (2012), two outstanding works kick-started the revolution in DRL. The first was the development of an algorithm that could learn to play Atari 2600 video games directly from image pixels at a superhuman level (Mnih et al. 2015). The second success was the design of the hybrid DRL system, AlphaGo, which defeated a human world champion in the game of Go (Silver et al. 2016). In addition to playing games, DRL has also been explored in a wide range of applications such as computer vision (Le et al. 2021), natural language processing (Naeem et al. 2020), control policies for robotics (Levine et al. 2016), generalisable agents in complex environments with meta-learning (Duan et al. 2016; Wang et al. 2016), indoor navigation (Zhu et al. 2017), and many more (Arulkumaran et al. 2017). In particular, DRL is also gaining increased interest in audio signal processing.

Table 1 Comparison of our paper with that of the existing DRL-based surveys

In audio processing, DRL has recently emerged as a tool to address various problems and challenges in automatic speech recognition (ASR), spoken dialogue systems (SDSs), speech emotion recognition (SER), audio enhancement, music generation, and audio-driven controlled robotics. In this work, we therefore focus on covering the advancements in audio processing by DRL. In Fig. 1, we present the cumulative distribution of publications in core DRL and in DRL applied to the audio domain. We note a growing interest in the communities of both core and applied DRL. While core DRL grew from 3 to 4 orders of magnitude from 2015 to 2021, applied DRL grew from 2 to 3 orders of magnitude in the same period.

Fig. 1 Cumulative distribution of publications per year (data gathered from 2015 to 2021), from https://www.scopus.com

There are multiple survey articles on DRL. For instance, Arulkumaran et al. (2017) presented a brief survey on DRL by covering seminal and recent developments in DRL, including innovative ways in which DNNs can be used to develop autonomous agents. Similarly, Li (2017) attempted to provide comprehensive details on DRL and cover its applications in various areas to highlight advances and challenges. Other relevant works include applications of DRL in communications and networking (Luong et al. 2019), human-level agents (Nguyen et al. 2017), and autonomous driving (Sallab et al. 2017). As highlighted in Table 1, none of these articles has focused on DRL applications in audio processing. This paper aims to fill this gap by presenting an up-to-date literature review on DRL studies in the audio domain, discussing challenges that hinder the progress of DRL in audio, and pointing out future research areas. We hope this paper will help researchers and scientists interested in DRL for audio-driven applications.

This paper is organised as follows. A concise background of DL and RL is provided in Sect. 2, followed by an overview of recent DRL algorithms in Sect. 3. With those foundations, Sect. 4 covers recent DRL works in the domains of speech, music, and environmental sound processing, and their challenges are discussed in Sect. 5. Section 6 summarises this review and highlights future pointers for audio-based DRL research, and Sect. 7 concludes the paper.

2 Background

2.1 Deep learning (DL)

Deep neural networks (DNNs) have been shown to produce state-of-the-art results in audio and speech processing due to their ability to distil compact and robust representations from large amounts of data. The first major milestone was a significant increase in the accuracy of large-scale automatic speech recognition (ASR) using fully connected DNNs and deep autoencoders around 2010 (Hinton et al. 2012). DL focuses on using DNNs with multiple nonlinear modules arranged hierarchically in layers to automatically discover suitable representations or features from raw data. These non-linearities allow DNNs to learn complicated manifolds in speech and audio datasets. Various other deep architectures have shown the potential to learn from audio to perform different tasks. Below we discuss these DL architectures, which are illustrated in Fig. 2.

Convolutional neural networks (CNNs) are a kind of feedforward neural network that was specifically designed for processing images (Krizhevsky et al. 2012). However, CNNs have also been applied to various other fields including NLP (Arora et al. 2019), audio processing (Latif et al. 2019), and text analysis (Wang et al. 2019), where they have shown state-of-the-art performance. CNNs consist of a series of convolutional layers interleaved with pooling layers, followed by one or more dense layers. Whilst pooling layers reduce the spatial size of feature maps to decrease the number of parameters, the neurons in dense layers are connected to every neuron in the preceding layer. In contrast to fully connected DNNs, CNNs dramatically limit the number of parameters and the memory requirements by leveraging two key concepts: local receptive fields and shared weights. A local receptive field refers to the region of a layer that is connected to a particular neuron in the next layer (note that the terms receptive field, kernel and filter are used interchangeably). Shared weights, on the other hand, refers to the same weights being used across all receptive fields in the same layer of a CNN, as opposed to each receptive field in the layer having its own set of weights. Recently, CNN-based models have been extensively studied for a variety of audio processing tasks including music onset detection (Schlüter and Böck 2014), speech enhancement (Mamun et al. 2019), ASR (Abdel-Hamid et al. 2014), and speech emotion recognition (Latif et al. 2020). However, for raw audio waveforms with high sample rates, the limited receptive fields of CNNs can result in deteriorated performance. To handle this issue, dilated convolution layers can be used to extend the receptive field by inserting zeros between the filter coefficients (Chang et al. 2018; Chen et al. 2019).
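To make the effect of dilation on the receptive field concrete, the following minimal sketch implements a dilated 1D convolution in plain Python and checks that it matches an ordinary convolution with a zero-inserted kernel. It is an illustration only and is not taken from the cited works; the function names (`dilated_conv1d`, `insert_zeros`, `receptive_field`) and the toy signal are assumptions made for this example, and stride 1 with no padding is assumed throughout.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D cross-correlation with kernel taps `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation + 1  # input samples covered by one output
    return [
        sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


def insert_zeros(kernel, dilation):
    """Equivalent dense kernel: `dilation - 1` zeros between neighbouring taps."""
    dense = []
    for i, w in enumerate(kernel):
        dense.append(w)
        if i < len(kernel) - 1:
            dense.extend([0] * (dilation - 1))
    return dense


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions, one dilation per layer."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


signal = [3, 1, 4, 1, 5, 9, 2, 6]  # toy signal (hypothetical)
kernel = [1, 2, 3]
# The dilated kernel gives the same output as the dense zero-inserted kernel.
assert dilated_conv1d(signal, kernel, 2) == dilated_conv1d(signal, insert_zeros(kernel, 2), 1)
print(receptive_field(3, [1, 1, 1, 1]))  # four ordinary layers: 9 samples
print(receptive_field(3, [1, 2, 4, 8]))  # doubling dilations: 31 samples
```

With the same number of weights per layer, doubling the dilation at each layer makes the receptive field grow exponentially with depth rather than linearly, which is why this design is attractive for waveforms sampled at high rates.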

Fig. 2 Graphical illustration of different DL architectures

Recurrent neural networks (RNNs) follow a different approach for modelling sequential data (Lipton 2015). They introduce recurrent connections to enable parameters to be shared across time, which makes them very powerful in learning temporal structures from input sequences (e.g., audio, video). They have demonstrated their superiority over traditional HMM-based systems in a variety of speech and audio processing tasks (Latif et al. 2020). Due to these abilities, RNN architectures including long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) and gated recurrent unit (GRU) (Cho et al. 2014) networks have had an enormous impact in the speech community and have been incorporated in state-of-the-art audio-based systems. Recently, RNNs have also been extended to include information in the frequency domain besides temporal information, in the form of frequency-LSTMs (Li et al. 2015) and time-frequency LSTMs (Sainath and Li 2016). In order to benefit from both neural architectures, CNNs and RNNs can be combined into a single network with convolutional layers followed by recurrent layers, often referred to as a convolutional recurrent neural network (CRNN). Related works have shown the abilities of CRNNs in ASR (Qian et al. 2016), SER (Latif 2020), music classification (Ghosal and Kolekar 2018), and other audio-related applications (Latif et al. 2020).

Sequence-to-sequence (Seq2Seq) models were motivated by problems involving sequences of unknown lengths (Sutskever et al. 2014). Although they were initially applied to machine translation, they can be applied to many different applications involving sequence modelling. In a Seq2Seq model, one RNN (the encoder) reads the inputs in order to generate a vector representation, while another RNN (the decoder) inherits those learnt features to generate the outputs. Seq2Seq models have been gaining much popularity in the speech community due to their capability of transducing input to output sequences. DL frameworks are particularly suitable for this direct translation task due to their large model capacity and their capability to train in an end-to-end manner, directly mapping the input signals to the target sequences (Zhang et al. 2017; Lu et al. 2016; Liu et al. 2019). Various Seq2Seq models have been explored in the speech, audio and language processing literature including recurrent neural network transducer (RNNT) (Graves 2012), monotonic alignments (Raffel et al. 2017), listen, attend and spell (LAS) (Chan et al. 2016), neural transducer (Jaitly et al. 2016), recurrent neural aligner (RNA) (Raffel et al. 2017), and transformer networks (Pham et al. 2019), among others. In particular, transformer-based models have achieved unprecedented success in numerous speech and audio processing tasks including audio classification (Chi et al. 2021), speaker recognition (Wang et al. 2021), and speech-to-text (Bae et al. 2021), to name a few. These transformer models consist of an encoder-decoder architecture and work by leveraging the multi-head self-attention mechanism to consider the longer-distance context around a word in a computationally efficient way (Vaswani et al. 2017). This not only lets them attend to all the elements in the sequence to boost accuracy, but also lets them harness the parallelism of modern GPUs for faster sequence processing compared to RNNs (Karita et al. 2019). For a more in-depth discussion about applications of transformers in audio processing, we refer interested readers to recent relevant survey papers (Lin et al. 2021; Tay et al. 2020).
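As a rough illustration of the self-attention mechanism mentioned above, the sketch below computes single-head scaled dot-product attention in plain Python. It is a simplification under stated assumptions: there are no learned query, key or value projections, only one head, and no masking or positional encoding, so it should not be read as a full transformer layer. The names `softmax` and `self_attention` are chosen for this example only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections).

    Each output row is a weighted average of `values`, with weights given by
    softmax(q . k / sqrt(d_k)) over all positions in the sequence.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Toy sequence of three 2-dimensional frames (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, attention = self_attention(frames, frames, frames)
for row in attention:
    print([round(w, 3) for w in row])
```

Every position is compared with every other position in one pass, and the rows are independent of each other, which is what allows the computation to be parallelised across the sequence, in contrast to the step-by-step recurrence of an RNN.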

Generative models have been attracting much interest in the audio community due to their ability to learn the underlying audio distribution. Generative adversarial networks (GANs) (Goodfellow et al. 2014), variational autoencoders (VAEs) (Kingma and Welling 2013), and autoregressive models (Shannon et al. 2012) are extensively investigated in the speech and audio processing scientific community. Specifically, they are used to synthesise audio signals, mapping a low-dimensional representation to a high-resolution signal (Hsu et al. 2017; Ma et al. 2019; Latif et al. 2018). The synthesised samples are often used to augment the training material to improve performance (Latif et al. 2020). In the autoregressive approach, new samples are synthesised iteratively, based on an infinitely long context of previous samples via RNNs (for example, using LSTM or GRU networks), but at the cost of expensive computation during training (Wang et al. 2018).

2.2 Reinforcement learning

Reinforcement learning (RL) is a popular paradigm of machine learning (ML) in which agents learn their behaviour by trial and error (Sutton et al. 1998). RL agents aim to learn sequential decision-making by interacting with the environment in which they operate. At time t (0 at the beginning of the interaction, T at the end of an episodic interaction, or \(\infty \) in the case of non-episodic tasks), an RL agent in state \(s_{t}\) takes an action \(a_{t}\in A\), transits to a new state \(s_{t+1}\), and receives reward \(r_{t+1}\) for having chosen action \(a_{t}\). This process, repeated iteratively, is illustrated in Fig. 3.

Fig. 3 Basic RL setting

An RL agent aims to learn the best sequence of actions, known as a policy, to obtain the highest overall cumulative reward in the task (or set of tasks) on which it is being trained. While it can choose any action from a set of available actions, the sequence of actions that an agent takes from start to finish is called an episode. A Markov decision process (MDP) (Bellman 1966) can be used to capture the episodic dynamics of an RL problem. An MDP can be represented using the tuple (S, A, \(\gamma \), P, R). The decision-maker or agent chooses an action \(a_{t} \in A\) in state \(s_{t} \in S\) at time t according to its policy \(\pi (a_{t}|s_{t})\), which determines the agent’s way of behaving. The probability of moving to the next state \(s_{t+1} \in S\) is given by the state transition function \(P(s_{t+1}|s_{t}, a_{t})\). The environment produces a reward \(R(s_t, a_t, s_{t+1})\) based on the action taken by the agent at time t. This process continues until the maximum time step is reached or the agent reaches a terminal state. The objective is to maximise the expected discounted cumulative reward, which is given by:

$$\begin{aligned} E_{\pi }[R_{t}]=E_{\pi }\left[ \sum _{i=0}^{\infty }\gamma ^{i}r_{t+i}\right] \end{aligned}$$

where \(\gamma \in [0,1]\) is a discount factor used to specify that rewards in the distant future are less valuable than those in the nearer future. While an RL agent may only learn its policy, it may also learn (online or offline) the transition and reward functions.
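For a finite episode, the discounted cumulative reward inside the expectation above can be computed with a single backward pass over the rewards. The sketch below is only an illustration of that sum for one sampled trajectory (it does not estimate the expectation over the policy); the function names and the toy reward sequences are assumptions for this example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i], accumulated from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def returns_per_step(rewards, gamma):
    """Discounted return from every time step of the episode onwards."""
    out = []
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
        out.append(total)
    out.reverse()
    return out


print(discounted_return([1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(returns_per_step([0, 0, 1], 0.5))   # [0.25, 0.5, 1.0]
```

Setting `gamma` to 0 keeps only the immediate reward, while `gamma` equal to 1 weights all rewards equally, which is only well defined for episodes of finite length.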

3 Deep reinforcement learning

Deep reinforcement learning (DRL) combines conventional RL with DL to overcome the limitations of RL in complex environments with large state spaces or high computation requirements. DRL employs DNNs to estimate the value function, policy or model, which in conventional RL are learnt through the storage of state-action pairs (Li 2017). Deep RL algorithms can be classified along several dimensions, for instance on-policy vs off-policy, model-free vs model-based, and value-based vs policy-based. The key characteristics of DRL algorithms are depicted in Fig. 4. Interested readers are referred to Li (2017) for more details on these algorithms. This section focuses on popular DRL algorithms employed in audio-based applications in three main categories: (i) value-based DRL, (ii) policy gradient-based DRL and (iii) model-based DRL.

Fig. 4 Characteristics of different DRL algorithms

3.1 Value-based DRL

One of the most famous value-based DRL algorithms is the deep Q-network (DQN), introduced by Mnih et al. (2015), which learns directly from high-dimensional inputs. It employs a CNN, referred to as the Q-network, to estimate a value function \(Q(s, a)\) by minimising the loss function at the \(i^{th}\) iteration, given by

$$\begin{aligned} L_{i}(\theta _{i})={\mathbb {E}}_{s,a\sim p(\cdot )}\left[ (y_{i}-Q(s,a;\theta _{i}))^2\right] , \end{aligned}$$

where \(y_{i}={\mathbb {E}}_{s\prime \sim s}[r+\gamma \underset{a\prime }{\text {max}}\, Q(s\prime , a\prime ;\theta _{i-1})|s,a]\) defines the \(i^{th}\) iteration target and \(\theta \) represents the weights of the Q-network. DQN enhances data efficiency and stability of the learning process using a technique known as experience replay, where the agent’s experience at each time step t, \(e_t = \{s_t, a_t, r_t, s_{t+1}\}\), is stored in a replay memory \(D = \{e_1, e_2, e_3,\ldots ,e_N\}\). Subsequently, mini-batches of experience \(e \sim D\) are sampled at random and used for Q-learning updates. After experience replay, the agent applies an \(\epsilon \)-greedy policy to select and execute an action. Although DQN has rendered super-human performance in Atari games since its inception, it relies on a single max operator, in the target \(y_{i}\) above, for both the selection and the evaluation of an action. Thus, the selection of an overestimated action may lead to over-optimistic action-value estimates that induce an upward bias. Double DQN (DDQN) (Van Hasselt et al. 2016) eliminates this positive bias by introducing two decoupled estimators: one for the selection of an action, and one for its evaluation. Schaul et al. (2016) show that the performance of DQN and DDQN is enhanced considerably if significant experience transitions are prioritised and replayed more frequently. Wang et al. (2016) present a duelling network architecture (DNA) to estimate a value function \(V(s)\) and the associated advantage function \(A(s, a)\) separately, and then combine them to get the action-value function \(Q(s, a)\). In contrast to DQN, which follows its convolutional layers with a single fully connected layer, DNA employs two fully connected layers for the estimation of the scalar \(V(s;\theta ,\beta )\) and the |A|-dimensional vector \(A(s,a;\theta ,\alpha )\). Here, \(\theta \) represents the parameters of the convolutional layers, whereas \(\alpha \) and \(\beta \) are the parameters of the two fully connected streams (layers). The two fully connected streams are combined to output a single Q value as per the equation below. Results show that DQN and DDQN with DNA and prioritised experience replay can lead to improved performance.

$$\begin{aligned} Q(s,a;\theta ,\alpha ,\beta )= V(s;\theta ,\beta ) + \Big (A(s,a;\theta ,\alpha ) - \frac{1}{|A|}\sum _{a'} A(s,a';\theta ,\alpha )\Big ). \end{aligned}$$
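The difference between the DQN and Double DQN targets, and the duelling aggregation in the equation above, can be illustrated with a few lines of plain Python operating on lists of Q-values for one transition. This is a minimal sketch rather than the cited authors' implementations: the networks are replaced by hand-written lists of values, and the function names and the `done` flag for terminal transitions are assumptions made for this example.

```python
def dqn_target(reward, next_q_target, gamma, done=False):
    """DQN target: the same estimator both selects and evaluates the next action."""
    if done:  # assumption: no bootstrapping from terminal states
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma, done=False):
    """Double DQN target: online estimates select, target estimates evaluate."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


def dueling_q(value, advantages):
    """Duelling aggregation: Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [value + (a - mean_advantage) for a in advantages]


# Hypothetical Q-value estimates for the next state, three actions.
next_q_online = [1.0, 3.0, 2.0]
next_q_target = [4.0, 0.5, 2.5]
print(dqn_target(1.0, next_q_target, gamma=0.5))                        # 3.0
print(double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.5))  # 1.25
print(dueling_q(2.0, [1.0, 2.0, 3.0]))                                  # [1.0, 2.0, 3.0]
```

In this toy case the single max operator picks the largest (possibly overestimated) target value, whereas the decoupled target evaluates the action preferred by the other estimator and is never larger. Subtracting the mean advantage makes the average of the resulting Q-values equal to \(V(s)\), which keeps the two streams identifiable.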

Unlike the aforementioned DQN algorithms that focus on the expected return, distributional DQN (Bellemare et al. 2017) aims to learn the full distribution of the value in order to have additional information about rewards. Although both DQN and distributional DQN focus on maximising the expected return, the latter results in comparatively better learning performance. Dabney et al. (2018) propose distributional DQN with quantile regression (QR-DQN) to explicitly model the distribution of the value function. Results demonstrate that QR-DQN successfully bridges the gap between theoretic and algorithmic results. Implicit quantile networks (IQN) (Dabney et al. 2018), an extension to QR-DQN, estimate quantile regression by learning the full quantile function instead of focusing on a discrete number of quantiles. IQN also provides flexibility regarding its training with the required number of samples per update, ranging from one sample to the maximum that is computationally allowed. IQN has been shown to outperform QR-DQN comprehensively in the Atari domain.

The success of DQN in learning rich representations is largely attributed to DNNs, while batch algorithms have been shown to have better stability and data efficiency (requiring less tuning of hyperparameters). The authors in Levine et al. (2017) propose a hybrid approach named least-squares DQN (LS-DQN) that exploits the advantages of both DQN and batch algorithms. Deep Q-learning from demonstrations (DQfD) (Hester et al. 2018) leverages human demonstrations to learn at an accelerated rate from the start. Deep quality-value (DQV) (Sabatelli et al. 2018) is a temporal-difference-based algorithm that trains the Value network initially, and subsequently uses it to train a Quality-value neural network for estimating a value function. Results in the Atari domain indicate that DQV outperforms DQN as well as DDQN. The authors in Arjona-Medina et al. (2019) propose RUDDER (return decomposition for delayed rewards), which encompasses reward redistribution and return decomposition for Markov decision processes (MDPs) with delayed rewards. Pohlen et al. (2018) employ a transformed Bellman operator along with human demonstrations in the proposed algorithm Ape-X DQfD to attain human-level performance over a wide range of games. Results show that the proposed algorithm achieves average-human performance in 40 out of 42 Atari games with the same set of hyperparameters. Schulman et al. (2017) study the connection between Q-learning and policy gradient methods. They show that soft Q-learning (an entropy-regularised version of Q-learning) is equivalent to policy gradient methods and that it performs as well as (if not better than) standard variants.

Previous studies have also attempted to incorporate a memory element into DRL algorithms. For instance, the deep recurrent Q-network (DRQN) approach introduced by Hausknecht and Stone (2015) was able to successfully integrate information through time and performed well on standard Atari games. A further improvement was made by introducing an attention mechanism to DQN, resulting in a deep attention recurrent Q-network (DARQN) (Sorokin et al. 2015). This allows DARQN to focus on a specific part of the input and achieve better performance compared to DQN and DRQN on games. Some other studies (Oh et al. 2016; Parisotto and Salakhutdinov 2018) have also proposed methods to incorporate memory into DRL, but this area remains to be investigated further.

3.2 Policy gradient-based DRL

Policy gradient-based DRL algorithms aim to learn an optimal policy that maximises performance objectives, such as expected cumulative reward. This class of algorithms makes use of gradient theorems to reach optimal policy parameters. Policy gradient typically requires the estimation of a value function based on the current policy. This may be accomplished using the actor-critic architecture, where the actor represents the policy and the critic refers to the value function estimate (Konda and Tsitsiklis 1999). Mnih et al. (2016) proposed the advantage actor-critic (A2C) algorithm, which employs an advantage function instead of a value function for updating network weights. The advantage function \(A (s_t, a_t) = Q (s_t, a_t) - V (s_t)\) estimates the benefit of a chosen action over an average action for a given state. The authors of Mnih et al. (2016) demonstrate that actor-critic methods yield superior results over value-based methods in terms of training speed. They further show that asynchronous execution of multiple parallel agents on standard CPU-based hardware leads to time-efficient and resource-efficient learning. The proposed asynchronous version of actor-critic, asynchronous advantage actor-critic (A3C), updates the policy and value functions after every \(t_{max}\) actions or when a terminal state is reached. For a single update, the agent first receives n-step returns by selecting actions based on its exploration policy until \(t_{max}\) steps have elapsed or a terminal state is reached. Afterwards, n-step Q-learning updates are computed for every state-action pair, and these are further used in the calculation of a single gradient step. A3C exhibits remarkable learning in both 2D and 3D games with action spaces in discrete as well as continuous domains. The authors in Babaeizadeh et al. (2017) propose a hybrid CPU/GPU-based A3C, named GA3C, which shows significantly higher speeds compared to its CPU-based counterpart.
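The n-step returns and advantage estimates described above can be sketched as follows for one rollout of at most \(t_{max}\) steps. This is a simplified illustration under stated assumptions: the critic is replaced by a hand-written list of state values, the advantage is estimated as the n-step return minus \(V(s_t)\), and the names `n_step_returns` and `advantages` are chosen for this example rather than taken from the cited papers.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Returns for each step of a rollout, bootstrapped from the critic.

    `bootstrap_value` is the critic's estimate V(s) for the state reached after
    the last reward, or 0.0 if the rollout ended in a terminal state.
    """
    running = bootstrap_value
    out = []
    for reward in reversed(rewards):
        running = reward + gamma * running
        out.append(running)
    out.reverse()
    return out


def advantages(returns, values):
    """Advantage estimates: n-step return minus the critic's value V(s_t)."""
    return [g - v for g, v in zip(returns, values)]


rollout_rewards = [1.0, 0.0, 2.0]  # hypothetical rewards for three steps
state_values = [1.5, 2.0, 3.0]     # hypothetical critic outputs V(s_t)
returns = n_step_returns(rollout_rewards, bootstrap_value=4.0, gamma=0.5)
print(returns)                            # [2.0, 2.0, 4.0]
print(advantages(returns, state_values))  # [0.5, 0.0, 1.0]
```

A positive advantage indicates that the sampled action did better than the critic expected for that state, so the actor's probability of that action is increased, while the critic is regressed towards the computed returns.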

Asynchronous actor-critic algorithms, including A3C and GA3C, may suffer from inconsistent and asynchronous parameter updates. A novel framework for asynchronous algorithms is proposed in Alfredo et al. (2017) to leverage parallelisation while providing synchronous parameter updates. The authors show that the proposed parallel advantage actor-critic (PAAC) algorithm enables true on-policy learning in addition to faster convergence. The authors in O’Donoghue et al. (2016) propose a hybrid policy-gradient-and-Q-learning (PGQL) algorithm that combines on-policy policy gradient with off-policy Q-learning. Results demonstrate PGQL’s superior performance on Atari games compared to both A3C and Q-learning. Munos et al. (2016) propose a novel algorithm by bringing together three off-policy algorithms: Importance Sampling (IS), Q(\(\lambda \)), and Tree-Backup TB(\(\lambda \)). This algorithm, called Retrace(\(\lambda \)), alleviates the weaknesses of all three algorithms (IS has high variance, Q(\(\lambda \)) is not safe, and TB(\(\lambda \)) is inefficient) and promises safety, efficiency and guaranteed convergence. Reactor (Retrace-Actor) (Gruslys et al. 2017) is a Retrace-based actor-critic agent architecture that combines the time efficiency of asynchronous algorithms with the sample efficiency of off-policy experience replay-based algorithms. Results in the Atari domain indicate that the proposed algorithm performs comparably with state-of-the-art algorithms while yielding substantial gains in terms of training time. The importance weighted actor-learner architecture (IMPALA) (Espeholt et al. 2018) is a scalable distributed agent that is capable of handling multiple tasks with a single set of parameters. Results show that IMPALA outperforms A3C baselines in a diverse multi-task environment.

Schulman et al. (2015) propose a robust and scalable trust region policy optimisation (TRPO) algorithm for optimising stochastic control policies. TRPO promises guaranteed monotonic improvement when optimising nonlinear and complex policies with a very large number of parameters. This learning algorithm makes use of a fixed KL divergence constraint rather than a fixed penalty coefficient and outperforms a number of gradient-free and policy-gradient methods over a wide variety of tasks. Schulman et al. (2017) introduce proximal policy optimisation (PPO), which aims to be as reliable and stable as TRPO while being simpler to implement and having better sample complexity.
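To give a feel for the two ideas, the sketch below evaluates, for a single state with a discrete action distribution, the KL divergence that TRPO constrains and the clipped surrogate objective used by the most common variant of PPO. The clipping form is not described in the text above; it is included here as the usual formulation from the PPO paper. The probability values, the threshold `delta`, the clipping range `eps`, and all function names are assumptions for this illustration, and a real implementation would average these quantities over many sampled states and actions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions (assumes q[i] > 0 where p[i] > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def within_trust_region(old_policy, new_policy, delta):
    """TRPO-style check: accept the update only if KL(old || new) <= delta."""
    return kl_divergence(old_policy, new_policy) <= delta


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style objective for one sample, with ratio = pi_new(a|s) / pi_old(a|s)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


old_policy = [0.5, 0.3, 0.2]     # hypothetical action probabilities
new_policy = [0.55, 0.27, 0.18]
print(within_trust_region(old_policy, new_policy, delta=0.01))
print(clipped_surrogate(ratio=1.5, advantage=1.0))   # gain is capped by the clip
print(clipped_surrogate(ratio=0.5, advantage=-1.0))  # pessimistic value is kept
```

The constraint and the clip serve the same purpose of keeping the new policy close to the one that collected the data; the clip achieves this with a first-order objective, which is one reason PPO is considered easier to implement.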

Table 2 Summary of DRL algorithms

3.3 Model-based DRL

Model-based DRL algorithms rely on models of the environment (i.e. underlying dynamics and reward functions) in conjunction with a planning algorithm. Unlike model-free DRL methods that typically entail a large number of samples to render adequate performance, model-based algorithms generally lead to improved sample and time efficiency (Ravindran 2019).

Kaiser et al. (2019) propose simulated policy learning (SimPLe), a model-based DRL algorithm built on video prediction that requires far fewer agent-environment interactions than model-free algorithms. Experimental results indicate that SimPLe outperforms state-of-the-art model-free algorithms in Atari games. The authors of Whiteson (2018) propose TreeQN for complex environments, where the transition model is not explicitly given. The proposed algorithm combines model-free and model-based approaches in order to estimate Q-values based on a dynamic tree constructed recursively through an implicit transition model. They also propose an actor-critic variant named ATreeC that augments TreeQN with a softmax layer to form a stochastic policy network. They show that both algorithms outperform n-step DQN and value prediction networks (Oh et al. 2017) on multiple Atari games. Vezhnevets et al. (2016) introduce a Strategic Attentive Writer (STRAW), which is capable of making natural decisions by learning macro-actions. Unlike state-of-the-art DRL algorithms that yield only one action after every observation, STRAW generates a sequence of actions, thus leading to structured exploration. Experimental results indicate a significant improvement in Atari games with STRAW. Value Propagation (VProp) (Nardelli et al. 2018) is a set of Value Iteration-based planning modules trained using RL and capable of solving unseen tasks and navigating in complex environments. It is also demonstrated that VProp is able to generalise in a dynamic and noisy environment. Schrittwieser et al. (2019) present a model-based algorithm named MuZero that combines tree-based search with a learned model to render superhuman performance in challenging environments. Experimental results demonstrate that MuZero delivers state-of-the-art performance on 57 diverse Atari games. Table 2 presents an overview of DRL algorithms at a glance.

3.4 Audio processing using DRL

Audio processing using DRL involves several components, including the environment, agent, actions, and rewards. Audio is a 1-dimensional (1D) time-series signal that goes through different pre-processing and feature extraction procedures. Pre-processing steps involve noise suppression, silence removal, and channel equalisation, which enhance audio signal quality to build robust and efficient audio-based systems. It has been found that pre-processing helps to improve DL-based audio systems (Latif et al. 2020). Feature extraction usually comes after pre-processing and aims to convert an audio signal into a reasonably limited number of meaningful and informative features. Mel-frequency cepstral coefficients (MFCCs) and spectrograms are popular choices of input features in audio-based systems (Latif et al. 2020). These features are given to the DRL agent to perform different tasks based on the application. An example scenario is a human speaking to a machine trained via DRL as in Fig. 5, where the machine has to act based on features derived from audio (among other) signals. We discuss in detail different audio-based systems next.
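As a concrete example of the feature extraction step, the following sketch computes a log-power spectrogram from a 1D signal using only the Python standard library. It is a deliberately simple illustration: a naive discrete Fourier transform is used instead of an FFT, no mel filterbank or cepstral step is applied (so these are not MFCCs), and the frame length, hop size, window and function names are assumptions chosen for this example.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a 1D signal into overlapping frames of `frame_len` samples."""
    return [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len + 1, hop)]


def log_spectrogram(signal, frame_len=64, hop=32, eps=1e-10):
    """Log-power spectrogram: one row per frame, one column per frequency bin."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    spectrogram = []
    for frame in frame_signal(signal, frame_len, hop):
        windowed = [x * w for x, w in zip(frame, window)]  # Hann window
        row = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(windowed)
            )
            row.append(math.log(abs(coeff) ** 2 + eps))
        spectrogram.append(row)
    return spectrogram


# Synthetic test tone (hypothetical): 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
features = log_spectrogram(tone)
strongest_bin = max(range(len(features[0])), key=features[0].__getitem__)
print(len(features), len(features[0]), strongest_bin)
```

Each row of the resulting matrix is one time step of the observation that could be passed to a DRL agent, typically after further processing by a CNN or RNN front end.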

4 Audio-based DRL

This section surveys related works where audio is a key element in the learning environments of DRL agents. Table 3 summarises the characterisation of DRL agents for six audio-related areas: (I) automatic speech recognition; (II) spoken dialogue systems; (III) emotions modelling; (IV) audio enhancement; (V) music listening and generation; and (VI) human–robot interaction (HRI). There is a large literature on audio-based DRL, and it is used in a wide variety of applications. Therefore, to keep this review to a manageable length, we limit ourselves to these six main areas here. In Sect. 4.7 we briefly mention some remaining audio-related areas and other applications.

Fig. 5 Schematic diagram of DRL agents for audio-based applications, where the DL model (via DNNs, CNNs, RNNs, etc.) generates audio features from raw waveforms or other audio representations for taking actions that change the environment from state \(s_t\) to a next state \(s_{t+1}\)

Table 3 Summary of audio related fields, characterisation of DRL agents, and related datasets

4.1 Automatic speech recognition (ASR)

Automatic speech recognition (ASR) is the process of converting a speech signal into its corresponding text by using algorithms. Contemporary ASR technology has reached great levels of performance due to advancements in DL techniques. The performance of ASR systems, however, relies heavily on supervised training of deep models with large amounts of transcribed data. Even for resource-rich languages, additional transcription costs required for new tasks hinders the applications of ASR. To broaden the scope of ASR, different studies have attempted DRL based models to learn from feedback or environment. This form of learning aims to reduce transcription costs and time by providing positive or negative rewards instead of detailed transcriptions. For instance, Kala and Shinozaki (2018) proposed an RL framework for ASR based on the policy gradient method that provides a new view of existing training and adaptation methods. This makes the ASR system self-sufficient to learn from feedback of users and help achieve improved speech recognition performance and reduced Word Error Rate (WER) compared to unsupervised adaptation. In ASR, sequence-to-sequence models have shown great success; however, these models fail to approximate real-world speech during inference. Tjandra et al. (2018) solved this issue by training a sequence-to-sequence model with a policy gradient algorithm. In contrast to standard training on maximum likelihood estimation (MLE), they used policy gradient to sample the whole transcription by directly optimising the negative Levenshtein distance as the reward. Their results showed a significant improvement using an RL-based objective and an MLE objective compared to the model trained with only the MLE objective. In another study, (Tjandra et al. 2019) the authors found that using token-level rewards (intermediate rewards are given after each time step) provides improved performance compared to sentence-level rewards and baseline systems. 
To address the issues of semi-supervised training of sequence-to-sequence ASR models, Chung et al. (2020) investigated the REINFORCE algorithm by rewarding the ASR model for outputting more correct sentences for both unpaired and paired speech input data. Experimental evaluations showed that the DRL-based method was able to effectively reduce character error rates from 10.4% to 8.7%.
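To make the sentence-level reward mentioned above concrete, the following minimal sketch computes the negative Levenshtein distance between a reference transcription and a sampled hypothesis. The function names and the example sentences are illustrative assumptions and are not taken from the cited works.

```python
def levenshtein(ref, hyp):
    """Edit distance between two token sequences (unit cost for
    insertion, deletion and substitution)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sentence_reward(ref, hyp):
    """Hypothetical sentence-level reward: the negative edit distance,
    so a perfect transcription receives the maximum reward of 0."""
    return -levenshtein(ref, hyp)


reference = "the cat sat".split()
sampled = "the bat sat down".split()
print(sentence_reward(reference, sampled))
```

A token-level variant would instead emit a reward after every decoding step, for example the change in this distance once a new token is appended; that choice is likewise an assumption of this sketch.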

Karita et al. (2018) proposed to train an encoder-decoder ASR system using a sequence-level evaluation metric based on the policy gradient objective function. This enables the minimisation of the expected WER of the model predictions. In this way, the authors found that the proposed method improves speech recognition performance. The ASR system of Zhou et al. (2018) was jointly trained with maximum likelihood and policy gradient to improve via end-to-end learning. They were able to optimise the performance metric directly with policy learning and achieve 4% to 13% relative improvement for end-to-end ASR. In Luo et al. (2017), the authors attempted to solve sequence-to-sequence problems by proposing a model based on supervised backpropagation and a policy gradient method, which can directly maximise the log probability of the correct answer. They achieved very encouraging results on small-scale and medium-scale ASR tasks. Radzikowski et al. (2019) proposed a dual supervised model based on a policy gradient methodology for non-native speech recognition. They evaluated warm-start and semi warm-start approaches, and were able to achieve promising results for the English language pronounced by Japanese and Polish speakers.
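As an illustration of the objectives discussed above, a joint criterion of this kind can be sketched as follows. This is a generic formulation in illustrative notation, not necessarily the exact one used in the cited works: \(x\) is the input speech, \(y^{*}\) the reference transcription, \(p_{\theta}\) the ASR model, and \(\lambda \in [0,1]\) a hypothetical interpolation weight.

\[
\mathcal{L}(\theta) = -\lambda \log p_{\theta}(y^{*} \mid x) + (1-\lambda)\, \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)}\big[\mathrm{WER}(y, y^{*})\big]
\]

Because WER is computed on discrete outputs, the second term cannot be differentiated through the sampled transcription. Policy gradient methods instead rely on the score-function identity

\[
\nabla_{\theta}\, \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)}\big[\mathrm{WER}(y, y^{*})\big] = \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)}\big[\mathrm{WER}(y, y^{*})\, \nabla_{\theta} \log p_{\theta}(y \mid x)\big],
\]

whose right-hand side is typically estimated from a small number of sampled transcriptions.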

To achieve the best possible accuracy, end-to-end ASR systems are becoming increasingly large and complex. DRL methods can also be leveraged to provide model compression (He et al. 2018). In Dudziak et al. (2019), RL-based ShrinkML is proposed to optimise the per-layer compression ratios in a state-of-the-art LSTM-based ASR model with attention. They exploited RL to push the boundaries of singular value decomposition (SVD) based ASR model compression. Evaluations were performed on LibriSpeech data. Based on the results, the authors found that the RL-based model was able to compress an ASR system more effectively than manual compression. For time-efficient ASR, Rajapakshe et al. (2020) evaluated the pre-training of an RL-based policy gradient network. They found that pre-training in DRL offers faster convergence compared to non-pre-trained networks, and also achieves improved recognition in less time. To tackle the slow convergence time of the REINFORCE algorithm (Williams 1992), Lawson et al. (2018) evaluated Variational Inference for Monte Carlo Objectives (VIMCO) and Neural Variational Inference and Learning (NVIL) for phoneme recognition tasks in clean and noisy environments. The authors found that the proposed method (using VIMCO and NVIL) outperforms REINFORCE and other methods at training online sequence-to-sequence models.
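The trade-off that a per-layer compression ratio controls in SVD-based compression can be illustrated with simple parameter counting. The sketch below assumes that a weight matrix of size rows x cols is replaced by two rank-limited factors; the layer shapes and ranks are hypothetical, and the code does not reproduce the ShrinkML search itself.

```python
def low_rank_params(rows, cols, rank):
    """Parameters needed to store a (rows x cols) matrix as the product
    of a (rows x rank) factor and a (rank x cols) factor."""
    return rank * (rows + cols)


def compression_ratio(rows, cols, rank):
    """Original parameter count divided by the factorised count."""
    return (rows * cols) / low_rank_params(rows, cols, rank)


def break_even_rank(rows, cols):
    """Largest rank for which the factorisation is strictly smaller
    than the original matrix: rank * (rows + cols) < rows * cols."""
    return (rows * cols - 1) // (rows + cols)


def model_size(layer_shapes, ranks):
    """Total parameters when each layer is truncated to its own rank,
    which is the per-layer decision an agent would have to make."""
    return sum(low_rank_params(r, c, k)
               for (r, c), k in zip(layer_shapes, ranks))


# Hypothetical two-layer model with a different rank per layer.
shapes = [(1024, 1024), (1024, 512)]
print(compression_ratio(1024, 1024, 128))
print(model_size(shapes, ranks=[128, 64]))
```

Choosing the ranks independently per layer is what makes the search space large: a layer that tolerates aggressive truncation can pay for a more sensitive layer that needs a higher rank.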

The studies above highlight ways to improve the performance of ASR systems by involving interaction with the environment using DRL. Despite these promising results, further research is required on DRL algorithms towards building autonomous ASR systems that can work in complex real-life settings. The REINFORCE algorithm is very popular in ASR; research is therefore also required to explore other DRL algorithms and assess their suitability for ASR.

4.2 Spoken dialogue systems (SDSs)

Spoken dialogue systems are gaining interest due to many applications in customer services and goal-oriented human-computer interaction. Typical SDSs integrate several key components including a speech recogniser, intent recogniser, knowledge base and/or database backend, dialogue manager, language generator, and speech synthesiser, among others (Zue and Glass 2000). The task of a dialogue manager in SDSs is to select actions based on observed events (Levin et al. 2000; Singh et al. 2000). Researchers have shown that the action selection process can be effectively optimised using RL to model the dynamics of spoken dialogue as a fully or partially observable Markov Decision Process (Paek 2006). Numerous studies have utilised RL-based algorithms in spoken dialogue systems. In contrast to text-based dialogue systems that can be trained directly using large amounts of text data (Gao et al. 2019), most SDSs have been trained using user simulations (Schatzmann et al. 2006). This is mainly due to the insufficient amount of real dialogue data available for training or testing (Serban et al. 2018).
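The view of dialogue management as a fully observable Markov Decision Process trained against a user simulation can be sketched with tabular Q-learning on a toy task. Everything in this example is an illustrative assumption: the two-slot task, the two system actions, the deterministic simulated user, and the reward values (a per-turn penalty plus a final success or failure reward).

```python
import random

ACTIONS = ["request", "close"]
NUM_SLOTS = 2  # hypothetical task: fill two slots, then close the dialogue


def step(state, action):
    """Toy simulated user. The state is the number of filled slots.
    Returns (next_state, reward, done)."""
    if action == "request":
        # The simulated user always answers; each turn costs -1.
        return min(state + 1, NUM_SLOTS), -1.0, False
    # Closing ends the dialogue: success only if all slots are filled.
    success = state == NUM_SLOTS
    return state, (10.0 if success else -10.0), True


def train(episodes=500, alpha=0.5, gamma=0.95, epsilon=0.2,
          max_turns=10, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(NUM_SLOTS + 1) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(max_turns):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            td_target = reward + gamma * best_next
            q[(state, action)] += alpha * (td_target - q[(state, action)])
            state = next_state
            if done:
                break
    return q


def greedy_policy(q):
    """Best action for each number of filled slots."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(NUM_SLOTS + 1)]


print(greedy_policy(train()))
```

In this toy setting the learnt policy requests information until both slots are filled and only then closes the dialogue. Real systems differ mainly in the size of the state space, which is why the works discussed in this section replace the table with a neural network.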

SDSs involve policy optimisation to respond to humans by taking the current state of the dialogue, selecting an action, and returning the verbal response of the system. For instance, Chen et al. (2020) presented an online DRL-based dialogue state tracking framework in order to improve the performance of a dialogue manager. They achieved promising results for online dialogue state tracking in the second and third dialogue state tracking challenges (Henderson et al. 2014, 2014). Weisz et al. (2018) utilised DRL approaches, including actor-critic methods and off-policy RL. They also evaluated actor-critic with experience replay (ACER) (Wang et al. 2016; Munos et al. 2016), which has shown promising results on simple gaming tasks. They showed that the proposed method is sample efficient and performed better than some state-of-the-art DL approaches for spoken dialogue. A task-oriented end-to-end DRL-based dialogue system is proposed in Cuayáhuitl (2017). The author showed that DRL-based optimisation produced significant improvement in task success rate and also caused a reduction in dialogue length compared to supervised training. Zhao and Eskenazi (2016) utilised deep recurrent Q-networks (DRQN) for dialogue state tracking and management. Experimental results showed that the proposed model can exploit the strengths of DRL and supervised learning to achieve faster learning speed and better results than the modular-based baseline system. To present baseline results, a benchmark study (Casanueva et al. 2017) was performed using DRL algorithms including DQN, A2C and natural actor-critic (Su et al. 2017), and their performance was compared against GP-SARSA (Gašić et al. 2013). Based on experimental results on the PyDial toolkit (Ultes et al. 2017), the authors conclude that substantial improvements are still needed for DRL methods to match the performance of carefully designed handcrafted policies.

In addition to SDSs optimised via flat DRL, hierarchical RL/DRL methods have been proposed for policy learning using dialogue states with different levels of abstraction and dialogue actions at different levels of granularity (via primitive and composite actions) (Cuayáhuitl 2009; Cuayáhuitl et al. 2010; Dethlefs and Cuayáhuitl 2015; Budzianowski et al. 2017; Peng et al. 2017; Zhang et al. 2018). The benefits of this form of learning include faster training and policy reuse. A deep Q-network based multi-domain dialogue system is proposed in Cuayáhuitl et al. (2016). They train the proposed SDS using a network of DQN agents, which is similar to hierarchical DRL but with more flexibility for transitioning across dialogue domains. Another work related to faster training is proposed by Gordon-Hall et al. (2020), where the behaviour of RL agents is guided by expert demonstrations.

The optimisation of dialogue policies requires a reward function that unfortunately is not easy to specify. Unless a clear and concrete performance function is available (rather unlikely), this stage may require annotated data for training a reward predictor instead of a hand-crafted one. In real-world applications, such annotations are either scarce or not available. Therefore, some researchers have turned their attention to methods for online active reward learning. In Su et al. (2016), the authors presented an online learning framework for a spoken dialogue system. They jointly trained the dialogue policy alongside the reward model via active learning. Based on the results, the authors showed that the proposed framework can significantly reduce data annotation costs and can also mitigate noisy user feedback in dialogue policy learning. Su et al. (2017) introduced two approaches: trust region actor-critic with experience replay (TRACER) and episodic natural actor-critic with experience replay (eNACER) for dialogue policy optimisation. From these two algorithms, they achieved the best performance using TRACER.

In Ultes et al. (2017), the authors propose to learn a domain-independent reward function based on user satisfaction for dialogue policy learning. The authors showed that the proposed framework yields good performance for both task success rate and user satisfaction. Researchers have also used DRL to learn dialogue policies in noisy environments, and some have shown that their proposed models can generate dialogues indistinguishable from human ones (Fazel-Zarandi et al. 2017). Carrara et al. (2017) propose a clustering approach for online user adaptation in RL-based dialogue systems. They propose a distance metric and build on previous works in an attempt to reduce the number of possible transfers from similar users. Experiments were carried out on a negotiation dialogue task, which showed significant improvements over baselines. In another study (Carrara et al. 2018), authors proposed \(\epsilon \)-safe, a Q-learning algorithm, for safe transfer learning for dialogue applications. A DRL-based chatbot called MILABOT was designed in Serban et al. (2017), which can converse with humans on popular topics through both speech and text—performing significantly better than many competing systems. The text-based chatbot in Cuayáhuitl et al. (2019) used an ensemble of DRL agents and showed that training multiple dialogue agents performs better than a single agent.

Table 4 shows a summary of DRL-based (spoken) dialogue systems. While not all involve spoken interactions, they can be applied to speech-based systems by, for example, using the outputs from speech recognisers instead of typed interactions, i.e. ASR systems can be seen as feature extractors from audio data for dialogic interaction. While at first glance it may look like text-based DRL agents would not be able to cope with noisy inputs, using ASR-based inputs can be useful because they can be enriched with word, sentence, and/or knowledge embeddings for dealing with utterances that were not observed during training. This form of generalisation would be hard to achieve (if not impossible) using only audio-based features. In addition, modelling the dialogue history with features taken from multiple utterances in a conversation can help to deal with noisy inputs. But the choice of features to include in the dialogue history over multiple turns has been, and still is, task-specific; task-agnostic features need to be investigated further to benefit the creation of applications bootstrapped by previous ones. Including low and high-level features, ranging from speech features to multimodal and knowledge features, requires further understanding in order to draw recommendations for different applications. To our knowledge, the inclusion of audio features in the dialogue state has been overlooked in SDSs (with the exception of Zorrilla et al. (2021)); it could prove useful, but this remains to be investigated further.

In terms of application, we can observe in Table 4 that most systems focus on one or a few domains (or tasks); training systems on a large number of domains is usually not attempted, presumably due to the high requirements of data and compute involved. Regarding algorithms, the most popular are DQN-based or REINFORCE among other more recent algorithms; when to use one algorithm over another still needs to be understood better. We can also observe that user simulations are mostly used for training task-oriented dialogue systems, while real data is the preferred choice for open-ended dialogue systems. We can note that while transfer learning is an important component in a trained SDS, it is not commonplace yet. Given that learning from scratch every time a system is trained is neither scalable nor practical, it looks like transfer learning will naturally be adopted more and more in the future as more domains are taken into account. In terms of datasets, most of them are still small. It is rare to see SDSs trained with millions of dialogues or sentences. As datasets grow, the need for more efficient training methods will become more relevant in future systems. Regarding human evaluations, we can observe that about half of the research works involve human evaluations. While human evaluations may not always be required to answer a research question, they certainly should be used whenever learnt conversational skills are being assessed or judged. We can also note that there is no standard for specifying reward functions due to the wide variety of functions used in previous works; almost every paper uses a different reward function. Even when some works use learnt reward functions (e.g. based on adversarial learning), they focus on learning to discriminate between machine-generated and human-generated dialogues without taking other dimensions into account, such as task success or additional penalties. 
Although there is advancement in the specification of reward functions by learning them instead of hand-crafting them, this area requires better understanding for optimising different types of dialogues including information-seeking, chitchat, game-based, negotiation-based, etc.

4.3 Emotions modelling

Emotions are essential in vocal human communication, and they have recently received growing interest from the research community (Latif et al. 2019; Wang et al. 2020; Latif 2020; Ali et al. 2021). Arguably, human–robot interaction can be significantly enhanced if dialogue agents can perceive the emotional state of a user and its dynamics (Ma et al. 2020; Majumder et al. 2019). This line of research is categorised into two areas: emotion recognition in conversations (Poria et al. 2019), and affective dialogue generation (Young et al. 2020; Zhou et al. 2018). Speech emotion recognition (SER) can be used as a reward for RL-based dialogue systems (Heusser et al. 2019). This would allow the system to adjust its behaviour based on the emotional states of the dialogue partner. Lack of labelled emotional corpora and low accuracy in SER are two major challenges in the field. To achieve the best possible accuracy, various DL-based methods have been applied to SER; however, performance improvement is still needed for real-time deployments. DRL offers different advantages to SER, as highlighted in different studies. In order to improve audio-visual SER performance, Ouyang et al. (2018) presented a model-based RL framework that utilised feedback of testing results as rewards from the environment to update the fusion weights. They evaluated the proposed model on the Multimodal Emotion Recognition Challenge (MEC 2017) dataset and ranked in the top 2 of the MEC 2017 Audio-Visual Challenge. To minimise the latency in SER, Lakomkin et al. (2018) proposed EmoRL for predicting the emotional state of a speaker as soon as it gains enough confidence while listening. In this way, EmoRL was able to achieve lower latency and minimise the need for audio segmentation required in DL-based approaches for SER. In Sangeetha and Jayasankar (2019), the authors used RL with an adaptive fractional deep belief network (AFDBN) for SER to enhance human-computer interaction. 
They showed that the combination of RL with AFDBN is efficient in terms of processing time and SER performance. Another study (Chen et al. 2017) utilised an LSTM-based gated multimodal embedding with temporal attention for sentiment analysis. They exploited the policy gradient method REINFORCE to balance exploration and optimisation by random sampling. They empirically show that the proposed model was able to deal with various challenges of understanding communication dynamics.

DRL is less popular in SER compared to ASR and SDSs. The above-mentioned studies attempted to help solve different SER challenges using DRL; however, there is still a need for developing adaptive SER agents that can perform SER in the wild using only a few data samples (Latif et al. 2020; Ntalampiras 2021; Latif et al. 2020, 2021). Researchers have also been motivated to explore transfer learning in SER (Ntalampiras 2017) to utilise external knowledge for accelerating the learning process of agents.

4.4 Audio enhancement

The performance of audio-based intelligent systems is critically vulnerable to noisy conditions and degrades according to the noise levels in the environment (Li et al. 2015). Several approaches have been proposed (Latif et al. 2018; Li et al. 2013) to address problems caused by environmental noise. One popular approach is audio enhancement, which aims to generate an enhanced audio signal from its noisy or corrupted version (Wang and Wang 2016). DL-based speech enhancement has attained increased attention due to its superior performance compared to traditional methods (Baby et al. 2015; Wang and Chen 2018).

In DL-based systems, the audio enhancement module is generally optimised separately from the main task, such as minimisation of WER. Besides the speech enhancement module, there are several other units in speech-based systems which increase their complexity and make them non-differentiable. In such situations, DRL can achieve complex goals in an iterative manner, which makes it suitable for such applications. Such DRL-based approaches have been proposed in Shen et al. (2019) to optimise the speech enhancement module based on the speech recognition results. Experimental results have shown that DRL-based methods can effectively improve the system’s performance, with error rate reductions of 12.4% and 19.2% at signal-to-noise ratios of 0 dB and 5 dB, respectively. In Koizumi et al. (2017), the authors attempted to optimise DNN-based source enhancement using RL with numerical rewards calculated from conventional perceptual scores such as perceptual evaluation of speech quality (PESQ) (Recommendation 2001) and perceptual evaluation methods for audio source separation (PEASS) (Emiya et al. 2011). They showed empirically that the proposed method can improve the quality of the output speech signals by using RL-based optimisation. Fakoor et al. (2017) performed a study in an attempt to improve the adaptivity of speech enhancement methods via RL. They proposed to model the noise-suppression module as a black box, requiring no knowledge of the algorithmic mechanics. Using an LSTM-based agent, they showed that their method improves system performance compared to methods with no adaptivity. In Alamdari et al. (2020), the authors presented a DRL-based method to achieve personalised compression from noisy speech for a specific user in a hearing aid application. To deal with non-linearities of human hearing via the reward/punishment mechanism, they used a DRL agent that receives preference feedback from the target user. 
Experimental results showed that the developed approach achieved preferred hearing outcomes.

Similarly to SER, very few studies have explored DRL for audio enhancement. Most of these studies have evaluated DRL-based methods to achieve a certain level of signal enhancement in controlled environments. Further research efforts are needed to develop DRL agents that can perform their tasks in real and complex noisy environments.

4.5 Music listening and generation

DL models are widely used for generating content including images, text, and music. The motivation for using DL for music generation lies in its generality since it can learn from arbitrary corpora of music and be able to generate various musical genres compared to classical methods (Steedman 1984; Ebcioğlu 1988).

Here, DRL offers opportunities to impose rules of music theory for the generation of more realistic musical structures (Jaques et al. 2016). Various researchers have explored such opportunities of DRL for music generation. For instance, Kotecha (2018) explored DQN to impose greater global coherence and encourage exploration in music generation. Based on the evaluations, the authors achieved better quantitative and qualitative results using an LSTM-based architecture in generating polyphonic music aligned with musical rules. Jiang et al. (2020) presented an interactive RL-Duet framework for real-time human-machine duet improvisation. They used a music generation agent based on actor-critic with a generalised advantage estimator (GAE) (Schulman et al. 2016) to learn a policy for generating musical notes based on the previous context. They trained the model on monophonic and polyphonic data and were able to generate higher-quality musical pieces than a baseline method. Jaques et al. (2016) utilised a deep Q-learning agent with a reward function based on rules of music theory and probabilistic outputs of an RNN. They showed that the proposed model can learn composition rules while maintaining the important information of data learned from supervised training. For audio-based generative models, it is often important to tune the generated samples towards some domain-specific metrics. To achieve this, Guimaraes et al. (2017) proposed a method that combines adversarial training with RL. Specifically, they extend the training process of a GAN framework to include the domain-specific objectives in addition to the discriminator reward. Experimental results show that the proposed model can generate music while maintaining the information originally learned from data, and attained improvement in the desired metrics. In Lee et al. (2017), the authors also used a GAN-based model for music generation and explored optimisation via RL. 
They used an RNN-based generator to learn musical distributions from the embedded space and found that the proposed framework was able to generate musically coherent sequences with improved quantitative and qualitative measures. RaveForce (Lan et al. 2019) is a DRL-based environment for music generation, which can be used to search for new synthesis parameters for a specific timbre of an electronic musical note or loop.
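The generalised advantage estimator mentioned above in connection with actor-critic training can be written in a few lines. The sketch below follows the standard definition of Schulman et al. (2016), in which the advantage is an exponentially weighted sum of temporal-difference residuals; the default values of the discount factor and of the weighting parameter, as well as the example trajectory, are assumptions made for illustration.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised advantage estimation for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T); one more entry than rewards, the last
             being the bootstrap value (0.0 for a terminal state).
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the accumulated future residuals.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical three-note episode with a reward for each generated note.
print(gae([1.0, 0.0, 1.0], [0.5, 0.4, 0.6, 0.0]))
```

Setting `lam` to 0 reduces the estimate to the one-step temporal-difference residual, while setting it to 1 recovers the discounted return minus the value baseline; intermediate values trade bias against variance.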

Score following is the process of tracking a musical performance with respect to a known symbolic representation (a score). Dorfer et al. (2018) modelled the score following task with DRL algorithms such as synchronous advantage actor-critic (A2C). They designed a multimodal RL agent that listens to music, reads the score from an image and follows the audio in an end-to-end fashion. Experiments on monophonic and polyphonic piano music showed promising results compared to state-of-the-art methods. The score following task is also studied in Henkel et al. (2019) using A2C and proximal policy optimisation (PPO). This study showed that the proposed approach could be applied to track real piano recordings of human performances.
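For readers unfamiliar with PPO, its clipped surrogate objective can be sketched as follows. This is the generic form of the objective rather than the implementation used in the cited score following work; the clipping range of 0.2 and the example numbers are assumptions.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximised).

    ratios:     new_policy(a|s) / old_policy(a|s) for each sample
    advantages: advantage estimate for each sample
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes the incentive to move the policy
        # further once the ratio has left the trust interval.
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)


# A large ratio with a positive advantage is capped at 1 + clip_eps.
print(ppo_clip_objective([1.5], [1.0]))
```

The clipping is what distinguishes PPO from a plain importance-weighted policy gradient: updates that would change the action probabilities too much in a single step receive no additional credit.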

4.6 Human robot interaction (HRI)

There is a recent growing research interest in robotics to enable robots with abilities such as recognition of users’ gestures and intentions (Howard and Cambria 2013), and generation of socially appropriate speech-based behaviours (Goodrich and Schultz 2007). In such applications, RL is suitable because robots are required to learn from rewards obtained from their actions. Different studies have explored different DRL-based approaches for audio and speech processing in robotics. Gao et al. (2020) simulated an experiment for the acquisition of spoken language to provide a proof-of-concept of Skinner’s idea (Skinner et al. 1957), which states that children acquire language based on behaviourist reinforcement principles by associating words with meanings. Based on their results, the authors were able to show that acquiring spoken language is a combination of observing the environment, processing the observation, and grounding the observations with their true meaning through a series of reinforcement attempts. In Yu et al. (2018), authors build a virtual agent for language learning in a maze-like world. It interactively acquires the teacher’s language from question answering sentence-directed navigation. Some other studies (Sinha et al. 2019; Hermann et al. 2017; Hill et al. 2018) in this direction have also explored RL-based methods for spoken language learning.

In human–robot interaction, researchers have used audio-driven DRL for robot gaze control and dialogue management. In Lathuilière et al. (2019), the authors used Q-learning with DNNs for audio–visual gaze control with the specific goal of finding good policies to control the orientation of a robot head towards groups of people using audio-visual information. Similarly, the authors of Lathuilière et al. (2018) used a deep Q-network taking into account visual and acoustic observations to direct the robot’s head towards targets of interest. Based on the results, the authors showed that the proposed framework generates state-of-the-art results. Clark-Turner and Begum (2018) proposed an end-to-end learning framework that can induce generalised and high-level rules of human interactions from structured demonstrations. They empirically show that the proposed model was able to identify both auditory and gestural responses correctly. Another interesting work (Hussain et al. 2019) utilised a deep Q-network for speech-driven backchannels like laugh generation to enhance engagement in human–robot interaction. Based on their experiments, they found that the proposed method has the potential of training a robot for engaging behaviours. Similarly, Hussain et al. (2019) utilised recurrent Q-learning for backchannel generation to engage agents during human–robot interaction. They showed that an agent trained using off-policy RL produces more engagement than an agent trained from imitation learning. In a similar strand, Bui and Chong (2019) have applied a deep Q-network to control the speech volume of a humanoid robot in environments with different amounts of noise. In a trial with human subjects, participants rated the proposed DRL-based solution better than fixed-volume robots. DRL has also been applied to spoken language understanding (Zamani et al. 
2018), where a deep Q-network receives symbolic representations from an intent recogniser and outputs actions such as (keep mug on sink). In Qureshi et al. (2018), the authors trained a humanoid robot to acquire social skills for tracking and greeting people. In their experiments, the robot learnt its human-like behaviour from experiences in a real uncontrolled environment. Cuayáhuitl (2020) proposes an approach for efficiently training the behaviour of a robot playing games using a very limited amount of demonstration dialogues. Although the learnt multimodal behaviours are not always perfect (due to noisy perceptions), they were reasonable while the trained robot interacted with real human players. Efficient training has also been explored using interactive feedback from human demonstrators as in Moreira et al. (2020), who show that DRL with interactive feedback leads to faster learning with fewer mistakes than autonomous DRL (without interactive feedback).

Robotics plays an interesting role in bringing audio-based DRL applications together, including all or some of the above. For example, a robot may recognise speech and understand language (Zamani et al. 2018), be aware of emotions (Lakomkin et al. 2018), and carry out activities such as playing games (Cuayáhuitl 2020), greeting people (Qureshi et al. 2018), or playing music (Fryen et al. 2020), among others. Such DRL agents are currently trained independently, but we should expect more connectedness between them in the future.

4.7 Other applications

Besides the above-mentioned applications, DRL has also been explored in various audio-based domains including audio localisation, audio scene analysis, speech synthesis, soundscape, and bio-acoustics. In these domains, we found very few studies that focused on DRL. Speech synthesis, also known as text-to-speech (TTS), is an important audio technology that aims to generate human-like natural-sounding speech using text data as input (Latif et al. 2021). Most of the neural speech synthesis systems utilise linguistic or acoustic features as an intermediate representation to generate speech. In the speech synthesis domain, deep end-to-end models (e.g., Shen et al. 2018; Łańcucki et al. 2021; Ren et al. 2019) have attracted considerable attention by significantly enhancing the quality of synthesised speech (Latif et al. 2021). Recently, some studies have explored DRL for TTS. For instance, Liu et al. (2021) used RL for emotional speech synthesis. The authors focused on the use of RL to solve the problem of emotion confusion in TTS systems via interaction between the model and SER. In their experiments, they found that the proposed framework outperformed the state-of-the-art baselines by improving the emotion discriminability of synthesised speech. Mohan et al. (2020) used RL for deciding interleaved actions in sequence-to-sequence models for incremental TTS. Based on their results, the authors found that RL agents can successfully balance the trade-off between the quality of the synthesised speech and the latency of generation. Chung et al. (2021) presented Reinforce-Aligner, an RL-based alignment search agent that can perform optimal duration predictions based on the actions and cumulative rewards. Results showed that the proposed framework can perform accurate alignments of phoneme-to-frame sequences, which helps improve the naturalness and fidelity of synthesised speech. 
DRL has also been applied to bioacoustics (Ntalampiras 2018) and sound emotion (Huang et al. 2019), which need further research.

Research studies have also explored the potential of DRL for solving audio localisation problems. For instance, Giannakopoulos et al. (2021) trained an autonomous agent that navigates in a two-dimensional space using audio information from a multi-speaker environment. Based on their results, the authors found that the agent can successfully localise the target speaker among a set of predefined speakers in a room while avoiding confusion and without going outside the predefined room boundaries. Self-supervised learning (SSL) has been actively studied recently in various research fields including audio, text, vision, and many more (Xin et al. 2020; Latif et al. 2021, 2022). Gonzalez-Billandon et al. (2020) used a self-supervised RL-based iCub humanoid robot for speaker localisation in an autonomous way. During experimentation, they created a dataset of audio and location mapping which can be utilised to train an agent for accurate and robust speaker localisation. Seurin et al. (2020) presented an RL-based interactive speaker recognition system that aims to improve its performance by requesting personalised utterances to learn speaker representations. They empirically showed that the proposed architecture improves speaker identification compared to the non-interactive baseline models. Shah et al. (2018) presented FollowNet, a DRL agent that navigates following natural language directions. They empirically showed that FollowNet can successfully navigate by learning to execute previously unseen instructions with a 30% improvement in results over a baseline. Some other studies have also exploited DRL methods for audio-visual navigation (Chen et al. 2020, 2019; Gan et al. 2020) and source separation (Majumder et al. 2021).

In reinforcement learning applications, defining an appropriate reward function to achieve the desired behaviour is challenging. Inverse reinforcement learning (IRL) facilitates an automatic way of finding a reward function based on a given set of trajectories in the environment (Ng et al. 2000; Abbeel and Ng 2004). A few studies have explored IRL to impose a learnt reward function, instead of a manually defined one, for dialogue control (Sugiyama et al. 2012) and interactive systems. However, further research efforts are required in the audio domain for designing optimal reward functions.

5 Challenges in audio-based DRL

Fig. 6

A summary of audio-based DRL connecting the application areas and algorithms described in the previous two sections. The coloured circles correspond to the three groups of algorithms (from left to right: value-based, policy-based, model-based). Since a lack of connections between areas and algorithms denotes little or no attention in previous works, the large number of missing connections suggests opportunities for exploring different algorithms, or exploring them more comprehensively, in order to find the best algorithm(s) for different areas and for the tasks within each area

Fig. 7

A pictorial view of previous works on audio-based DRL and potential dimensions to explore in future systems. The inner cube indicates that dimension Z is less developed than the other two dimensions (X, Y)

The research works in the previous section have focused on a narrow set of DRL algorithms and have ignored the existence of many other algorithms, as can be noted in Fig. 6. This suggests the need for a stronger collaboration between core DRL and audio-based DRL, which may already be happening. Figure 7 helps to illustrate that previous works have only explored a modest part of the space of what is possible. Based on the related works above, we have identified three main challenges, one per dimension in Fig. 7, that need to be addressed by future systems. Those dimensions converge in what we call ‘very advanced systems’.

5.1 Real-world audio-based systems

Most of the DRL algorithms described in Sect. 3 carry out experiments on the Atari benchmark (Bellemare et al. 2013), where there is no difference between training and test environments. This is an important limitation in the literature, and it should be taken into account in the development of future DRL algorithms. Nonetheless, those efforts have been worthwhile for making progress in core DRL research, which has the potential to influence a large number of applications. In contrast, audio-based DRL applications tend to make use of a more explicit separation between training and test environments. While audio-based DRL agents may be trained from offline interactions or simulations, their performance needs to be assessed using a separate set of offline data or real interactions. The latter (often referred to as human evaluations) is very important for analysing and evidencing the quality of learnt behaviours. Learning behaviour offline is typically preferred for two main reasons: (i) the long training times needed to induce the best possible behaviour; and (ii) the need to avoid nonsensical or incoherent behaviour caused by the exploration strategies used during training, unless a mechanism is in place to ensure reasonable behaviour during online learning. In almost all (if not all) audio-based systems, the creation of data is difficult and expensive. This highlights the need for more data-efficient algorithms, especially if DRL agents are expected to learn from real data instead of synthetic data. In high-frequency audio-based control tasks, DRL agents are required to learn fast and to avoid repeating the same mistake. Real-world audio-based systems require algorithms that are sample efficient and performant in their operations. This makes the application of DRL algorithms in real systems very challenging. Some studies, such as Finn et al. (2017), Chua et al. (2018) and Buckman et al. (2018), have presented approaches to improve the sample efficiency of DRL systems. These approaches, however, have not been applied to audio-based systems. This suggests that much more research is required to make DRL more practical and successful in real audio-based systems.

5.2 Knowledge transfer and generalisation

Learning behaviours from complex signals like speech and audio with DRL requires processing high-dimensional inputs and performing extensive training on a large number of samples to achieve improved performance. The unavailability of large labelled datasets is indeed one of the major obstacles in the area of audio-driven DRL (Purwins et al. 2019; Latif et al. 2020). Moreover, it is computationally expensive to train a single DRL agent, and there is a need for training multiple DRL agents in order to equip audio-based systems with a variety of learnt skills. Therefore, some researchers have turned their attention to studying different schemes such as policy distillation (Rusu et al. 2015), progressive neural networks (Rusu et al. 2016), multi-domain/multi-task learning (Cuayáhuitl et al. 2017; Ultes et al. 2017; Li et al. 2015; Jaderberg et al. 2016) and others (Yin and Pan 2017; Nguyen et al. 2020; Glatt et al. 2016) to promote transfer learning and generalisation in DRL, with the aim of improving system performance and reducing computational costs. Only a few studies in dialogue systems have started to explore transfer learning in DRL for the speech, audio and dialogue domains (Mo et al. 2018; Carrara et al. 2018; Chen et al. 2018; Narasimhan et al. 2018; Ammanabrolu and Riedl 2019), and more research is needed in this area. When large amounts of data exist, one could opt to ignore knowledge transfer, but most of the time this is not the case. In the presence of small or medium-size datasets, it is worth considering the idea of transferring knowledge induced from other datasets to the one at hand. DRL agents are often trained from scratch instead of inheriting useful behaviours from other agents. Some agents from Table 4 [such as (Williams and Zweig 2016; Liu et al. 2017; Zorrilla et al. 2021)] have avoided learning from scratch by showing that applying DRL on top of non-DRL or supervised methods yields improved performance, due to the optimisation element that DRL brings instead of only mimicking demonstration data. But those systems typically focus on a single dataset, and the idea of transferring useful and effective knowledge from other/many tasks to a new or targeted task remains to be demonstrated. Research efforts in these directions would contribute towards more practical, cost-effective, and robust applications of audio-based DRL agents: on the one hand by training agents less data-intensively, and on the other by achieving reasonable performance in the real world.
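As an illustration of one of the transfer schemes mentioned above, the following is a minimal sketch of a distillation objective in the spirit of policy distillation (Rusu et al. 2015): a student network is trained to match the teacher's action distribution, obtained by applying a low-temperature softmax to the teacher's Q-values. The temperature value, the toy Q-values, and the function names are assumptions made for this example, and only the loss is shown, not the training loop.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def distillation_loss(teacher_q, student_logits, temperature=0.01):
    """Mean KL(teacher || student) over a batch of states.

    teacher_q:      array (batch, actions) of teacher Q-values
    student_logits: array (batch, actions) of student outputs
    temperature:    values below 1 sharpen the teacher distribution
    """
    p = softmax(np.asarray(teacher_q, dtype=float) / temperature)
    q = softmax(np.asarray(student_logits, dtype=float))
    eps = 1e-12  # avoids log(0) for near-deterministic distributions
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl))

# Hypothetical batch with one state and three actions
teacher = [[1.0, 2.0, 0.5]]      # teacher prefers action 1
agrees = [[0.0, 5.0, 0.0]]       # student that also prefers action 1
disagrees = [[5.0, 0.0, 0.0]]    # student that prefers action 0
```

Here `distillation_loss(teacher, agrees)` is smaller than `distillation_loss(teacher, disagrees)`, so minimising the loss pulls the student towards the teacher's behaviour. In a multi-task setting, the same loss would be averaged over batches drawn from several teachers.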

Table 4 Summary of research papers on dialogue systems trained with DRL algorithms

5.3 Multi-agent and truly autonomous systems

Audio-based DRL has achieved impressive performance in single-agent domains, where the environment stays mostly stationary. But in the case of audio-based systems operating in real-world scenarios, the environments are typically challenging and dynamic. For instance, multi-lingual ASR and spoken dialogue systems need to learn policies for different languages and domains. These tasks not only involve a high degree of uncertainty and complicated dynamics but are also situated in the real physical world, and thus have an inherently distributed nature. The problem therefore falls naturally into the realm of multi-agent RL (MARL), an area with a relatively long history that has recently re-emerged due to advances in single-agent RL techniques (Littman 1994; Hernandez-Leal et al. 2019). Coupled with recent advances in DNNs, MARL has been in the limelight for many recent breakthroughs in various domains including control systems, communication networks, economics, etc. However, applications in the audio processing domain are relatively limited due to various challenges. The learning goals in MARL are multidimensional because the objectives of all agents are not necessarily aligned. This situation can arise, for example, in simultaneous emotion and speaker recognition, where the goal of one agent is to identify emotions and the goal of the other agent is to recognise the speaker. As a consequence, these agents can independently perceive the environment and act according to their individual objectives (rewards), thus modifying the environment. This can bring up the challenge of dealing with equilibrium points, as well as some additional performance criteria beyond return-optimisation, such as the robustness against potential adversarial agents. As all agents try to improve their policies concurrently according to their own interests, the action executed by one agent affects the goals and objectives of the other agents (e.g. speaker, gender, and emotion identification from speech at the same time), and vice versa.
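The notion of equilibrium points with non-aligned objectives can be illustrated with a small two-agent matrix game. The sketch below enumerates the pure-strategy Nash equilibria of such a game, i.e. the joint actions from which neither agent can improve its own payoff by deviating alone. The payoff values and the interpretation of the actions (two agents choosing between two shared front-end configurations) are hypothetical and not taken from any of the reviewed systems.

```python
import numpy as np

def pure_nash_equilibria(payoff_a, payoff_b):
    """Return all (i, j) where agent A plays row i, agent B plays column j,
    and neither agent gains by unilaterally changing its action."""
    a = np.asarray(payoff_a, dtype=float)
    b = np.asarray(payoff_b, dtype=float)
    equilibria = []
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            a_cannot_improve = a[i, j] >= a[:, j].max()  # A varies its row
            b_cannot_improve = b[i, j] >= b[i, :].max()  # B varies its column
            if a_cannot_improve and b_cannot_improve:
                equilibria.append((i, j))
    return equilibria

# Hypothetical payoffs: both agents benefit from agreeing on a shared
# configuration (0 or 1), but each agent prefers a different one.
emotion_agent = [[2, 0],
                 [0, 1]]
speaker_agent = [[1, 0],
                 [0, 2]]
equilibria = pure_nash_equilibria(emotion_agent, speaker_agent)
```

In this toy game there are two equilibria, one favouring each agent, which is one reason why concurrent learners may settle on different outcomes depending on initialisation and exploration.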

One remaining challenging aspect is that of autonomous skill acquisition. Most, if not all, DRL agents currently require a substantial amount of pre-programming as opposed to acquiring skills autonomously to enable personalised/extensible behaviour. Such pre-programming includes explicit implementations of states, actions, rewards, and policies. Examples of pre-programming agents are as follows: implementing a particular combination of features derived from audio/word/sentence embeddings among others; implementing a particular set of dialogue actions instead of learned ones; implementing a particular reward function focused on optimising task success and dialogue length instead of other factors; and implementing a policy using purely learnt behaviour instead of a combination of rule-based and DRL-based, or supervised and DRL-based, behaviour, among others. Pre-programming is needed because the best representations for different tasks are unknown or only partially known. As agents become more advanced, those representations of states, actions, rewards and policies will be better known across tasks and therefore the amount of pre-programming will be reduced. Although substantial progress in different areas has been made, the idea of creating audio-driven DRL agents that autonomously learn their states, actions, and rewards in order to induce useful skills remains to be investigated further across applications. Such agents would have to know when and how to observe their environments, identify a task and input features, induce a set of actions, induce a reward function (from audio, images, or both among others), and use all of that to train policies. Such agents have the potential to show advanced levels of intelligence, and they would be very useful for applications such as personal assistants or interactive robots.
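As a concrete example of the kind of pre-programming discussed above, the following sketch shows a hand-coded dialogue reward that considers only task success and dialogue length. The bonus and penalty values and the function name are hypothetical, and a real system would tune or replace them.

```python
def dialogue_reward(task_success: bool, num_turns: int,
                    success_bonus: float = 20.0,
                    turn_penalty: float = 1.0) -> float:
    """Hand-coded episode reward: a fixed bonus for task success
    minus a fixed cost for every dialogue turn."""
    bonus = success_bonus if task_success else 0.0
    return bonus - turn_penalty * num_turns
```

Every design decision in such a function (which factors count, and how much) is fixed by the developer in advance, which is exactly what an autonomously learning agent would need to induce by itself.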

6 Summary of audio-based DRL research and future directions

This literature review shows that DRL is becoming popular in audio processing and related applications. We collected DRL research papers in six different but related areas: automatic speech recognition (ASR), speech emotion recognition (SER), spoken dialogue systems (SDSs), audio enhancement, audio-driven robotic control, and music generation. A summary of our findings for each area is given below.

  1. In ASR, most of the studies have used policy gradient-based DRL, as it allows learning an optimal policy that maximises the performance objective. We found studies aiming to solve the complexity of ASR models (Dudziak et al. 2019), tackle slow convergence issues (Williams 1992), and speed up the convergence in DRL (Rajapakshe et al. 2020).

  2. The development of SDSs with DRL is gaining interest and different studies have shown very interesting results that have outperformed current state-of-the-art DL approaches (Weisz et al. 2018). However, there is still room for improvement regarding the effective and practical training of DRL-based spoken dialogue systems.

  3. Several studies have also applied DRL to emotion recognition and empirically showed that DRL can (i) lower latency while making predictions (Lakomkin et al. 2018), (ii) understand emotional dynamics in communication (Sangeetha and Jayasankar 2019), and (iii) enhance human-computer interaction (Chen et al. 2017).

  4. In the case of audio enhancement, studies have shown the potential of DRL. While these studies have focused their attention on the speech signals, DRL can be used to optimise the audio enhancement module along with performance objectives such as those in ASR (Shen et al. 2019).

  5. In music generation, DRL can optimise rules of music theory as validated in different studies (Jaques et al. 2016; Guimaraes et al. 2017). It can also be used to search for new tone synthesis parameters (Lan et al. 2019). Moreover, DRL can be used to perform score following to track a musical performance (Dorfer et al. 2018), and it is even suitable for tracking real piano recordings (Henkel et al. 2019), among other possible tasks.

  6. In robotics, audio-based DRL agents are in their infancy. Previous studies have trained DRL-based agents using simulations, which have shown that reinforcement principles help agents in the acquisition of spoken language. Some recent works (Hussain et al. 2019, 2019) have shown that DRL can be utilised to train gaze controllers and speech-driven backchannels like laughs in human–robot interaction, and this is only the beginning of larger-scale embodied DRL-based agents.

The related works reviewed above highlight several benefits of using DRL for audio processing and applications. Challenges remain before such advancements will succeed in the real world, including endowing agents with commonsense knowledge, knowledge transfer, generalisation, and autonomous learning, among others (see Fig. 7). Such advances need to be demonstrated not only in simulated and stationary environments but also in real and non-stationary ones, as found in real-world scenarios. Steady progress, however, is being made in the right direction for designing more adaptive audio-based systems that can be better suited for real-world settings. If such scientific progress keeps growing rapidly, perhaps we are not too far away from AI-based autonomous systems that can listen, process, and understand audio and act in more human-like ways in increasingly complex environments. In Table 5, we compare different DRL toolkits in terms of implemented algorithms, with the aim of helping researchers to select suitable tools to study DRL techniques.

Table 5 Comparing DRL libraries based on the state-of-the-art implemented algorithms

7 Conclusions

In this work, we have presented a comprehensive review of deep reinforcement learning (DRL) techniques for audio-based applications. We reviewed DRL research works in six different audio-related areas: automatic speech recognition (ASR), speech emotion recognition (SER), spoken dialogue systems (SDSs), audio enhancement, audio-driven robotic control, and music generation. In all of these areas, the use of DRL techniques is becoming increasingly popular, and ongoing research on this topic has explored many DRL algorithms with encouraging results for audio-related applications. Apart from providing a detailed review, we have also highlighted (i) various challenges that hinder DRL research in audio applications and (ii) various avenues for exciting future research. We hope that this paper will help researchers and practitioners interested in exploring and solving problems in the audio and related areas using DRL techniques.