Reinforcement Learning-Based Spectrum Management for Cognitive Radio Networks: A Literature Review and Case Study

  • Marco Di Felice
  • Luca Bedogni
  • Luciano Bononi
Living reference work entry


In cognitive radio (CR) networks, the cognition cycle, i.e., the ability of wireless transceivers to learn the optimal configuration meeting environmental and application requirements, is considered as important as the hardware components which enable the dynamic spectrum access (DSA) capabilities. To this purpose, several machine learning (ML) techniques have been applied to CR spectrum and network management issues, including spectrum sensing, spectrum selection, and routing. In this paper, we focus on reinforcement learning (RL), an online ML paradigm where an agent discovers the optimal sequence of actions required to perform a task via trial-and-error interactions with the environment. Our study provides both a survey and a proof of concept of RL applications in CR networking. As a survey, we discuss pros and cons of the RL framework compared to other ML techniques, and we provide an exhaustive review of the RL-CR literature, by considering a twofold perspective, i.e., an application-driven taxonomy and a learning methodology-driven taxonomy. As a proof of concept, we investigate the application of RL techniques to joint spectrum sensing and decision problems, by comparing different algorithms and learning strategies and by further analyzing the impact of information sharing techniques in purely cooperative or mixed cooperative/competitive tasks.


A cognitive radio (CR) can be defined as a wireless device that is able to autonomously control its configuration based on the environmental conditions and on the quality of service (QoS) requirements of the applications [1]. Since its original proposal in 1999 [2], the node architecture has been considered the core novelty of a CR device, being the fusion of advanced dynamic spectrum access (DSA) functionalities at the radio level and of intelligent decision-making provided by a cognition module (CM) at the software level. Through the DSA, a CR is able to observe the network environment and to dynamically adjust transmission parameters like the operative frequency, the modulation and coding scheme, or the power level. To this purpose, the dynamic reuse of vacant portions of the licensed spectrum, in overlay or underlay mode, has emerged as the prominent use-case of the CR technology: CR devices, also known as secondary users (SUs), aim to maximally exploit all the available spectrum frequencies, including both licensed and unlicensed ones, without affecting the performance of the frequency owners, also known as primary users (PUs) [1]. The research literature on channel sensing techniques, required to detect PU-free transmission opportunities on the frequency and time domains, is vast [3, 4], as is the number of proposed network architectures and standards regulating the operations of the SUs [5, 6]. On top of the DSA module, the CM leverages the perceptions and measurements gathered during the sensing phase for the decision-making process, i.e., to properly adjust the radio configuration and plan the network operations, by means of advanced learning and reasoning functionalities [7].
For this reason, a significant portion of the literature on CR networking is investigating the utilization of machine learning (ML) techniques [8] for the device and network configuration, optimization, and planning; the ML approaches adopted so far are extremely heterogeneous and include supervised learning techniques (e.g., neural networks and Bayesian classifiers), unsupervised learning techniques, and dynamic games [9, 10, 11].

In this paper, we focus on reinforcement learning (RL) [12, 13], a well-known ML paradigm where the agent learns the optimal sequence of actions in order to fulfill a specific task via trial-and-error interactions with a dynamic environment; at each action performed, the agent observes its current state and receives a numeric reward, which quantifies the effectiveness of the action. The agent behavior, also known as the policy, should choose actions that tend to increase the long-term sum of rewards [13]. The literature on RL dates back to the 1960s [12] and comprises several different techniques and variants [14, 15, 16, 17]. The online nature of the learning process fits well the architecture of a CR device: the DSA module provides context-awareness via explicit feedback and channel measurements, and based on such rewards, the RL-CM is able to learn the optimal state-action mapping. Differently from supervised learning [8], RL algorithms might work without assuming any previous knowledge of the environment and of the reward function [11]. At the same time, an RL agent continuously adjusts its current policy based on the interactions with the environment: hence, policy adaptiveness is implicitly addressed also in dynamic and nonstationary environments.

This property is particularly interesting in CR networking scenarios, which are dynamic by nature due to the mobility of the SU devices, the PU activity patterns, and the likely varying propagation and traffic load conditions, and constitutes another significant advantage compared to traditional optimization approaches. Thanks to these benefits, several recent works have demonstrated that RL techniques can be applied to spectrum management issues [18, 19, 20], including channel sensing, channel selection, or power control problems, as well as to many CR network management issues, including routing, cooperation control, and security [21, 22]. At the same time, the application of RL techniques in CR scenarios poses a number of technical challenges, like the impact of the exploration phase on the system performance [23, 24] and the convergence in distributed environments characterized by the presence of SUs that compete for shared resources (e.g., channel frequency) while cooperating on keeping the aggregated interference below a predefined QoS threshold [25, 26].

This paper investigates the application of RL techniques to CR networking by providing two kinds of scientific contributions, i.e., (i) a survey of the RL-CR-related literature, which can serve also as a tutorial for readers approaching the topic for the first time, and (ii) a proof of concept of RL techniques on novel CR use-cases. Regarding the survey/tutorial, after a brief presentation of the RL theory and of the main algorithms, we discuss advantages and drawbacks of the RL framework for CR networking, and we compare it against other ML approaches. We then provide an up-to-date and exhaustive review of the RL-CR-related literature through a twofold taxonomy. The first taxonomy is based on the CR application domains, focusing on spectrum management and network configuration issues (i.e., spectrum sensing, decision, power allocation, and routing); within each category, we further classify the studies according to the goal being addressed. The second taxonomy is learning methodology-driven, i.e., we review the literature according to specific RL modeling features which are orthogonal to the application domain, like the modeling of the environment and of the reward function. Regarding the proof of concept, we describe a novel application of RL techniques to joint channel sensing and decision problems (hence, combining two research issues which are treated separately in the survey): more specifically, we show how the SUs can autonomously learn the optimal channel allocation, as well as the optimal balance of sensing/transmitting actions on each channel, so that the secondary network performance is maximized, while the harmful interference to PU receivers is kept below a QoS threshold.
We formulate the problem as an instance of a Markov decision process (MDP) [12, 13], and we test different algorithms (Q-learning and Sarsa) and learning models on two different task goals: independent learning and collaborative agents on a fully cooperative task (e.g., PU-SU interference minimization) and distributed coordinating agents on a mixed cooperative/competitive task (e.g., SU-SU and PU-SU interference minimization). The experimental results show that RL-based solutions can greatly enhance the performance in dynamic CR environments compared to non-learning-based solutions; at the same time, they unveil the impact of RL parameter tuning, knowledge sharing techniques, and algorithm selection, hence paving the way to further research on the topic.

The rest of the paper is structured as follows. Section “Related Works” reviews the existing surveys addressing ML and RL applications in CR networks and points out the novelties of this paper. Section “Overview of Reinforcement Learning” provides an overview of the RL theory, by introducing a taxonomy of the existing techniques and by also summarizing the operations of the most popular RL algorithms. Advantages and drawbacks of RL-CR approaches are discussed in section “Reinforcement Learning in Cognitive Radio Scenarios: Pros and Cons”. Section “Reinforcement Learning in Cognitive Radio Scenarios: Applications-Driven Taxonomy” reviews the existing RL-CR studies according to an application-driven taxonomy. The existing works are further classified by means of a learning methodology-driven taxonomy in section “Reinforcement Learning in Cognitive Radio Scenarios: Learning Methodology-Driven Taxonomy”. The case study is presented in section “Case Study: RL-Based Joint Spectrum Sensing/Selection Scheme for CR Networks”, together with the RL formulation, proposed algorithms and performance evaluation results. Conclusions are drawn in section “Conclusions and Open Issues”.

Related Works

The most comprehensive surveys investigating the applications of ML techniques to CR networking are probably [9, 10], and [11]. More specifically, [10] describes the existing applications of ML techniques to CR networking, considering both supervised and unsupervised learning techniques and including also the RL-based approaches. Moreover, the authors investigate the learning challenges in non-Markovian environments and discuss policy-gradient algorithms. An impressive review of model-free learning-based solutions in CR networks is presented in [11], where the existing works are grouped into three main categories, i.e.: (i) strategy-learning schemes based on single-agent systems, (ii) strategy-learning schemes based on loosely coupled multi-agent systems, and (iii) strategy-learning schemes in the context of games. In [9], the authors survey the ML-CR literature by considering an interesting distinction between learning aspects of cognition – which include RL and dynamic games – and reasoning aspects. The latter are in charge of applying inference on the acquired and the learned knowledge, hence enriching the current knowledge base; applications of policy-based reasoning to predict spectrum handover operations or to enhance spectrum opportunity detection are evaluated in a test-bed [9]. The strict relationship occurring between learning and reasoning in CR networks is also investigated in [7]. Focusing on the RL-CR literature, the authors of [20] demonstrate how the RL framework, and in particular the Q-routing algorithm, can be utilized as a modeling tool in four different problems, regarding dynamic channel selection (DCS), DCS and route selection, DCS and congestion control, and packet scheduling in QoS environments. Similarly, the authors of [18] show how three different CR problems (routing, channel sensing, and decision) can be modeled via the Markov decision process (MDP) introduced by the RL framework.
Applications, implementations, and open issues of RL techniques in CR networks are extensively discussed in [19], which is the work most similar to our paper. Our paper provides two additional contributions compared to [19]: (i) it provides an up-to-date review of the RL-CR literature from two different perspectives, i.e., a CR networking perspective and a learning perspective, and (ii) it evaluates gains and drawbacks of the RL framework on a realistic CR use-case addressing joint spectrum sensing and selection.

Overview of Reinforcement Learning

Reinforcement learning (RL) constitutes an area of machine learning (ML) [8] addressing the problem of an agent that must determine the optimal sequence of actions to perform over time, so that a predefined goal is achieved [12, 13]. Differently from supervised techniques, the learning process is based on trial-and-error interactions with the environment, i.e., at each action, the agent receives a numeric reward which is a proxy for its optimality (see Fig. 1). The optimal sequence is the one maximizing the summation of the expected rewards received by the agent over time.
Fig. 1

The reinforcement learning (RL) model

More formally, the RL problem can be modeled by using a discrete Markov decision process (MDP), represented by the tuple < S, A, R, ST >, where:
  • S is the (discrete) set of available States; let st denote the current state of the agent at time t.

  • A is the (discrete) set of Actions; let A(st) denote the set of actions available in state st.

  • \(R: S \times A \rightarrow \Re \) is the Reward function indicating the numeric reward received at each state/action; more specifically, let rt indicate the reward received by the agent while being in state st and executing action at ∈ A(st).

  • ST : S × A → S is the State Transition function, which indicates the next state \(s^{\prime }_{t+1}\) after executing action at ∈ A(st) from state st; in case of nondeterministic environments, the ST function is a probabilistic distribution over the set of actions and states, i.e., ST : S × A × S → [0 : 1].

Another component of the RL framework is the policy function π : S → A, which indicates, for each state st, the proper action at to execute. Similarly to the state transition function, also the policy function can be modeled as a probabilistic distribution over the set of actions and states, i.e., π : S × A → [0 : 1]. The goal of the agent is to discover the optimal policy π which maximizes a specific function of the received rewards over time. In the infinite-horizon discount model [13], the policy aims to maximize the long-run expected reward; however, it discounts the rewards received in the future, i.e.:
$$\displaystyle \begin{aligned} goal \rightarrow maximize \ E\left(\sum_{t=0}^{\infty}\gamma^t \cdot r_t \right) \end{aligned} $$
where 0 ≤ γ ≤ 1 is a factor discounting the future rewards. If γ = 0, the agent aims to maximize the immediate rewards.

In order to compute the optimal policy, several RL algorithms employ two additional data structures: the state-value function (Vπ) and the state-action function (Qπ) [12, 13]. For each state s ∈ S, the state-value function Vπ(st) represents the expected reward when following policy π from state st. The Vπ(st) value can be computed as follows:
$$\displaystyle \begin{aligned} V^\pi(s_t)=\sum_{a_t \in A(s_t)} \pi(s_t,a_t) \cdot \sum_{s' \in S} ST(s_t,a_t,s') \cdot \left(R(s_t,a_t) +\gamma \cdot V^\pi(s')\right) \end{aligned} $$
Analogously, the state-action function Qπ(st, at) represents the expected reward when the agent is in state st, executes action at, and then follows the policy π. More formally:
$$\displaystyle \begin{aligned} Q^\pi(s_t,a_t)= \sum_{s' \in S} ST(s_t,a_t,s') \cdot \left(R(s_t,a_t) +\gamma \cdot V^\pi(s')\right) \end{aligned} $$
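As an illustration of the two value functions above, the sketch below runs iterative policy evaluation on a toy two-state, two-action MDP; all the numbers (transition probabilities, rewards, and the uniform policy) are our own illustrative assumptions, not taken from the cited works.

```python
import numpy as np

# Toy MDP: ST[s, a, s2] = transition probability, R[s, a] = reward.
ST = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9
pi = np.array([[0.5, 0.5],   # pi[s, a]: probability of action a in state s
               [0.5, 0.5]])

# Iterative policy evaluation: repeatedly apply the Bellman expectation
# equation for V^pi until the values stop changing.
V = np.zeros(2)
for _ in range(1000):
    V_new = np.array([
        sum(pi[s, a] * sum(ST[s, a, s2] * (R[s, a] + gamma * V[s2])
                           for s2 in range(2))
            for a in range(2))
        for s in range(2)])
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
# V now holds the evaluated state values for policy pi.
```

Policy improvement would then replace π with the greedy policy with respect to the resulting Q values, alternating the two phases as in the DP algorithms described later.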
RL techniques can be classified into single-agent RL (SARL) and multi-agent RL (MARL) (see Fig. 2); the main characteristics of each approach are illustrated in the following sections.
Fig. 2

Taxonomy of reinforcement learning (RL) algorithms

SARL Algorithms

In a SARL framework, each agent acts independently and aims to maximize its long-run expected reward (Eq. 1). Different techniques have been proposed in order to determine the optimal policy π, including dynamic programming (DP), Monte Carlo-based, and temporal-difference (TD) learning algorithms. DP techniques assume a perfect knowledge of the environment, i.e., of the reward (R) and of the state transition (ST) functions; hence, the exact value of Vπ(⋅) can be computed by solving Eq. 2. The DP algorithms alternate between a policy-evaluation phase, during which the value of the current policy Vπ(s) is determined for each state s ∈ S, and a policy-improvement phase, where the current policy π is modified into π′ so that π′(s) = argmax_{a ∈ A(s)}Qπ(s, a) [12]. Monte Carlo methods do not assume the knowledge of the environment, but they are mainly used on episodic tasks [12]. Vice versa, TD methods implement an online, step-by-step learning process without assuming a model of the environmental dynamics. Among the several existing TD-based solutions, we cite the popular Sarsa and Q-learning algorithms [16, 17]: both update the Q-table after each received reward, until converging to the optimal Q values. More specifically, each time the agent chooses action at from state st (receiving reward rt) and action at+1 from the next state st+1, the Sarsa algorithm [17] updates the Q(st, at) entry as follows:
$$\displaystyle \begin{aligned} Q(s_t,a_t)=Q(s_t,a_t) + \alpha \cdot \left [ r_t+ \gamma \cdot Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right] \end{aligned} $$
where α is a learning rate factor. The Q-learning algorithm [16] employs a slightly different update rule, since it is independent from the policy being followed (offline policy learning), i.e.:
$$\displaystyle \begin{aligned} Q(s_t,a_t)=Q(s_t,a_t) + \alpha \cdot \left [ r_t+ \gamma \cdot \mathrm{max}_{a_{t+1} \in A(s_{t+1})}Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right] \end{aligned} $$
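The two update rules above can be written compactly; a minimal sketch follows, assuming a tabular Q stored as a dictionary keyed by (state, action) pairs (the function names are ours).

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: bootstrap with the action actually taken next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update: bootstrap with the greedy action in s_next."""
    best = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

# Example step: state 0, action 0, reward 1.0, next state 1.
Q = defaultdict(float)
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1, actions=[0, 1])
```

The only difference between the two rules is the bootstrap term: Sarsa uses the value of the action actually chosen in st+1 (on-policy), while Q-learning uses the maximum over the available actions (off-policy).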
Both Sarsa and Q-learning algorithms are guaranteed to converge to the optimal Q values, under the assumption that all state-action pairs are visited an infinite number of times and that the α factor is properly tuned [16, 17]. This poses a challenging trade-off between exploration and exploitation actions, i.e.: (i) insufficient exploration might prevent the convergence to the optimal Q values, while (ii) excessive exploration might cause performance fluctuations due to the selection of random actions. A well-known approach to balance exploration and exploitation actions is the Boltzmann distribution [12], which assigns a probability to each state-action pair as a graded function of the estimated Q(s, a) value:
$$\displaystyle \begin{aligned} p(s,a)=\frac{e^{Q(s,a)/TE}}{\sum_{a' \in A(s)}e^{Q(s,a')/TE}} \end{aligned} $$
where TE > 0 is the temperature parameter and controls the exploration/exploitation phases. Indeed, high temperature values cause the actions to be nearly equiprobable, while, as TE → 0, the greedy action a associated with the highest Q(s, a) value is always selected in each state s ∈ S.
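One possible implementation of Boltzmann action selection is sketched below, with the usual max-subtraction trick for numerical stability (the helper name is ours).

```python
import math
import random

def boltzmann_action(Q, s, actions, TE, rng=random):
    # p(s, a) is proportional to exp(Q(s, a) / TE); subtracting the maximum
    # before exponentiating avoids overflow without changing the distribution.
    qs = [Q.get((s, a), 0.0) / TE for a in actions]
    m = max(qs)
    weights = [math.exp(q - m) for q in qs]
    return rng.choices(actions, weights=weights, k=1)[0]
```

For small TE the greedy action dominates the draw; for large TE the weights flatten and all actions become nearly equiprobable, matching the two limits discussed above.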

MARL Algorithms

The MARL framework generalizes the MDP to the case of a multi-agent environment. Let N be the number of learning agents and Si and Ai be the state and action sets for agent i. The state of the MARL system at time t, \(\widehat {s_t}\), is then defined as the combination of the individual states of the agents, i.e., \(\widehat {s_t}=\{s_t^1, s_t^2, \ldots s_t^N\}\). Similarly, the system action \(\widehat {a_t}\) is defined as the combination of the individual actions performed by the agents, i.e., \(\widehat {a_t}=\{a_t^1, a_t^2, \ldots a_t^N\}\); based on \(\widehat {a_t}\) and \(\widehat {s_t}\), a vector of rewards is produced, i.e., \(\widehat {r_t}=\{r_t^1, r_t^2, \ldots r_t^N\}\). According to the way such rewards are computed, and to the interactions among the agents, [14, 15] further classify MARL techniques as fully cooperative, fully competitive, or hybrid. In the first case, all the agents receive the same reward, i.e., \(r_t^1=\ldots =r_t^N\), and the goal is to determine the optimal joint policy maximizing a common discounted return; although such a policy could also be determined via SARL techniques, assuming that all the agents keep the full Q-table of \(\widehat {s}\) and \(\widehat {a}\) values, most of the MARL algorithms work by decomposing the Q-table and introducing indirect coordination mechanisms [27]. In fully competitive MARL frameworks, a min-max principle can be applied; for instance, when N = 2, \(r_t^1=+\zeta \) and \(r_t^2=-\zeta \) [14].
Finally, hybrid MARL techniques apply to problems that are neither fully cooperative nor fully competitive, where the reward function might assume a complex shape depending on the joint action being implemented by the agents; this is the case, for instance, of agents competing for a shared resource, like SUs determining the optimal channels where to transmit while taking into account the interference caused by other players (we further investigate such a use-case in section “Case Study: RL-Based Joint Spectrum Sensing/Selection Scheme for CR Networks”). Hybrid MARL frameworks usually employ distributed coordination techniques derived from game theory. We do not further elaborate on MARL techniques; interested readers can refer to [14] and [15] for a detailed illustration.
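The motivation for decomposing the joint Q-table can be made concrete with a quick back-of-the-envelope sketch: with our own toy numbers (5 states and 3 actions per agent, not taken from the cited works), the joint table of \(\widehat{s}\)/\(\widehat{a}\) entries grows as (|Si|⋅|Ai|)^N and quickly becomes intractable as N increases.

```python
from itertools import product

def joint_table_size(n_states, n_actions, n_agents):
    # One Q-value per (joint state, joint action) pair:
    # |S_i|^N joint states times |A_i|^N joint actions.
    return (n_states * n_actions) ** n_agents

# Enumerating the joint action space for two SUs with 3 actions each:
joint_actions = list(product(range(3), repeat=2))   # 9 joint actions

# Table growth for 5 per-agent states and 3 per-agent actions:
sizes = {n: joint_table_size(5, 3, n) for n in (1, 2, 4, 8)}
```

Even with these small per-agent sets, eight agents already require billions of entries, which is why MARL algorithms favor per-agent tables combined with coordination mechanisms.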

Reinforcement Learning in Cognitive Radio Scenarios: Pros and Cons

In CR networking, the cognition cycle, i.e., the ability of wireless transceivers to learn the optimal configuration meeting the characteristics of the environment and the QoS requirements of the applications, is considered as important as the hardware components which enable the spectrum reconfiguration capabilities. To this purpose, several ML techniques have been applied to CR-related use-cases [10], like spectrum sensing, spectrum selection, or routing; besides the RL techniques, which are the main topic of the paper, we cite approaches based on game theory (GT), neural networks (NN), or Bayesian classifiers (BC). On the SCOPUS database, we counted around 400 scientific papers addressing ML-based approaches in CR networks: 23% of them are based on RL schemes, more than the supervised learning schemes but still fewer than the GT-based approaches. In any case, Fig. 3 shows that no single ML technique fits all problems. This is because RL techniques provide clear advantages but also formidable drawbacks when applied to CR-related use-cases. Regarding the advantages, RL techniques can be considered highly suitable for CR applications because of these characteristics:
  1.
    Experience-based Learning. In supervised learning, a cognitive agent must be instructed on how to perform a classification task by means of a knowledge base containing both positive and negative instances. In CR-related applications, building the knowledge base from real experiments can pose practical issues in terms of scalability and costs. Another issue pertains to the generalization of the learning process, i.e., to the problem of classifying novel instances which are considerably different from those occurring in the knowledge base. This aspect is particularly critical in CR environments, since the network performance is affected by a high number of parameters and by environmental conditions (like the PU activity model, the SU traffic load, the channel error rate, etc.); as a result, a transmitting policy learnt by a CR agent via supervised techniques might not be effective on a different network scenario or even on the same scenario in the presence of dynamic changes of the environmental conditions. Vice versa, RL techniques do not require the creation of a knowledge base; rather, they leverage trial-and-error interactions with the environment. In addition, some model-free algorithms like Sarsa and Q-learning [11, 16, 17] do not assume an a priori knowledge of the environmental dynamics (i.e., of the reward and state transition functions); as a result, the same learning algorithm deployed on different network scenarios can automatically discover differentiated transmitting policies, without any need for adaptation or tuning of the RL algorithm.
    Fig. 3

    Machine learning (ML) techniques utilized in the CR literature

  2.

    Context adaptiveness. Through the concepts of rewards and Q-values, the RL framework provides effective building blocks for implementing adaptive, spectrum-aware solutions. Indeed, since an RL agent continuously evaluates and improves its current policy, any change in the received reward might cause a policy switch, or it might trigger new exploration actions, hence leading to the discovery of better actions to perform in some states. Moreover, the presence of aggregated rewards can indirectly boost the context-awareness in another way. As already noted, the performance of CR networks can be affected by multiple factors, whose interactions might be difficult to model analytically. Instead of addressing a single factor at a time, an RL agent can observe all the factors as a state, receive an aggregate feedback (e.g., the cost of each transmission), and optimize a general goal as a whole, e.g., the throughput [28].

  3.

    Reduced complexity. In most cases, RL techniques provide a simple yet effective modeling approach [12]. Model-free RL algorithms like Q-learning or Sarsa require only storing the Q-table. The number of state-action values can be further reduced via function approximation techniques; an example related to CR spectrum management can be found in [29]. In addition, it is worth remarking that the update rule of the Q-learning or Sarsa algorithms can be implemented in a few lines of code. This feature makes RL techniques suitable also in resource-constrained environments, like CR-based sensor networks [30], where the wireless devices must face severe energy issues.

These advantages are counterbalanced by formidable drawbacks, which should be taken into account when working on CR networks, i.e.:
  1.

    Continuous Discovery. Properly balancing the exploration/exploitation phases is a unique challenge of the RL framework [23]. On the one hand, RL agents are required to perform random actions in order to explore the state-action space and then compute the optimal policy. In dynamic environments, the exploration phase cannot be ended after the boot phase; rather, it must be continuously performed over time. This is the case, for instance, of SUs aiming to learn the available spectrum opportunities in a multiband scenario; while transmitting on a PU-free channel, the SU should also keep track of the opportunities on other channels, so that a spectrum handoff can be quickly performed in case of PU appearance [31]. On the other hand, a random selection might translate into suboptimal actions being executed, e.g., into the selection of low-quality or PU-busy channels, and hence lead to temporary performance degradation; the degradation can even become permanent when the exploration phase is too short or too long. Hence, the optimal trade-off between exploration and exploitation can be complex to achieve, as investigated in section “Case Study: RL-Based Joint Spectrum Sensing/Selection Scheme for CR Networks”.

  2.

    Convergence Speed. Many RL techniques (especially time-discounted methods [12]) guarantee convergence to the optimal policy only if each action is executed in each state an infinite number of times. This is clearly not realistic for most CR applications; moreover, the fact that environmental conditions can quickly change over time poses additional requirements on the speed of convergence. The issue is further exacerbated in MARL scenarios, where the optimal joint action must be determined, e.g., in spectrum selection or power adaptation problems [32, 33] where the SUs should maximize their own performance while collectively mitigating the interference to PUs. For these reasons, MARL-based algorithms are often enhanced with GT mechanisms which guarantee the emergence of a Nash equilibrium under specific assumptions [25, 26].


Reinforcement Learning in Cognitive Radio Scenarios: Applications-Driven Taxonomy

In this section, we describe the applications of RL techniques in four different CR use-cases, i.e., spectrum sensing, power allocation, spectrum decision, and routing. For each use-case, we provide a taxonomy of the existing works, and we briefly discuss their technical contributions, by mainly focusing on the problem formulation through the MDP. Figure 4 depicts the classification adopted throughout this section.
Fig. 4

Applications-driven taxonomy of the RL-CR literature

Spectrum Sensing

In CR, spectrum sensing techniques play the crucial role of identifying the available spectrum resources for the SUs [1]. As a result, most research has focused on advanced signal processing schemes aimed at achieving robust PU detection under different signal-to-noise ratio (SNR) conditions [3]. Besides this, the scheduling of sensing actions is also a crucial task affecting the performance of the SUs [34], mainly because half-duplex radios cannot transmit on a channel while listening to it. The optimal sensing schedule can be determined via experiments and analytical models [4] or dynamically learnt via trial-and-error interactions with the environment [35]. Regarding the latter, existing RL-based sensing schemes can be further classified into individual or cooperative approaches. We discuss them separately in the following.

Individual Sensing Scheduling

Frequency and duration of the sensing phase constitute a challenging trade-off between PU detection accuracy and throughput of the secondary network: too long sensing intervals might cause buffer overflow and/or TCP time-outs, while too short sensing intervals might lead to poor throughput due to SU-PU collisions [34]. Moreover, SUs should periodically explore all the available channels in order to detect spectrum opportunities over time. For these reasons, recent studies like [36, 37] and [38] investigate the problem of how to optimally balance sensing, transmission, and exploration actions, so that the performance of the SU network is maximized, while the PU detection accuracy is always kept higher than a safety threshold. More specifically, the authors of [36] formulate the problem via a MDP defined as follows: the set of states S = {s1, s2, …, s|S|} represents the available (licensed) resources. On each channel s, a SU can perform three actions:
  • a1: sense channel s and transmit in case the channel is found idle (exploitation).

  • a2: sense channel s′≠ s (exploration).

  • a3: switch to channel s′≠ s (exploitation).

For actions a1 and a2, the reward is expressed by the number of PU-free subcarriers detected on the sensed channel; vice versa, for action a3 the reward is always equal to zero. The study in [37] extends such a formulation, by taking into account the channel switching delay in the reward of action a3 and by decoupling the transmit action from the sensing action; for sensing actions, the reward is equal to 1 in case of PU detection, 0 otherwise. Vice versa, the reward of transmit actions is computed as the average number of MAC retransmissions for each successful data transmission. The simulation results in [37] show that the proposed RL-based scheme is able to dynamically adjust the sensing frequency according to the perceived PU activity on each channel. Similarly, in [38], the authors aim to balance transmission and sensing actions on each channel; a cost function Cs(τ) is decreased each time a sensing action is performed on channel s and the latter is found idle. When Cs(τ) is lower than a threshold Γ, the SU can perform a transmission attempt; vice versa, if Cs(τ) > Γ, the SU defers its attempt and keeps sensing the channel. When the channel is found occupied by a PU, the cost function Cs(τ) is reset to a maximum value. In [39], the problem of determining the optimal sequence of channels sensed by each SU is formulated through the RL framework; here, a state is defined as an ordered couple < ok, fj >, where ok is the k-th position in the sensing order and fj is the channel sensed at that position in the current slot. At each state < ok, fj >, the list of available actions includes all the channels (not visited yet) which could be sensed at the next position of the sensing order (ok+1). The reward function for a specific sensing order takes into account the time spent sensing the channels and the transmission rate experienced by the SU on the selected channels [39].
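The cost-threshold mechanism described for [38] can be sketched as follows; the concrete values of the maximum cost, the threshold Γ, and the decrement step are illustrative assumptions, since they are not fixed in the text.

```python
# Illustrative constants (assumptions, not specified in [38]'s description).
C_MAX, GAMMA_THR, STEP = 10.0, 4.0, 1.0

def update_cost(cost, channel_idle):
    """Decrease the cost when channel s is sensed idle; reset it on PU detection."""
    if channel_idle:
        return max(cost - STEP, 0.0)
    return C_MAX  # PU detected: back off and keep sensing

def next_action(cost):
    """Transmit only once the cost has fallen below the threshold."""
    return "transmit" if cost < GAMMA_THR else "sense"
```

A run of consecutive idle observations gradually lowers the cost until a transmission attempt is allowed, while a single PU detection resets the cost to its maximum and forces the SU back into sensing.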

Cooperative Sensing Scheduling

Sensing techniques can be prone to errors in the presence of shadowing or multipath fading on the licensed channel under observation. For this reason, cooperative sensing techniques [3] aim to enhance PU detection by aggregating channel measurements from multiple SUs and averaging the gathered results. However, the network overhead might limit the cooperative gain: for instance, the transmission delay might be higher in the presence of cooperative sensing, since each SU should gather the measurements from its peers before taking a decision about spectrum availability. For this reason, studies like [40, 41] and [42] employ the RL framework to determine the optimal set of cooperating neighbors for each SU; the goal is to maximize the PU detection accuracy while avoiding unnecessary measurement sharing among correlated SUs. In [40], the set of states S for SU i coincides with the list of its neighbors, plus one start and one end state. There is an action allowing the transition between any pair of states; a sequence of actions corresponds to a list of cooperative sensing neighbors, i.e., the neighbors to query in order to get channel measurements. The reward function combines the amount of correlation among the gathered sensing samples with the total reporting delay. In [42], the authors investigate how to coordinate the sensing actions of a secondary network in order to meet the optimal trade-off between two goals: (i) the maximum number of spectrum opportunities is detected, and (ii) the probability of missed detection on each channel is kept below a safety threshold. Such probability is estimated based on the number of SUs currently sensing the channel. Since the SUs cannot directly observe the PU state on each channel, the sensing problem is formulated via a partially observable MDP (POMDP).
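The POMDP formulation in [42] hinges on the fact that the PU state can only be inferred from noisy sensing outcomes. A minimal Bayesian belief update, with hypothetical detection and false-alarm probabilities, illustrates the idea:

```python
def update_belief(belief, observed_busy, p_d=0.9, p_f=0.1):
    """One Bayes step for the belief that a PU occupies the channel.

    belief: prior P(PU present); p_d: detection probability;
    p_f: false-alarm probability (both values hypothetical).
    """
    if observed_busy:
        num = p_d * belief
        den = p_d * belief + p_f * (1 - belief)
    else:
        num = (1 - p_d) * belief
        den = (1 - p_d) * belief + (1 - p_f) * (1 - belief)
    return num / den

b = 0.5                              # uninformative prior
for obs in [True, True, False, True]:
    b = update_belief(b, obs)
```

Repeated "busy" observations drive the occupancy belief close to one even though a single observation is unreliable.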

Power Allocation

In both underlay and overlay CR spectrum paradigms, the SUs should properly tune their transmitting power levels so that the probability of generating harmful interference to any active PU is minimized. Differently from spectrum sensing, which can be considered an individual or, in the presence of knowledge sharing, a fully cooperative task, power allocation is natively a hybrid competitive/collaborative task, since the reward function, i.e., the aggregated interference perceived by PU receivers, depends on the joint action performed by the SUs, i.e., on the transmitting power level selected by each SU. For this reason, power allocation can be easily modeled via a MARL framework (see section “Overview of Reinforcement Learning”). A straightforward approach to determining the optimal power allocation consists in storing the complete MARL Q-table for each state/action/learning agent and computing the optimal \(\widehat {a_t}\) through Eq. 4. This methodology is employed in [43], assuming a centralized CR network with a single learning agent, i.e., the cognitive base station, which is in charge of determining the optimal power level of each SU based on the cumulative interference caused to the PUs. In distributed deployments, storing and updating the complete MARL Q-table at each SU might not be practical, especially when the number of learning agents (i.e., the SUs) increases. For this reason, most recent works employ decentralized MARL with two different approaches. In the first case (described in section “RL-Based Power Allocation Based on Information Sharing”), the SUs share rewards or rows of their Q-tables after each local action, so that the interference caused by the joint action \(\widehat {a_t}\) can be computed. 
In the other case (described in section “RL-Based Power Allocation Without Information Sharing”), each SU acts according to local information only, but the secondary network still aims to achieve global coordination, often expressed by the notion of Nash equilibrium (NE).
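A quick back-of-the-envelope computation shows why the complete joint-action table does not scale with the number of learning agents (all figures hypothetical):

```python
def joint_q_size(n_states, n_actions, n_agents):
    """Entries of a full joint-action Q-table: |S| * |A|^N."""
    return n_states * n_actions ** n_agents

def independent_q_size(n_states, n_actions):
    """Entries stored by an independent learner: |S| * |A|."""
    return n_states * n_actions

# 8 power levels and 10 SUs: the joint table is already intractable,
# while each independent learner keeps only 32 entries.
joint = joint_q_size(4, 8, 10)
local = independent_q_size(4, 8)
```

With 8 power levels and 10 SUs the joint table holds roughly 4.3 billion entries, which motivates the decentralized approaches described next.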

RL-Based Power Allocation Based on Information Sharing

In decentralized MARL frameworks, information sharing can serve two different objectives: (i) to speed up the learning process of the individual agents, by reducing the amount of exploration needed, or (ii) to favor the identification of the optimal joint actions at each learning agent. The docitive paradigm discussed in [44, 45] is an example of the first use-case; here, the learning agents are secondary base stations (femtocells) which must determine the optimal transmitting profile so that the aggregated interference at the primary receivers is kept below a specific threshold (SINRTh). The problem is modeled through the MDP defined as follows:
  • The state set is defined as the set of couples \(\{I_t^i,d_t^i\}\), where \(I_t^i \in \{0;1\}\) is a binary indicator specifying whether the CR base station i is generating an aggregated interference above or below the SINRTh threshold, and \(d_t^i\) is the approximated distance between i and the protection contour region of the primary system.

  • The action set coincides with the discrete set of power levels which can be assigned to each CR base station.

  • The reward function \(R(i)=(SINR_t^i - SINR_{Th})^2\) expresses the squared difference between the SINR value measured by SU i and the expected threshold.

In addition, docitive SUs can teach the discovered policies to other peers by sharing parts of their local Q-tables. In [44], three different information-sharing techniques are evaluated, namely, start-up docition, adaptive docition, and iterative docition. The latter (involving continuous sharing of the Q-table entries) maximizes the system performance, although it introduces the highest network overhead. In [45], the docitive scheme is extended to the case of partially observable environments, i.e., when the SUs lack complete information about the aggregated interference at the protection contour regions. A similar learning-from-experts MARL approach is followed in [46] and [47], which also introduce expertness measures estimating the amount of knowledge which can be transferred between each pair of SUs. More specifically, the authors of [47] address a generic joint power-spectrum selection problem; at each step, an agent can decide whether to stay idle, to switch to a different channel, or to increase/decrease the current transmitting power. The current state reflects the buffer occupancy of each SU, while the reward function is related to the energy efficiency of each action. Periodically, all the SUs update their Q-entries by considering a weighted combination of local information and received Q-values, i.e., \(Q_i^{\mathrm {new}}=\sum _j W_{i,j} \cdot Q_j^{\mathrm {old}}\), where Wi,j measures how much SU i relies on the knowledge produced by SU j. The scheme in [47] employs information sharing in order to speed up the individual learning process; however, there is no guarantee that the optimal joint action will be determined. In order to fulfill this second requirement, the authors of [48] propose a cooperative RL-based power allocation scheme aimed at controlling the aggregated interference generated by SU femtocells. The MDP model is similar to [44, 45]; however, each SU shares only a row of its Q-table. 
At each time-slot, a SU chooses the action ai maximizing the summation of the Q-values over the current states of all the N neighbors, i.e.:
$$\displaystyle \begin{aligned} a_i=\mathrm{argmax}_{a} \left(\sum_{1\leq j \leq N} {Q_j(s_j,a)}\right) \end{aligned} $$
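The shared-row selection rule above can be sketched as follows; the neighbor count, action space, and Q-values are hypothetical:

```python
def cooperative_action(q_rows):
    """Pick the power level maximizing the summed neighbor Q-values.

    q_rows: one row per neighbor j, each row being Q_j(s_j, a) for all
    actions a (the single Q-table row each SU shares for its current state).
    """
    n_actions = len(q_rows[0])
    totals = [sum(row[a] for row in q_rows) for a in range(n_actions)]
    return max(range(n_actions), key=totals.__getitem__)

# Three neighbors, four candidate power levels (hypothetical Q-values).
rows = [
    [0.2, 0.8, 0.1, 0.0],
    [0.3, 0.4, 0.2, 0.1],
    [0.1, 0.5, 0.9, 0.0],
]
chosen = cooperative_action(rows)
```

Here power level 1 wins because its summed value (1.7) exceeds every alternative, even though it is not the individual best choice of every neighbor.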

RL-Based Power Allocation Without Information Sharing

Differently from the previous works, studies like [32] and [33] address the power selection problem without assuming information sharing among the SUs; each SU tunes its transmitting power based on local (possibly inaccurate) interference measurements while, at the same time, observing or making conjectures about the behavior of the other SUs. Based on such estimations, each SU adjusts its transmitting power so that a global NE is achieved, i.e., a configuration where no SU can experience better rewards by following a different policy. The MDP is formulated as follows:
  • The state set is defined as the set of couples {Ii, pi(ai)}, where Ii ∈{0;1} is a binary indicator specifying whether the SINR of SU i is higher or lower than a predefined safety threshold and pi(ai) denotes the current power level.

  • The action set coincides with the discrete set of power levels which can be assigned to each SU.

  • The reward function R is a proxy for the energy efficiency of the transmission attempt, i.e., for the average number of bits received per unit of energy consumed; if Ii=1, the reward is set to zero.

Differently from [44, 45], each SU keeps internal conjectures about how the other SUs will react to its current action. More specifically, at each action performed in state st, SU i updates its Q-table as follows:
$$\displaystyle \begin{aligned} Q^{t+1}(s_i,a_i)=(1-\alpha) \cdot Q^t(s_i,a_i) + \alpha \cdot \left( \sum_{a_{-i}} \overrightarrow{c}^t(s_i,a_{-i}) \cdot r_t + \beta \mathrm{max}_{b_i}Q(s^{\prime}_i,b_i)\right) \end{aligned} $$
where \(\overrightarrow {c}^t(s_i,a_{-i})\) denotes the conjecture of SU i regarding the behavior of the other players and is updated as follows:
$$\displaystyle \begin{aligned} \overrightarrow{c}^{t+1}(s_i,a_{-i})=\overrightarrow{c}^{t}(s_i,a_{-i}) - \omega_i^{s_i,a_{-i}}\cdot \left[ \pi_i^{t+1}(s_i,a_i) - \pi_i^{t}(s_i,a_i)\right] \end{aligned} $$
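The two update rules above can be sketched as follows; the state/action names and all numeric values are hypothetical:

```python
def conjectural_q_update(q, conj, s, a, s_next, reward, alpha=0.1, beta=0.9):
    """Q-update of a conjecture-based learner (sketch of the first rule).

    conj[s] holds the conjectured probabilities of the opponents' joint
    actions a_{-i}; the immediate reward is weighted by them.
    """
    expected_r = sum(c * reward for c in conj[s].values())
    q[s][a] = (1 - alpha) * q[s][a] + alpha * (
        expected_r + beta * max(q[s_next].values()))

def conjecture_update(conj, s, a_minus_i, omega, pi_new, pi_old):
    """Revise a conjecture after observing our own policy change (second rule)."""
    conj[s][a_minus_i] -= omega * (pi_new - pi_old)

q = {0: {"low": 0.0, "high": 0.0}, 1: {"low": 1.0, "high": 0.5}}
conj = {0: {"opp_low": 0.7, "opp_high": 0.3}}

conjectural_q_update(q, conj, s=0, a="high", s_next=1, reward=2.0)
# (1 - 0.1) * 0 + 0.1 * (1.0 * 2.0 + 0.9 * 1.0) = 0.29
conjecture_update(conj, 0, "opp_low", omega=0.1, pi_new=0.6, pi_old=0.5)
```

A rising probability of playing ai (pi_new > pi_old) lowers the conjectured weight of the corresponding opponent action, mirroring the sign of the second equation.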

Spectrum Selection

Dynamic channel selection (DCS) constitutes the most investigated RL application in the field of wireless networking [20, 49, 50, 51, 52, 53]. In overlay CR networks, each SU must select the proper channel on which to transmit in order to fulfill two main requirements: (i) minimize the interference caused to PU receivers tuned on the same or adjacent spectrum bands (G0) and (ii) maximize its own performance, taking into account the channel contention and the MAC collisions caused by other SUs operating on the same band (G1). Moreover, the SUs should continuously execute channel selection in order to adapt to dynamic changes in the PU activities, in the traffic loads generated by the SUs, and in the propagation and channel state conditions. It is easy to notice that the RL framework fits well the requirements of such adaptive protocol design. G0 is usually addressed via the SARL techniques presented in section “SARL-Based DCS”. Vice versa, meeting both G0 and G1 requires some form of coordination among the SUs: for this reason, the DCS problem is modeled via MARL techniques enhanced with game-theoretic concepts, so that a stable channel allocation is achieved (details are provided in section “MARL-Based DCS”). Another way of classifying the existing RL-DCS schemes proposed in the literature is by focusing on the learning agent, i.e., on where the RL framework is implemented. The solutions presented in sections “SARL-Based DCS” and “MARL-Based DCS” refer to a scenario where channel selection is performed by each SU, and the PU is unaware of the presence of opportunistic users. Vice versa, in spectrum trading models, the PUs lease portions of their spectrum to the SUs, receiving in return a monetary revenue; the problem formulation through the RL framework allows determining the optimal portion of the spectrum band which can be leased to the SUs without compromising the QoS requirements of the primary network. 
Details about RL-based spectrum trading schemes are provided in section “RL-Based Spectrum Trading”.


SARL-Based DCS

This subcategory includes all the works where a SU learns in isolation the optimal sequence of channels on which to transmit, without receiving any explicit feedback from other SUs and without keeping any implicit model of the opponents' behavior. At the same time, the reward function is often modeled to reflect some network performance metric (e.g., throughput or delay) which is also affected by the joint strategy, i.e., by the channel selections performed by the other SUs. While this approach greatly simplifies the problem formulation, it might introduce oscillating behaviors when goal G1 is also taken into account: a SU might keep adjusting its operating channel as a consequence of the channel selections performed by the other SUs. SARL-based DCS schemes can be further classified into state-full and stateless approaches. In the first case, the RL framework contains both actions and states, hence following the traditional structure discussed so far. Examples of state-full SARL-based DCS schemes are presented in [20] and [49]. More specifically, in [49] the authors propose an opportunistic spectrum model in which each SU is associated to a home band (where it has the right to transmit), but it may also seek spectrum opportunities in the licensed bands (on condition of minimizing the interference caused to the licensed users). The DCS problem is modeled through the following MDP:
  • The state set \(S=\{s^i_0,s^i_1,\ldots ,s^i_M\}\) coincides with the list of available channels; \(s^i_0\) corresponds to the home channel of the SU, while \(s^i_1,\ldots ,s^i_M\) are the licensed channels.

  • The action set A = {a0, a1, …, aP} indicates the output of the channel selection process; a0 is the action of transmitting on the home channel, while the actions ai, i > 0, perform exploration, i.e., the SU will transmit on the M licensed frequencies by following a specific channel sequence.

  • The reward R is a function of the communication quality level, which can be determined via link metrics (e.g., SNR or packet success rate).

The simulation results in [49] show that the RL-based DCS scheme can greatly enhance the performance compared to random access schemes, also in the presence of PU load variability. In [20], the authors propose a joint RL-based DCS and congestion control scheme, which performs channel selection by taking into account the traffic load produced by each SU and the amount of PU activity on each band. This is achieved by enriching the definition of states of the RL framework; each state is a combination of four variables, which model, respectively, the amount of required bandwidth, the current data packet dropping probability, the amount of good white space in the current channel, and the amount of good white space across the various channels. Stateless SARL-based DCS schemes simplify the RL framework by omitting the states and considering only the action set A, which often coincides with the list of available channels; executing action ai corresponds to switching to frequency fi, sensing it, and transmitting in case no PU activity is detected. In [51], the Q-learning update rule of Eq. 4 is simplified to:
$$\displaystyle \begin{aligned} Q^i_{t+1}(a)=(1-\alpha)\cdot Q^i_{t}(a) + \alpha \cdot r^i_{t} \end{aligned} $$
Here, the reward \(r^i_{t}\) is the throughput experienced by SU i at time-slot t. In order to avoid oscillations in the learning process, sequential exploration is employed, i.e., only a single SU at a time can undergo exploration within a neighborhood. In [30], the authors propose three different RL-based DCS schemes, all based on the update rule of Eq. 10 but adopting three different formulations of the reward function, i.e., the transmission success rate in each epoch (the Q-learning+ scheme), the SINR metric (the Q-Noise scheme), and the SINR plus the historical behavior of the SUs (the Q-Noise+ scheme). A similar approach is also followed in [54], where the SUs aim to learn the optimal channel selection probability and the amount of PU activity on each channel. It is also worth noting that stateless RL frameworks can be considered instances of the multiarmed bandit (MAB) problem [55]. Several MAB-based DCS algorithms have been proposed in the literature. We cite, among others, the study in [56], where the authors apply two popular MAB schemes, namely the UCB and the WD techniques, to the DCS problem in CR scenarios, assuming error-free sensing and that the temporal occupation of each channel follows a Bernoulli distribution. The goal of the learning process is hence to learn the PU occupation probability of each channel while limiting the cumulative regret over time. The MAB framework of [56] is extended in [57] by considering cooperation techniques, aimed at improving the sensing accuracy, and coordination techniques, aimed at mitigating the impact of secondary interference.
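As an illustration of the MAB view, the following sketch applies the standard UCB1 index (one plausible instance of the UCB family cited above, not necessarily the exact variant of [56]) to Bernoulli channel availability; the idle probabilities are hypothetical:

```python
import math
import random

random.seed(7)

P_IDLE = [0.2, 0.5, 0.8]          # hypothetical Bernoulli idle probabilities

counts = [0] * len(P_IDLE)        # how many times each channel was tried
means = [0.0] * len(P_IDLE)       # empirical idle rate per channel

for t in range(1, 3001):
    if t <= len(P_IDLE):          # UCB1: play each arm once first...
        k = t - 1
    else:                         # ...then maximize mean + exploration bonus
        k = max(range(len(P_IDLE)),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
    reward = 1 if random.random() < P_IDLE[k] else 0   # channel found PU-free?
    counts[k] += 1
    means[k] += (reward - means[k]) / counts[k]        # incremental mean
```

The exploration bonus shrinks as a channel accumulates samples, so play concentrates on the most frequently idle channel while the regret from the others grows only logarithmically.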


MARL-Based DCS

In [52] and [53], the authors formulate the DCS problem through a MARL framework, extending the previous SARL formulation [51]; the goal of the SUs is to discover the optimal joint action addressing both the G0 and G1 requirements. This is performed via a payoff propagation mechanism: each SU i maintains – in addition to the Q-table – a μ-table of size |Γ(i) × A|, where Γ(i) denotes its set of neighbors and A is the set of actions, which coincides with the channel list. Each time SU i plays action ak (i.e., switches to channel ak), it transmits a payoff message including its \(Q^i_{t+1}(a_k)\) value, and all the other SUs j ∈ Γ(i) store such value in their μ-tables. When selecting the next channel \(\widehat {a}_{t+1}\), SU i takes into account both the local Q-table and the payoff table, i.e.:
$$\displaystyle \begin{aligned} \widehat{a}_{t+1}=\mathrm{argmax}_{a \in A} \left( Q^i_t(a) + \sum_{j \in \varGamma(i)}{\mu_{ji}(a)} \right) \end{aligned} $$
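The payoff-propagation selection rule above can be sketched as follows; the local Q-values and the received payoff messages are hypothetical:

```python
def select_channel(q_local, mu_msgs):
    """Choose the channel maximizing local Q plus neighbors' payoffs.

    q_local: Q^i(a) per channel; mu_msgs: one dict per neighbor j with
    the propagated payoffs mu_ji(a).
    """
    channels = range(len(q_local))
    return max(channels, key=lambda a: q_local[a] + sum(m[a] for m in mu_msgs))

q_local = [0.4, 0.9, 0.2]
mu_msgs = [{0: 0.5, 1: 0.0, 2: 0.1},
           {0: 0.3, 1: 0.0, 2: 0.2}]
chosen = select_channel(q_local, mu_msgs)
```

Channel 0 wins (0.4 + 0.5 + 0.3 = 1.2) even though channel 1 has the best local Q-value, which is exactly how payoff propagation discourages decisions that hurt the neighborhood.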
The simulation results show that the MARL-based and SARL-based DCS schemes provide similar levels of throughput, although the MARL-based scheme greatly reduces the number of channel switching operations. The convergence of MARL-based DCS schemes to a Nash equilibrium (NE) is investigated in [25] and [58]. More specifically, in [58] the authors consider a simplified SU interference model where at most one SU can operate on each channel and demonstrate that a Q-learning-based DCS scheme can converge to a stationary channel allocation without any SU cooperation and regardless of the initial allocation. The result holds only under the assumption that all the SUs have perfect knowledge of the complete system state, i.e., of the PU occupancy of all the available channels. In [25], the authors remove such assumption and propose a probabilistic DCS scheme which is demonstrated to converge to a NE. To this purpose, each SU updates its channel selection probability vector pt after each transmission attempt on channel k according to a linear reward-inaction model, i.e.:
$$\displaystyle \begin{aligned} p_{t+1}=p_{t} + r_t \cdot (e_k - p_{t}) \end{aligned} $$
where rt is a function of the SINR metric perceived by the SU receiver, and ek is the unit vector associated to channel k [25].
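The linear reward-inaction step, written for the whole selection-probability vector, can be sketched as follows (values hypothetical; the reward is assumed normalized to [0, 1]):

```python
def reward_inaction_update(p, k, r):
    """Linear reward-inaction step after playing channel k with reward r.

    Moves the whole selection-probability vector toward the unit
    vector e_k in proportion to the normalized reward r.
    """
    return [p_j + r * ((1.0 if j == k else 0.0) - p_j)
            for j, p_j in enumerate(p)]

p = [0.25, 0.25, 0.25, 0.25]       # uniform initial channel probabilities
p = reward_inaction_update(p, k=2, r=0.5)
```

Because every component moves by the same fraction r toward e_k, the result remains a valid probability distribution, and a zero reward ("inaction") leaves the vector unchanged.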

RL-Based Spectrum Trading

Spectrum trading can be considered a variant of the DCS problem where spectrum operations involve both the PU, who is in charge of deciding the amount of frequencies to lease to the SUs, and the SUs, who can request specific portions of the spectrum. In [59], the authors propose a RL-based scheme which helps a PU decide which requests to accept and which to reject, assuming that the SUs belong to different classes mapped onto different QoS requirements. The MDP formulation is as follows:
  • the state set S has one component per SU traffic class; the value of si is the number of accepted SU requests belonging to class i.

  • the action set A = {0, 1} includes only two choices, corresponding to accepting or rejecting a new incoming request.

  • the reward function R = P − C combines the expected monetary profit P paid by the SU with the cost C, which is proportional to the number of already leased channels.

In addition, the authors of [59] show how to dynamically adjust the spectrum price and the size of the leased spectrum over time, based on the dynamic SU traffic load conditions. In [60], the RL-based spectrum leasing problem is inverted, i.e., the SUs learn to improve their bidding policy in the spectrum auction game, using the transmission capability of each channel as the immediate reward.
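The R = P − C reward above can be illustrated with a minimal sketch; the profit and cost figures, and the linear cost model, are hypothetical:

```python
def trading_reward(profit_per_channel, cost_per_leased, n_leased, accept):
    """Reward R = P - C for a PU deciding on a SU leasing request.

    P is the profit paid by the SU; C grows with the number of
    channels already leased (all values hypothetical).
    """
    if not accept:
        return 0.0
    return profit_per_channel - cost_per_leased * n_leased

early = trading_reward(10.0, 2.0, n_leased=2, accept=True)   # still profitable
late = trading_reward(10.0, 2.0, n_leased=7, accept=True)    # cost dominates
```

Accepting is rewarded while few channels are leased, but becomes a penalty once the leased spectrum saturates the cost term, which is what steers the learned admission policy.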

Spectrum-Aware Routing

In multi-hop wireless networks, routing protocols are in charge of discovering a feasible path between any source and destination node; the path creation is performed through a distributed node selection process guided by end-to-end/global (e.g., delay) or link-by-link/local (e.g., SNR) metrics. In CR networks, routing protocols must address additional challenges caused by the dynamic variation of spectrum opportunities, namely, (i) the need to select forwarding nodes so that the interference caused to PU receivers is minimized and (ii) the need for fast rerouting mechanisms, so that alternative backup paths can be used when the main path is invalidated by the appearance of a PU [61]. For this reason, several existing routing schemes for CR networks address joint node and frequency selection [61, 62]. The routing problem can be easily modeled in the RL framework: each CR node must learn the optimal next-hop toward the destination via trial-and-error interaction. At each data transmission, the SU receives a reward which is a proxy for the forwarding cost, like the mean access delay or the amount of energy consumed. Changes in the environment, like the appearance of a PU or the mobility of the SUs, are reflected in changes of the received feedback, which in turn translate into the selection of an alternative path. The MDP for a generic RL-enabled routing protocol can be described as follows:
  • The state set S coincides with the set of SU nodes NSU in the network.

  • The action set Ai is defined for each node i ∈ NSU; more specifically, \(a_{j}^{(s,d)} \in A_{i}\) denotes the action of forwarding data toward next-hop j, where s and d are, respectively, the source and destination communication end-points.

  • The reward \(R(i,a_{j}^{(s,d)})\) is a network metric reflecting the effectiveness for node i of using j as next-hop node toward the destination d.

The above model has been implemented by Q-routing [63], a popular routing protocol for dynamically changing networks, also applied to generic multi-hop wireless ad hoc networks [64]. In Q-routing, each node i maintains a table of Q-entries for each destination d; the entry Qi(j, d) is the expected delivery time toward destination d when using next-hop node j. After forwarding a packet via node j toward destination d, node i updates its Q-table as follows [63]:
$$\displaystyle \begin{aligned} Q_{i}(j,d)= q_{i}+\delta+\min_{z}Q_{j}(z,d) \end{aligned} $$
where qi is the time spent by the packet in the queue of node i, δ is the transmission delay on the i − j link, and minzQj(z, d) is the best delivery time estimated at node j for destination d. The same learning framework as Q-routing has also been adopted by several CR routing protocols, like [21, 65] and [66], although properly adapting the reward function to the CR scenario. In [65], the reward \(R(i,a_{j}^{(s,d)})\) is the per-link delay, which also takes into account the retransmissions caused by SU-PU collisions. In [21], the reward function is an estimation of the average channel available time, i.e., the average OFF period of the PUs interfering over the bottleneck link along the route from j to d. In addition, the authors of [21] investigate the performance of the proposed RL routing protocol in a real test-bed environment using USRP platforms; the experimental results demonstrate that the RL scheme provides better results than a greedy approach in terms of end-to-end metrics (i.e., throughput and packet delivery ratio). A multi-objective Q-routing scheme for CR networks is discussed in [66]; more specifically, the proposed algorithm aims to minimize the packet loss rate under a desired transmission delay constraint. The multi-objective behavior is implemented by employing two rewards for each successful transmission (i.e., loss rate and delay) and by storing two separate Q-values at each node. The authors of [28] propose two RL-based spectrum-aware routing protocols for CR networks. Here, the Qi(j, d) value denotes the number of available PU-free channels on the route from SU i to SU d via SU j; the SU j providing the highest Q-value is the preferred next-hop candidate. The Q-values are updated after each successful transmission, using a dual RL algorithm. In [67], the RL framework is used to properly tune the transmission parameters of the popular AODV routing protocol. 
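The Q-routing update discussed above can be sketched as follows; note that, unlike the simplified formula, the original algorithm [63] blends the new delay sample into the old estimate through a learning rate, shown here as alpha (the table contents are hypothetical):

```python
def q_routing_update(q, i, j, d, queue_delay, link_delay, alpha=0.5):
    """One Q-routing step: node i refreshes its estimate for next-hop j.

    q[n] maps (next_hop, dest) -> estimated delivery time at node n;
    the new delay sample is q_i + delta + min_z Q_j(z, d).
    """
    best_at_j = min(v for (z, dest), v in q[j].items() if dest == d)
    sample = queue_delay + link_delay + best_at_j
    q[i][(j, d)] += alpha * (sample - q[i][(j, d)])
    return q[i][(j, d)]

q = {
    "i": {("j", "d"): 10.0},
    "j": {("z1", "d"): 4.0, ("z2", "d"): 6.0},
}
# Queue delay 1, link delay 2, best estimate at j is 4 -> sample 7.
new_estimate = q_routing_update(q, "i", "j", "d", 1.0, 2.0)
```

Each forwarded packet thus pulls node i's estimate toward the delay actually experienced plus the downstream node's own best estimate.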
A different RL formulation for CR routing is proposed in [68], considering both the delay minimization requirement of each SU-SU flow and the interference minimization requirement of each SU-PU link and assuming no cooperation occurs among the SUs. The MDP is defined as follows:
  • The state set of SU i is defined as the tuple: < ηi(t), λ1(t), λ2(t), …, λ|PU|(t) > , where ηi(t) is the packet arrival rate of SU i and λx(t) is the packet transmission rate of PU x at time t.

  • The action set of SU i, \(A=\{a_0,NH_i^1,NH_i^2,\ldots,NH_i^k\}\), includes the no-forwarding action a0 and the transmission toward next-hop node \(NH_i^j\).

  • The reward function is equal to the delay experienced by packets flowing from SU i to the destination node, in case the interference caused to PUs is kept below a safety threshold; it is set to a large value in case the no-forwarding action is selected or in case the interference caused to the PUs is higher than the threshold.

Since no information sharing is assumed, each SU forms conjectures on the routing strategies of the other SUs by using local observations of the environment; the convergence of the proposed routing protocol is also investigated in [68] through an analytical framework and network simulations.

Reinforcement Learning in Cognitive Radio Scenarios: Learning Methodology-Driven Taxonomy

While the aim of the previous section was to analyze the existing literature from a networking perspective, in this section we provide an alternative review from the learning perspective, i.e., on the way different components of the RL framework are modeled for CR-related issues and on the overall evaluation method adopted. To this purpose, we decompose the learning process into six steps, which are depicted in Fig. 5 and discussed separately in the following:
  • State representation. The state of a SU is often modeled through a single discrete variable or a combination of discrete variables. However, the reviewed RL-CR studies differ on whether the state variables are fully or only partially observable by the SU. In the first case, the state variable is expressed by parameters which are internal to the SU or by network conditions which can be measured by the DSA module without perception errors. This is the case, for instance, of the MDP model proposed in [44], where each state takes into account both internal (i.e., the current distance from the PU) and network metrics (i.e., the aggregated interference caused by the secondary network). Vice versa, a minority of the cited studies take into account the impact of perception errors on the network observation: we cite, for instance, the MDPs proposed in [41] and [42], where the SU state is the belief that a given frequency is vacant, hence subject to the accuracy of the sensing scheme.
    Fig. 5

    Application-driven taxonomy of the existing RL-CR literature

  • Model representation. Almost all of the proposed RL-CR solutions employ model-free strategies, with very few exceptions [68, 69]: the agent does not keep any representation/estimation of the state transition and reward functions (the T and R functions in section “Overview of Reinforcement Learning”); rather, it updates the Q-table after each immediate reward through the popular Q-learning or Sarsa algorithms. This choice can be justified since, in several use-cases like DSA problems, the reward values are associated to network metrics (e.g., the actual throughput or the SNR) which are stochastic by nature and whose trends are hard to predict without full knowledge of the network and channel conditions; moreover, both the state transition and the reward functions can dynamically vary in nonstationary environments due, e.g., to SU or PU mobility. For this reason, most of the works prefer to adjust the policy as a blind consequence of the received reward, instead of attempting to unveil the rules behind it. Some foundational results on this topic are provided in [70], where the authors investigate the relationship between the learning capabilities of the SUs in RL-DCS applications and the complexity of the PU activity pattern, measured through the Ziv-Lempel complexity metric. The experimental results demonstrate that, for specific levels of Ziv-Lempel complexity, the PU spectrum occupancy pattern can be learnt effectively by the SUs, hence justifying the utilization of model-based solutions.

  • Reward representation. The modeling of the reward function clearly depends on the specific CR use-case. However, we can distinguish between two main approaches: absolute representation and communication-aware representation. In the first case, the reward is a scalar value which can assume positive or negative values in order to encourage good actions or to penalize bad ones, but it is not related to any network metric. This is the case, for instance, of the RL-DCS scheme proposed in [29], where different rewards are introduced according to the outcome of each SU-SU transmission (i.e., successful, failed due to CR-PU interference, failed due to CR-CR interference, or failed due to channel errors). Vice versa, in communication-aware schemes, the reward takes into account node-related (e.g., the energy efficiency in [47]), link-related (e.g., the SNR in [43]), or network-related (e.g., the throughput in [20]) metrics. The clear advantage of this second approach is that the Q-table converges over time to the actual system performance for the selected metric; at the same time, this might introduce additional protocol complexity, especially in the presence of aggregated or cumulative metrics (e.g., the end-to-end path delay in Q-routing [63]).

  • Action Selection. Strategies for action selection play a crucial role since they are in charge of balancing the exploration/exploitation phases, which in turn affect the performance of the RL-based solutions. Two main strategies have been employed in the RL-CR literature reviewed so far: the Boltzmann rule, which is based on Eq. 6 and relies on the temperature parameter TE to balance the exploration/exploitation phases, and the ε-greedy rule, which selects the optimal action with probability 1 − ε, while it performs random actions with probability ε. Both strategies might guarantee adaptiveness to nonstationary environments; however, the way the TE and ε parameters are set and discounted over time is barely addressed, except for [23, 24]. In more detail, [24] proposes an interesting self-evaluation mechanism which is added to a basic RL-DSA framework: each time the SU receives a predefined number of negative rewards in exploitation mode, it assumes that there has been a change in the environment and reacts by forcing an aggressive channel exploration phase.

  • Knowledge Sharing. In several CR use-cases modeled through a MARL framework, the SUs can share learnt information in order to speed up the exploration phase or to implement distributed coordination mechanisms. From the analysis presented in section “Reinforcement Learning in Cognitive Radio Scenarios: Applications-Driven Taxonomy”, we can further classify the existing MARL-CR works into three major families: (i) no-sharing, (ii) reward-based, and (iii) Q-table-based. The first category includes all works where the SUs update their Q-tables independently and without receiving any feedback from the other peers, although the instantaneous reward might depend on the joint action executed by the secondary network (e.g., the throughput in [30]). We also include in this group centralized approaches, where the global Q-table is managed by a network coordinator (e.g., the cognitive base station in [43]), and solutions where each SU keeps conjectures about the future behavior of the other SUs [32, 68]. The remaining two families include approaches like the docitive [44, 45] or payoff propagation paradigms [52, 53], where the SUs share their immediate rewards or rows of their Q-tables. The received data are then merged with the local data, by using expertness measures controlling the knowledge transfer [46, 47] or action selection methods for achieving distributed coordination [48]. The impact of knowledge sharing on MARL-DCS problems is further investigated in section “Case Study: RL-Based Joint Spectrum Sensing/Selection Scheme for CR Networks”.

  • Evaluation method. The performance of RL-based solutions can be investigated through simulation studies, testbeds, or analytical models. The first two methods allow understanding the network performance gain introduced by RL techniques compared to non-learning approaches: to the best of our knowledge, [21, 71] and [72] are the only experimental studies in the literature. More specifically, [21] investigates the ability of a RL-enhanced routing protocol to select PU-free routes in a network environment consisting of ten USRP SU nodes, while [71] and [72] implement a RL-based DCS algorithm, respectively, over GNU Radio and USRP N210 platforms, and evaluate the way CR devices are able to learn the PU spectrum occupancy patterns. Both [21] and [71] confirm the effectiveness of RL-based solutions compared to state-of-the-art (non-ML-based) approaches. Analytical studies like [58] investigate the convergence of the proposed RL algorithms to the optimal solution. Such theoretical results can be considered highly relevant from a pure scientific perspective, but less practical in real-world network deployments, since the convergence property is asymptotic and does not account for the impact of the exploration phase on the short-term system performance.
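As an illustration of the first point above, the two action-selection rules can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions: the function names are hypothetical, ties in the argmax are broken by index, and the Boltzmann weights follow the usual exp(Q/TE) form implied by Eq. 6.

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """Pick the best-known action with probability 1 - epsilon,
    a uniformly random action with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature):
    """Sample an action with probability proportional to exp(Q/TE):
    a high temperature TE flattens the distribution (exploration),
    a low one concentrates it on the best action (exploitation)."""
    weights = [math.exp(q / temperature) for q in q_values]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for action, w in enumerate(weights):
        acc += w
        if r < acc:
            return action
    return len(q_values) - 1
```

With ε = 0 (pure exploitation) or a very low temperature, both rules reduce to selecting the action with the highest Q-value, which is why discounting ε and TE over time gradually shifts the agent from exploration to exploitation.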

Case Study: RL-Based Joint Spectrum Sensing/Selection Scheme for CR Networks

In this section, we describe an application of RL techniques to CR networks, in order to highlight the gains and drawbacks of different RL algorithms and to investigate the impact of the learning parameters on the system performance. By referring to the taxonomy presented in section “Reinforcement Learning in Cognitive Radio Scenarios: Applications-Driven Taxonomy”, we consider here a joint spectrum sensing/selection (JSS) problem, in which a SU must learn the optimal channel on which to transmit among the available frequencies and also the optimal balance between sensing and transmit actions on each channel. In section “System Model”, we introduce the system model and the problem goals. The problem is formulated by using the RL framework in section “RL-Based Problem Formulation”. Then, we evaluate the performance of RL-based solutions by neglecting the impact of secondary interference (section “Analysis I: SU-PU Interference Only”). This assumption is removed in section “Analysis II: SU-PU and SU-SU Interference”.

System Model

We model a generic network scenario composed of N couples of SUs operating within the same sensing domain. Each SU is equipped with a DSA transceiver, able to switch over K frequencies of the licensed band and over a common control channel (CCC) implemented in the unlicensed band. Each couple i is formed by one SU transmitter (\(SU_{i}^{tx}\)) and one SU receiver (\(SU_{i}^{rx}\)). Data packets are transmitted over a licensed channel, while the signaling traffic is transmitted over the CCC. On each frequency fj, there is an active PU which transmits according to an exponential ON/OFF distribution with parameters < αj, βj > . Hence, frequency fj is vacant with stationary probability equal to \(\frac {\beta _j}{\alpha _j+ \beta _j}\), while it is occupied by the PU with probability equal to \(\frac {\alpha _j}{\alpha _j+ \beta _j}\). In addition, we model the packet error rate (PER) on each channel; let φj be the PER of channel fj. Each \(SU_{i}^{tx}\) can implement three different types of time-slot:
  • Sensing slot, i.e., \(SU_{i}^{tx}\) senses the frequency to which it is currently tuned, in order to determine the PU presence. The sensing length is equal to tslot. We assume a default energy-detection sensing scheme [3]: let pD indicate the probability of correct detection and 1 − pD the probability of sensing errors (including both false alarms and missed detections).

  • Transmit slot, i.e., \(SU_{i}^{tx}\) attempts to transmit exactly one packet to \(SU_{i}^{rx}\) by using a CSMA MAC scheme. In case the MAC ACK frame is not received, \(SU_{i}^{tx}\) retransmits the packet up to a maximum number of attempts equal to MAX_ATTEMPTS; after that, the packet is discarded.

  • Switch slot, i.e., \(SU_{i}^{tx}\) switches to a different licensed frequency and communicates the new channel to the \(SU_{i}^{rx}\) on the CCC. Let tswitch represent the time overhead required for the handover.
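As a side illustration of the channel model above, the snippet below simulates a single exponential ON/OFF PU and estimates the fraction of time the channel is vacant. It relies on our reading of the < αj, βj > parameters as the mean ON and OFF durations (an assumption of this sketch, consistent with the transition probabilities used later), so that the empirical vacancy fraction approaches βj∕(αj + βj); the function name is hypothetical.

```python
import random

def simulate_pu(alpha, beta, horizon, seed=42):
    """Simulate an exponential ON/OFF PU over [0, horizon], where
    alpha and beta are taken as the mean ON and OFF durations
    (a modeling assumption of this sketch). Returns the fraction
    of time the channel is vacant (PU OFF)."""
    rng = random.Random(seed)
    t, on, off_time = 0.0, True, 0.0
    while t < horizon:
        # expovariate takes the rate, i.e., 1 / mean duration
        dur = rng.expovariate(1.0 / (alpha if on else beta))
        dur = min(dur, horizon - t)
        if not on:
            off_time += dur
        t += dur
        on = not on
    return off_time / horizon

# A "favorable" channel with <alpha, beta> = <2, 10>:
print(round(simulate_pu(2.0, 10.0, 1e6), 2))  # ≈ 10/12 ≈ 0.83
```

Channels with small α and large β are thus mostly vacant, which matches the intuition that the learning agents should concentrate their transmissions there.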

Figure 6 shows an example of the time-slot sequences for three different SU transmitters. We indicate with \(\tau _i^k\) the type of the k-th time-slot implemented by SU i, where \(\tau _i^k \in \{\)SENSE, TRANSMIT, SWITCH}, and with \(T_i=\{\tau _i^0, \tau _i^1, \ldots \}\) the slot schedule of SU i. Each SU i can decide its own schedule Ti, subject to these constraints: (i) if \(\tau _i^k\)=SENSE and the channel is found busy, then \(\tau _i^{k+1}\)=SENSE or \(\tau _i^{k+1}\)=SWITCH, i.e., the SU can keep sensing or switch to a different channel, but it cannot perform a transmission, and (ii) if \(\tau _i^k\)=SWITCH, then \(\tau _i^{k+1}\)=SENSE, i.e., the SU must sense the new channel in order to discover its availability. Let NTXi be the total number of transmissions performed by SU i (including the retransmissions). We denote with STXi(l) the outcome of the l-th transmission (with 0 ≤ l < NTXi) performed by SU i. Based on the channel conditions and on the SU and PU activities, the STXi(l) variable can assume one of these four values: (i) STXi(l)=OK if the transmission has been acknowledged by \(SU_{i}^{rx}\); (ii) STXi(l)=FAIL-PU-COLLISION if the transmission has failed due to a collision with an active PU (i.e., the PU is ON during the SU transmission); (iii) STXi(l)=FAIL-SU-COLLISION if the transmission has failed due to a collision with other SU transmissions on the same channel; and (iv) STXi(l)=FAIL-CHERROR if the transmission has failed due to channel errors.

The JSS problem can be formulated as the problem of determining the optimal schedule Ti of each SU i, 0 ≤ i < N, so that the total number of successful transmissions is maximized, while the probability to interfere with the PUs is kept below a predefined threshold (ψ). More formally:
Fig. 6

Example of time-slot sequences for three different SU transmitters

(JSS Problem) Determine the optimal schedule Ti, ∀i, 0 ≤ i < N, so that:
  • \(\sum _{0 \leq i < N, 0 \leq l < NTX(i)} I(STX_i(l)=\mathtt {OK})\) is maximized;

  • \(\frac {\sum _{0 \leq i < N, 0 \leq l < NTX(i)} I(STX_i(l)=\mathtt {FAIL-PU-COLLISION})}{\sum _{0 \leq i < N}{NTX(i)}} \leq \psi \), where I(⋅) is the indicator function.
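The objective and constraint above can be checked directly from a log of transmission outcomes pooled over all SUs. The sketch below is a hypothetical helper (names and structure are our own) that returns the number of successful transmissions and whether the PU-interference constraint is satisfied.

```python
from collections import Counter

# Symbolic transmission outcomes STX_i(l), as defined in the system model.
OK = "OK"
FAIL_PU = "FAIL-PU-COLLISION"
FAIL_SU = "FAIL-SU-COLLISION"
FAIL_CH = "FAIL-CHERROR"

def jss_metrics(outcomes, psi):
    """outcomes: list of per-transmission results, pooled over all SUs.
    Returns (number of successful transmissions, True if the
    PU-interference rate is at most psi)."""
    counts = Counter(outcomes)          # missing keys count as 0
    successes = counts[OK]
    pu_rate = counts[FAIL_PU] / len(outcomes) if outcomes else 0.0
    return successes, pu_rate <= psi

log = [OK, OK, FAIL_PU, OK, FAIL_CH, OK, OK, FAIL_SU, OK, OK]
succ, feasible = jss_metrics(log, psi=0.15)
print(succ, feasible)  # 7 True
```

In the toy log above, one transmission out of ten collides with a PU, so the interference rate (10%) stays within the ψ = 15% budget while seven transmissions succeed.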

RL-Based Problem Formulation

We model the JSS problem via a SARL model, where each \(SU_{i}^{tx}\) is a learning agent. Figure 7 depicts the corresponding MDP for the case K=2. In more detail:
  • The set of states S is the set of couples < fj, {IDLE, BUSY, UNKNOWN} > , where the first field is one of the K frequencies fj and the second field is the estimated availability of frequency fj, based on the output of the sensing action.
    Fig. 7

    The Markov decision process (MDP) for the JSS problem

  • The set of actions A = {SENSE, TRANSMIT, SWITCH} coincides with the slot types previously introduced.

  • The reward function R : S × A → [0 : 1] is defined in different ways according to the action implemented by \(SU_{i}^{tx}\). More specifically, R(< fj, ⋅ >, SENSE) is set to 1 if channel fj is found BUSY, and to 0 otherwise. In case of a transmit action, R(< fj, IDLE >, TRANSMIT) is set as follows:
    $$\displaystyle \begin{aligned} R(<f_j, \mathtt{IDLE}>,\mathtt{TRANSMIT})= 1 - \frac{\#Retransmissions}{\mathtt{MAX\_ATTEMPTS}} \end{aligned} $$

    where #Retransmissions denotes the number of retransmissions performed. Hence, the reward is 1 if the packet is acknowledged without any retransmission; vice versa, it is 0 if the packet is discarded because the maximum number of retransmission attempts has been reached. Finally, in case of a channel switch, the reward R(< fj, UNKNOWN >, SWITCH) is set to zero.

  • The transition function T : S × A × S → [0 : 1] is defined as follows, i.e.:
    $$\displaystyle \begin{aligned} \begin{array}{rcl}&\displaystyle T(<f_j, \cdot>,\mathtt{SENSE},<f_j, \mathtt{IDLE}>)= \frac{\beta_j}{\alpha_j+ \beta_j} \\ &\displaystyle T(<f_j, \cdot>,\mathtt{SENSE},<f_j, \mathtt{BUSY}>)= \frac{\alpha_j}{\alpha_j+ \beta_j} \\ &\displaystyle T(<f_j, \mathtt{IDLE}>,\mathtt{TRANSMIT},<f_j, \mathtt{IDLE}>)=1\\ &\displaystyle T(<f_j, \cdot>,\mathtt{SWITCH},<f_k, \mathtt{UNKNOWN}>)=1 \\ &\displaystyle T(<f_j, \mathtt{UNKNOWN}>,\mathtt{SENSE},<f_j, \mathtt{IDLE}>)=\frac{\beta_j}{\alpha_j+ \beta_j} \\ &\displaystyle T(<f_j, \mathtt{UNKNOWN}>,\mathtt{SENSE},<f_j, \mathtt{BUSY}>)=\frac{\alpha_j}{\alpha_j+ \beta_j} \end{array} \end{aligned} $$

    For all other input values, the transition function is equal to 0. In the equations above, we neglect the impact of channel sensing errors, and we assume that the next channel fk has already been determined. In any case, the matrix T is of interest only from the theoretical side, since in practice we assume that the SUs do not know its values.

We consider three RL-based learning algorithms addressing the JSS problem, i.e.:
  • Q-Learning based: each \(SU_{i}^{tx}\) stores a Q-value for each state/action pair and updates it after each TRANSMIT or SENSE action through Eq. 4. Moreover, at the end of slot k, \(SU_{i}^{tx}\) decides the next action through a probabilistic scheme. The probability of the TRANSMIT and SENSE actions is set through Eq. 6, while the probability of a SWITCH action is computed as follows:
    $$\displaystyle \begin{aligned} \begin{array}{rcl} p(<f_j,\cdot>,\mathtt{SWITCH})&\displaystyle =&\displaystyle \mathrm{max} \{ \mathrm{max}_{0 \leq v < K, v \neq j} Q(<f_v, \mathtt{IDLE}>, \mathtt{TRANSMIT})\\ &\displaystyle -&\displaystyle Q(<f_j,\mathtt{IDLE}\ >, \mathtt{TRANSMIT}) , \theta \} \end{array} \end{aligned} $$
    Here, the first term of the max operator denotes the maximum gain achievable when switching to a channel different from the current one (fj), while the parameter 0 ≤ θ ≤ 1 indicates the probability of spectrum exploration. In case a SWITCH action is implemented, another probabilistic step is executed in order to select the channel: with probability θ, a random channel is selected in the range {0…K − 1}; otherwise, the best channel is selected, i.e., the one equal to argmax0≤v<K,v≠j Q(< fv, IDLE >, TRANSMIT) − Q(< fj, IDLE >, TRANSMIT). The values of the temperature TE (see Eq. 6) and of θ are set to large initial values and then progressively discounted at each slot in order to ensure convergence, but they cannot decrease below the predefined minimum values TEmin and θmin. We investigate the impact of the initial temperature value TE in section “Analysis I: SU-PU Interference Only”.
  • Sarsa based: the scheme works similarly to the Q-learning except for the update rule of the TRANSMIT and SENSE actions, which is based on Eq. 5.

  • Information Dissemination Q-Learning based (IDQ-Learning): the scheme works similarly to Q-learning. In addition, each \(SU_{i}^{tx}\) shares the information about its state, action, and received reward at each slot. All the \(SU_{j}^{tx}\), j ≠ i, then update their Q-values as if the action had been performed locally.
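The two core computations of the Q-learning-based scheme above can be sketched as follows. This is a minimal sketch under our own assumptions: Eq. 4 is taken to be the standard Q-learning update (learning rate lam and discount gamma are arbitrary placeholder values), and the SWITCH probability is floored at θ as in the formula above and additionally capped at 1 for safety.

```python
def q_update(Q, s, a, r, s_next, lam=0.2, gamma=0.9):
    """Standard Q-learning update (the form we assume for Eq. 4):
    Q(s,a) += lam * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values())
    Q[s][a] += lam * (r + gamma * best_next - Q[s][a])

def switch_probability(q_tx, current, theta):
    """q_tx[v] plays the role of Q(<f_v, IDLE>, TRANSMIT): the SWITCH
    probability is the best gain attainable on another channel,
    floored at the exploration parameter theta, capped at 1."""
    gain = max(q_tx[v] for v in range(len(q_tx)) if v != current) \
           - q_tx[current]
    return min(max(gain, theta), 1.0)

# Example: channel 2 looks much better than the current channel 0,
# so the agent switches with high probability.
p = switch_probability([0.2, 0.1, 0.8], current=0, theta=0.1)
print(round(p, 2))  # 0.6
```

When the current channel already has the highest TRANSMIT Q-value, the gain term is negative and the floor θ keeps a residual exploration probability, which is exactly the role the text assigns to this parameter.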

Analysis I: SU-PU Interference Only

We modeled the CR network scenario and the RL-based algorithms through the NS2-CRAHN simulator described in [34]. Unless stated otherwise, we considered a scenario composed of 20 SU couples (i.e., N=20) and K = 6 licensed channels. The other parameters are pD=90% (i.e., a sensing error probability of 10%), MAX_ATTEMPTS=7, TEini=50, TEmin=5, θini = 80%, and θmin = 10%. Each of the six channels exhibits different PU activity levels (PUL) and PER values, as reported in Table 1.

We consider a constant bit rate (CBR) application: each \(SU_{i}^{tx}\) generates a new packet destined for \(SU_{i}^{rx}\) every 0.005 s. The packet length is 1000 bytes.

In this analysis, we assume that the SUs do not interfere with each other when tuned to the same channel. Hence, the goal of the learning algorithm is to identify vacant spectrum opportunities over the K channels. We compare the performance of the three RL-based schemes described in section “RL-Based Problem Formulation” with that of a non-learning scheme, named Sequential in the following. The protocol operations of the Sequential scheme are straightforward: each SU senses the channel before any transmission attempt; in case the current channel fj is detected as busy, the SU switches to channel f(j+1)%K; otherwise, it transmits one packet and then senses the channel again.
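The Sequential baseline can be captured in a few lines. The sketch below uses a hypothetical function name and returns, from the outcome of the mandatory sensing step, the channel to use and the type of the next slot.

```python
def sequential_step(channel, busy, K):
    """One decision of the non-learning Sequential baseline:
    the SU always senses first; if the channel is busy it hops to
    the next channel (modulo K), otherwise it transmits one packet
    and will go back to sensing afterwards."""
    if busy:
        return (channel + 1) % K, "SWITCH"
    return channel, "TRANSMIT"

print(sequential_step(5, True, 6))   # (0, 'SWITCH')
print(sequential_step(2, False, 6))  # (2, 'TRANSMIT')
```

Because the baseline cycles blindly over the channels, it pays a sensing slot before every transmission and cannot concentrate its traffic on the channels with favorable PUL/PER values, which is precisely where the RL-based schemes gain their advantage.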

Figure 8a shows the system throughput when varying the number of transmitting SUs (N) on the x-axis. It is easy to notice that all the RL-based algorithms greatly outperform the sequential scheme. No significant differences can be appreciated between Q-learning and Sarsa. Vice versa, IDQ-learning provides the highest throughput, and the gain produced by the cooperation becomes more evident when increasing the number of involved SUs. This result can be justified as follows: (i) the RL-based schemes estimate the quality of each channel and then concentrate most of the SU transmissions on channels 5, 2, and 4, characterized by favorable PUL and PER values, and (ii) on these channels, the RL-based schemes reduce the amount of sensing actions while still guaranteeing satisfactory PU detection (see next results). In addition, compared to Q-learning and Sarsa, IDQ-learning guarantees better exploration and quicker convergence of all the SUs to the optimal state-action policy (which is the same for all the SUs). This is made evident in Fig. 8b, which shows the throughput over time for a network scenario with N=20. Until second 100, both Q-learning and Sarsa perform poorly because they are still exploring, owing to the high values of the TE and θ parameters. After second 100, the throughput of both schemes sharply increases because they exploit the learnt policy more aggressively. In IDQ-learning, the impact of random actions is greatly mitigated since the exploration phase is shorter: at each round, each SU can receive N different rewards and hence discounts the TE and θ parameters more quickly. At the same time, the exploration phase is more effective since all the SUs converge to the same policy, guaranteeing the highest throughput. Figure 8c confirms the same trend as Fig. 8a, by showing the packet delivery ratio (PDR) of the four schemes for different values of N. The PDR of the sequential, Q-learning, and Sarsa schemes is not affected by N since – in this analysis – we are neglecting the impact of SU channel contention. The PDR of IDQ-learning increases with N, again due to the positive impact of SU cooperation.
Fig. 8

The throughput and packet delivery ratio (PDR) of the sequential and of the RL-based schemes are shown in (a) and (c), respectively. The throughput over simulation time for N=20 is shown in (b)

Figure 9a reveals that the PDR and throughput enhancements do not come at the expense of increased interference caused to the PUs. On the y-axis, we show the PU interference probability, defined as the ratio of SU transmissions ending in state FAIL-PU-COLLISION to the total number of transmissions performed by the SUs. The Q-learning and Sarsa schemes guarantee a value which is comparable with the performance of the sequential scheme and in any case lower than 2%. The IDQ-learning scheme exhibits a counterintuitive behavior: the risk of interference with the PUs actually decreases when increasing the number of potential interferers (i.e., the SUs), again thanks to the reward dissemination mechanism, through which all the SUs converge to the optimal channel sequence and to the optimal balance between SENSE and TRANSMIT actions on each channel. To this purpose, Fig. 9b shows the average selection rate of each action (different color bars) on each channel (on the x-axis), experienced by the IDQ-learning scheme (N=20). It is easy to notice that our learning scheme (i) concentrates most of the transmissions on channels 5 and 2 (the most favorable ones in terms of PUL and PER values) and (ii) significantly reduces the frequency of sensing actions on channel 5, while it maximizes sensing on channels 0 and 3 (characterized by high PU activity). Hence, the graph confirms the ability of the IDQ-learning scheme to learn the optimal sequence of spectrum opportunities, and the amount of sensing on each channel, without knowing the PER and PUL values in advance. Finally, in Fig. 9c we investigate the impact of the initial temperature value (TEini) on the throughput (on the y1-axis) and on the number of channel switches (on the y2-axis). Again, we evaluate the IDQ-learning scheme with N = 20. We can notice that there is an optimal TEini value (equal to 10 in our case) maximizing the throughput: when TEini < 10, the exploration phase is too short and hence the optimal policy cannot be discovered; vice versa, when TEini ≫ 10, the impact of suboptimal actions during exploration becomes significant. On the other hand, the number of channel switches increases proportionally with TEini.
Fig. 9

The PU interference probability over N is shown in (a). The selection rate of the three actions (TRANSMIT, SENSE, and SWITCH) on each of the K channels and for the IDQ-learning scheme is depicted in (b). The impact of the initial temperature value (TEini) on the throughput and on the number of channel switches is shown in (c)

Analysis II: SU-PU and SU-SU Interference

In this section, we complete the analysis by considering also the impact of SU-SU interference on the system performance. In addition to the channel model presented so far, a SU transmission might result in state FAIL-SU-COLLISION, i.e., it can fail due to collisions with other SU transmissions on the same channel. Such collisions can occur despite the utilization of a MAC protocol providing distributed channel contention orchestration mechanisms (we assume all the SUs follow a CSMA/CA protocol). As a result, the SUs cannot merely implement all the same policy; besides determining the optimal balance of sensing actions, the SUs must also learn the optimal coordinated action, which leads to the optimal allocation of the SUs to the available frequencies. The result is a hybrid collaborative/competitive MARL problem; no solutions could be found in the literature for this specific JSS problem instance, although close problems have been modeled via game theory-based approaches and distributed collaboration techniques (e.g., the payoff propagation mechanism described in [52, 53]). Without claiming to determine the optimal solution, we introduce here an additional RL-based scheme, named distributed Q-learning, which is based on the frequency maximum Q-value (FMQ) heuristic [27]. The latter attempts to achieve SU coordination without any explicit policy exchange and with minimal extra-storage requirements compared to the basic SARL techniques. We present the protocol operations in brief:
  • Distributed Q-Learning (DistQ-Learning): the protocol works similarly to the Q-learning scheme described in section “RL-Based Problem Formulation”, with two significant differences. First, each time \(SU_{i}^{tx}\) performs a TRANSMIT action on a given channel, it computes a local reward \(r_i^L=1 - \frac {\#Retransmissions}{\mathtt {MAX\_ATTEMPTS}}\) and shares it with all the other SUs. By averaging the received \(r_j^L\) values, j ≠ i, each \(SU_{i}^{tx}\) computes the average network reward \(r^G=\frac {\sum _{0\leq i < N} r_i^L}{N}\), which is a proxy for the network throughput. Second, once the rG value has been computed, each \(SU_{i}^{tx}\) updates the Q-table for the TRANSMIT action on channel fj by following the FMQ rule [27], i.e.:
    $$\displaystyle \begin{aligned} \begin{array}{rcl} {} Q(<f_j, \mathtt{IDLE}>, \mathtt{TRANSMIT})&\displaystyle =&\displaystyle Q(<f_j, \mathtt{IDLE}>, \mathtt{TRANSMIT})\\ &\displaystyle +&\displaystyle \frac{C^i_{\mathrm{max}}(r^G_{\mathrm{max}}, f_j)}{C^i(f_j)} \cdot r^G_{\mathrm{max}}(f_j) \end{array} \end{aligned} $$
    where \(r^G_{\mathrm {max}}(f_j)\) is the maximum global reward observed while \(SU_{i}^{tx}\) is tuned to channel fj, \(C^i_{\mathrm {max}}(r^G_{\mathrm {max}}, f_j)\) is the number of times such a value has been observed, and Ci(fj) is the total number of transmission attempts on frequency fj. As a result, each SU pushes its policy toward channels where an optimal network reward rG is achieved, although no SU keeps track of the global MARL Q-table nor makes conjectures about the opponents’ behaviors; the rG value indirectly reflects the optimality of the joint action performed by the other SUs, and based on it each SU adjusts its own policy.
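The per-channel bookkeeping required by the FMQ rule is modest, as the sketch below illustrates (class and method names are our own): for each channel it suffices to track \(r^G_{\mathrm{max}}(f_j)\), the count \(C^i_{\mathrm{max}}\), and the attempt counter \(C^i(f_j)\), from which the additive term of the update is computed.

```python
class FMQChannelStats:
    """Per-channel state for the FMQ heuristic (a sketch of the
    rule in the text): best global reward seen on the channel,
    how often it was seen, and total transmission attempts."""

    def __init__(self):
        self.r_max = 0.0      # r^G_max(f_j)
        self.count_max = 0    # C^i_max(r^G_max, f_j)
        self.attempts = 0     # C^i(f_j)

    def observe(self, r_global):
        """Record the average network reward r^G received after one
        TRANSMIT attempt on this channel."""
        self.attempts += 1
        if r_global > self.r_max:
            self.r_max, self.count_max = r_global, 1
        elif r_global == self.r_max:
            self.count_max += 1

    def fmq_bonus(self):
        """Additive term of the Q-update:
        (C^i_max / C^i) * r^G_max on this channel."""
        if self.attempts == 0:
            return 0.0
        return (self.count_max / self.attempts) * self.r_max

stats = FMQChannelStats()
for r in (0.5, 0.8, 0.8):
    stats.observe(r)
print(round(stats.fmq_bonus(), 3))  # 0.533
```

The bonus grows when the best network-wide reward is both high and frequently observed on a channel, which is what steers each SU toward the channel allocation that maximizes the joint outcome.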
In Fig. 10a, b, and c, we compare the performance of the DistQ-learning scheme against the sequential and the IDQ-learning algorithms previously introduced in section “Analysis I: SU-PU Interference Only”. We consider the same network environment of the previous analysis, except for the number of licensed frequencies (K = 9 instead of K = 6); channels 7-8-9 have the same PUL-PER profiles as channels 3-4-5 (see Table 1). Figure 10a shows the network throughput when varying the number of transmitting SUs (N). The throughput values are lower than in Fig. 8a and also decrease with N as a consequence of the SU contention on each channel. We can notice that the DistQ-learning scheme provides significantly better performance not only than the sequential scheme but also than the IDQ-learning scheme. Using the latter, all the SUs attempt to discover the same policy, i.e., they transmit on the same channels and balance TRANSMIT and SENSE actions in the same way. Vice versa, the DistQ-learning scheme aims at achieving implicit coordination among the SUs through Eq. 16; the SUs learn differentiated policies – at least regarding the Q-value of the TRANSMIT action on each channel – so that the maximum, network-wide reward can be achieved. This is also visible in Fig. 10b, which shows the network throughput on each of the K channels, for the three different algorithms and N = 20. The IDQ-learning scheme concentrates most of the SU transmissions on channels 8 and 4 (the most favorable to the SUs in terms of PUL/PER profiles), but this clearly increases the contention level on those frequencies and hence the risk of packet losses due to SU-SU collisions. Vice versa, the DistQ-learning scheme achieves a better distribution of the SUs over the available spectrum opportunities, which also translates into enhancements in terms of PDR, as depicted in Fig. 10c.
Fig. 10

The throughput and packet delivery ratio (PDR) of the sequential and of the RL-based schemes are shown in (a) and (c), respectively. The throughput on each of the K = 9 channels, for N=20, is shown in (b)

Table 1

PUL/PER profiles of the K = 6 licensed channels

Channel index    PUL profile: < α, β >
0                < 10, 2 >
1                < 5, 5 >
2                < 2, 10 >
3                < 10, 2 >
4                < 5, 5 >
5                < 2, 10 >


Conclusions and Open Issues

In this paper, we have addressed the utilization of reinforcement learning (RL) techniques in cognitive radio (CR) networks. A twofold taxonomy of the existing RL-CR studies has been proposed, from a networking perspective and from a learning perspective. The review of the literature has confirmed that RL techniques are quite popular in CR networking and that they have been applied to several different use-cases and dynamic network scenarios, often enhancing the performance of non-learning-based solutions. At the same time, despite the number of published papers, we believe there is great room for improvement, since some RL-CR issues have been only superficially addressed by academic research. We focus here on three main research issues:
  • Accurate performance evaluation in real-world CR network scenarios. Apart from a few experimental works [68, 71, 72], the evaluation of RL-based solutions has been mainly conducted through simulation studies, in which the spectrum occupancy pattern is modeled by using well-known probability distributions (like the exponential [4] or the Bernoulli [56] distribution). However, spectrum bands might exhibit PU occupancy patterns of different complexity, based on the signal waveform and on the regulations of the licensed users transmitting on those frequencies [70]. Hence, differentiated RL algorithms (e.g., considering model-free or model-based approaches, MDP or POMDP frameworks) can be tested and deployed on different frequencies, also based on the predictability of the spectrum availability. Additional works based on real-world spectrum traces are required in order to address this issue.

  • Analysis of RL techniques for CR applications with strict QoS network requirements. Several multimedia applications pose strict QoS requirements (e.g., a maximum packet drop rate or jitter for video-streaming services) that must be continuously met by the network in order not to negatively affect the users’ experience. In RL-CR solutions, the SUs must continuously balance the exploitation and exploration phases during the system lifetime: when a SU selects random, possibly suboptimal actions, the QoS requirements of the CR multimedia applications are not guaranteed. Techniques that provide effective state-action exploration without causing detectable performance degradation for CR applications with strict QoS requirements have not been designed so far.

  • Enhancement of the learning framework. Most of the reviewed works in the RL-CR literature provide an accurate modeling of the CR network scenario, including the operations of the main actors (PUs and SUs); the same level of sophistication cannot be found on the learning side, since in most cases the RL framework consists in a straightforward application of model-free algorithms (mostly Q-learning or Sarsa). However, RL theory is vast and not limited to such results [12]; moreover, it is continuously extended by novel contributions coming from an active research community [73]. CR networking can benefit from these novel RL architectures: we cite, among others, the utilization of deep RL techniques (i.e., the RL framework enhanced with artificial neural networks for state-action representation) for better coordination in distributed scenarios [74] or for more efficient exploration [75].

Regarding the second contribution of this paper, i.e., the application of RL techniques to joint spectrum sensing and selection problems, the evaluation results have revealed that information-sharing schemes can considerably enhance the performance of RL-based schemes in fully cooperative tasks, like SUs searching for PU-vacant channels. Vice versa, in mixed cooperative/competitive tasks, like SUs searching for PU-vacant channels while also minimizing their mutual interference, information-sharing schemes can be counterproductive unless enhanced with distributed coordination strategies. Current and future research activities on the case study include the convergence analysis in multi-agent scenarios, the implementation of the proposed RL schemes on a small-scale testbed, and the extension of the proposed modeling approach to the case of partially observable environments.


  1. The classification rule is based on the occurrence of specific keywords in the paper title.

  2. In MAB theory [55], the regret is defined as the expected difference between the reward sum associated with an optimal strategy and the sum of the collected rewards of the actual strategy.


  1. 1.
    Akyildiz IF, Lee WY, Vuran MC, Mohanty S (2006) NeXt generation/dynamic spectrum access/cognitive radio wireless networks: a survey. Comput Netw J 50(1):2127–2159Google Scholar
  2. 2.
    Mitola J (2000) Cognitive radio an integrated agent architecture for software defined radio. PhD Dissertation, KTH StockholmGoogle Scholar
  3. 3.
    Yucek T, Arslan H (2009) A survey of spectrum sensing algorithms for cognitive radio applications. J IEEE Commun Surv Tutor 11(1):116–130Google Scholar
  4. 4.
    Lee WY, Akyildiz I (2008) Optimal spectrum sensing framework for cognitive radio networks. IEEE Trans Wirel Commun 7(10):3845–3857Google Scholar
  5. 5.
    Sherman M, Mody AN, Martinez R, Rodriguez C, Reddy R (2008) IEEE standards supporting cognitive radio and networks, dynamic spectrum access, and coexistence. IEEE Commun Mag 46(7):72–79Google Scholar
  6. 6.
    Flores AB, Guerra RE, Knightly EW (2013) IEEE 802.11af: a standard for TV white space spectrum sharing. IEEE Commun Mag 51(10):92–100Google Scholar
  7. 7.
    Clancy C, Hecker J, Stuntbeck E, OShea T (2007) Applications of machine learning to cognitive radio networks. IEEE Wirel Commun 14(4):47–52Google Scholar
  8. 8.
    Mitchell T (1997) Machine learning. McGraw Hill, New YorkGoogle Scholar
  9. 9.
    Gavrilovska L, Atanasovksi V, Macaluso I, DaSilva L (2013) Learning and reasoning in cognitive radio networks. IEEE Commun Surv Tutor 15(4):1761–1777Google Scholar
  10. 10.
    Bkassiny M, Li Y, Jayaweera SK (2013) A survey on machine-learning techniques in cognitive radios. IEEE Commun Surv Tutor 15(3):1136–1159Google Scholar
  11. 11.
    Wang W, Kwasinksi A, Niyato D, Han Z (2016) A survey on applications of model-free strategy learning in cognitive wireless networks. IEEE Commun Surv Tutor 18(3):1717–1757Google Scholar
  12. 12.
    Barto AG, Sutton R (1998) Reinforcement learning: an introduction. MIT Press, CambridgeGoogle Scholar
  13. 13.
    Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell Res 4(1):237–285Google Scholar
  14. 14.
    Busoniu L, Babuska R, De Schutter B (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst Man Cybern 38(2):156–171Google Scholar
  15. 15.
    Busoniu L, Babuska R, De Schutter B (2006) Multi-agent reinforcement learning: a survey. In: Proceedings of IEEE ICARCV, SingaporeGoogle Scholar
  16. 16.
    Watkins CJ, Dayan P (1992) Technical note: Q-learning. Mach Learn 8(1):279–292Google Scholar
  17. 17.
    Rummery GA, Niranjan M (1994) Online Q-learning using connectionist systems. Technical ReportGoogle Scholar
  18. 18.
    Di Felice MK, Wu C, Bononi L, Meleis W (2010) Learning-based spectrum selection in cognitive radio ad hoc networks. In: Proceedings of IEEE/IFIP WWIC, LuleaGoogle Scholar
  19. 19.
    Yau KLA, Komisarczuk P, Teal PD (2012) Reinforcement learning for context awareness and intelligence in wireless networks: review, new features and open issues. J Netw Comput Appl 35(1):235–267Google Scholar
  20. 20.
    Yau KLA, Komisarczuk P, Teal PD (2010) Applications of reinforcement learning to cognitive radio networks. In: Proceedings of IEEE ICC, CapetownGoogle Scholar
  21. 21.
    Raza Syed A, Alvin Yau KL, Qadir J, Mohamad H, Ramli N, Loong Keoh S (2016) Route selection for multi-hop cognitive radio networks using reinforcement learning: an experimental study. In: Proceedings of IEEE access 4(1):6304–6324Google Scholar
  22.
    Vucevic N, Akyildiz IF, Perez-Romero J (2010) Cooperation reliability based on reinforcement learning for cognitive radio networks. In: Proceedings of IEEE SDR, Boston
  23.
    Jiang T, Grace D, Mitchell PD (2011) Efficient exploration in reinforcement learning-based cognitive radio spectrum sharing. IET Commun 5(10):1309–1317
  24.
    Ozekin E, Demirci FC, Alagoz F (2013) Self-evaluating reinforcement learning based spectrum management for cognitive ad hoc networks. In: Proceedings of IEEE ICOIN, Bangkok
  25.
    Macaluso I, DaSilva L, Doyle L (2012) Learning Nash equilibria in distributed channel selection for frequency-agile radios. In: Proceedings of ECAI, Montpellier
  26.
    Lall S, Sadhu AK, Konar A, Mallik KK, Ghosh S (2016) Multi-agent reinforcement learning for stochastic power management in cognitive radio network. In: Proceedings of IEEE MicroCom, Durgapur
  27.
    Kapetanakis S, Kudenko D (2002) Reinforcement learning of coordination in cooperative multi-agent systems. In: Proceedings of AAAI, Menlo Park
  28.
    Wahab B, Yang Y, Fan Z, Sooriyabandara M (2009) Reinforcement learning based spectrum-aware routing in multi-hop cognitive radio networks. In: Proceedings of IEEE CROWNCOM, Hannover
  29.
    Chowdhury K, Wu C, Di Felice M, Meleis W (2010) Spectrum management of cognitive radio using multi-agent reinforcement learning. In: Proceedings of AAMAS, Toronto
  30.
    Faganello LR, Kunst R, Both CB (2013) Improving reinforcement learning algorithms for dynamic spectrum allocation in cognitive sensor networks. In: Proceedings of IEEE WCNC, Shanghai
  31.
    Wu Y, Hu F, Kumar S, Zhu Y, Talari A, Rahnavard N, Matyjas JD (2014) A learning-based QoE-driven spectrum handoff scheme for multimedia transmissions over cognitive radio networks. IEEE J Sel Areas Commun 32(11):2134–2148
  32.
    Chen X, Zhao Z, Zhang H (2013) Stochastic power adaptation with multiagent reinforcement learning for cognitive wireless mesh networks. IEEE Trans Mob Comput 12(11):2155–2166
  33.
    Zhou P, Chang Y, Copeland JA (2010) Learning through reinforcement for repeated power control game in cognitive radio networks. In: Proceedings of IEEE Globecom, Miami
  34.
    Di Felice M, Chowdhury K, Kim W, Kassler A, Bononi L (2011) End-to-end protocols for cognitive radio ad hoc networks: an evaluation study. Perform Eval 68(9):859–875
  35.
    Reddy YB (2008) Detecting primary signals for efficient utilization of spectrum using Q-learning. In: Proceedings of IEEE ITNG, Las Vegas
  36.
    Berthold U, Fu F, van der Schaar M, Jondral FK (2008) Detection of spectral resources in cognitive radios using reinforcement learning. In: Proceedings of IEEE DySPAN, pp 1–5
  37.
    Di Felice M, Chowdhury KR, Kassler A, Bononi L (2011) Adaptive sensing scheduling and spectrum selection in cognitive wireless mesh networks. In: Proceedings of IEEE Flex-BWAN, Maui
  38.
    Arunthavanathan S, Kandeepan S, Evans RJ (2013) Reinforcement learning based secondary user transmissions in cognitive radio networks. In: Proceedings of IEEE Globecom, Atlanta
  39.
    Mendes AC, Augusto CHP, da Silva MWR, Guedes RM, de Rezende JF (2011) Channel sensing order for cognitive radio networks using reinforcement learning. In: Proceedings of IEEE LCN, Bonn
  40.
    Lo BF, Akyildiz IF (2010) Reinforcement learning-based cooperative sensing in cognitive radio ad hoc networks. In: Proceedings of IEEE PIMRC, Istanbul
  41.
    Lunden J, Kulkarni SR, Koivunen V, Poor HV (2011) Exploiting spatial diversity in multiagent reinforcement learning based spectrum sensing. In: Proceedings of IEEE CAMSAP, San Juan
  42.
    Lunden J, Kulkarni SR, Koivunen V, Poor HV (2013) Multiagent reinforcement learning based spectrum sensing policies for cognitive radio networks. IEEE J Sel Top Signal Process 7(5):858–868
  43.
    Jao Y, Feng Z (2010) Centralized channel and power allocation for cognitive radio network: a Q-learning solution. In: Proceedings of IEEE FNMS, Florence
  44.
    Galindo-Serrano A, Giupponi L, Blasco P, Dohler M (2010) Learning from experts in cognitive radio networks: the docitive paradigm. In: Proceedings of IEEE CROWNCOM, Cannes
  45.
    Galindo-Serrano A, Giupponi L (2010) Distributed Q-learning for aggregated interference control in cognitive radio networks. IEEE Trans Veh Technol 59(4):1823–1834
  46.
    Chowdhury KR, Di Felice M, Doost-Mohammady R, Meleis W, Bononi L (2011) Cooperation and communication in cognitive radio networks based on TV spectrum experiments. In: Proceedings of IEEE WoWMoM, Lucca
  47.
    Emre M, Gur G, Bayhan S, Alagoz F (2015) CooperativeQ: energy-efficient channel access based on cooperative reinforcement learning. In: Proceedings of IEEE ICCW, London
  48.
    Saad H, Mohamed A, ElBatt T (2012) Distributed cooperative Q-learning for power allocation in cognitive femtocell networks. In: Proceedings of IEEE VTC-Fall, Quebec City
  49.
    Venkatraman P, Hamdaoui B, Guizani M (2010) Opportunistic bandwidth sharing through reinforcement learning. IEEE Trans Veh Technol 59(6):3148–3153
  50.
    Bernardo F, Agusti R, Perez-Romero J, Sallent O (2010) Distributed spectrum management based on reinforcement learning. In: Proceedings of IEEE CROWNCOM, Hannover
  51.
    Yau KLA, Komisarczuk P, Teal PD (2010) Context-awareness and intelligence in distributed cognitive radio networks: a reinforcement learning approach. In: Proceedings of IEEE AusCTW, Canberra
  52.
    Yau KLA, Komisarczuk P, Teal PD (2010) Enhancing network performance in distributed cognitive radio networks using single-agent and multi-agent reinforcement learning. In: Proceedings of IEEE LCN, Denver
  53.
    Yau KLA, Komisarczuk P, Teal PD (2010) Achieving context awareness and intelligence in distributed cognitive radio networks: a payoff propagation approach. In: Proceedings of IEEE WAINA, Singapore
  54.
    Kakalou I, Papadimitriou GI, Nicopolitidis P, Sarigiannidis PG, Obaidat MS (2015) A reinforcement learning-based cognitive MAC protocol. In: Proceedings of IEEE ICC, London
  55.
    Agrawal R (1995) Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Adv Appl Probab 27(4):1054–1078
  56.
    Robert C, Moy C, Wang CX (2014) Reinforcement learning approaches and evaluation criteria for opportunistic spectrum access. In: Proceedings of IEEE ICC, Sydney
  57.
    Jouini W, Di Felice M, Bononi L, Moy C (2012) Coordination and collaboration in secondary networks: a multi-armed bandit based framework. Technical report
  58.
    Li H (2010) Multi-agent Q-learning for competitive spectrum access in cognitive radio systems. In: Proceedings of IEEE SDR, Boston
  59.
    Alsarhan A, Agarwal A (2010) Resource adaptations for revenue optimization in cognitive mesh network using reinforcement learning. In: Proceedings of IEEE Globecom, Miami
  60.
    Teng Y, Zhang Y, Niu F, Dai C, Song M (2010) Reinforcement learning based auction algorithm for dynamic spectrum access in cognitive radio networks. In: Proceedings of IEEE VTC-Fall, Ottawa
  61.
    Cesana M, Cuomo F, Ekici E (2011) Routing in cognitive radio networks: challenges and solutions. Ad Hoc Netw 9(3):228–248
  62.
    Chowdhury KR, Di Felice M (2009) SEARCH: a routing protocol for mobile cognitive radio ad-hoc networks. Comput Commun 32(18):1983–1997
  63.
    Boyan JA, Littman ML (1994) Packet routing in dynamically changing networks: a reinforcement learning approach. Adv Neural Inf Process Syst 6:671–678
  64.
    Chetret D, Tham C, Wong L (2004) Reinforcement learning and CMAC-based adaptive routing for MANETs. In: Proceedings of IEEE ICON, Singapore
  65.
    Al-Rawi HAA, Yau KLA, Mohamad H, Ramli N, Hashim W (2014) A reinforcement learning-based routing scheme for cognitive radio ad hoc networks. In: Proceedings of IEEE WMNC, Vilamoura
  66.
    Zheng K, Li H, Qiu RC, Gong S (2012) Multi-objective reinforcement learning based routing in cognitive radio networks: walking in a random maze. In: Proceedings of IEEE ICNC, Maui
  67.
    Safdar T, Hasbulah HB, Rehan M (2015) Effect of reinforcement learning on routing of cognitive radio ad hoc networks. In: Proceedings of IEEE ISMSC, Ipoh
  68.
    Pourpeighambar B, Dehghan M, Sabaei M (2017) Non-cooperative reinforcement learning based routing in cognitive radio networks. Comput Commun 106(1):11–23
  69.
    Dowling J, Curran E, Cunningham R, Cahill V (2005) Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing. IEEE Trans Syst Man Cybern 35(3):360–372
  70.
    Macaluso I, Finn D, Ozgul B, DaSilva LA (2013) Complexity of spectrum activity and benefits of reinforcement learning for dynamic channel selection. IEEE J Sel Areas Commun 31(11):2237–2246
  71.
    Ren Y, Dmochowski P, Komisarczuk P (2010) Analysis and implementation of reinforcement learning on a GNU radio cognitive radio platform. In: Proceedings of IEEE CROWNCOM, Cannes
  72.
    Moy C, Nafkha A, Naoues M (2015) Reinforcement learning demonstrator for opportunistic spectrum access on real radio signals. In: Proceedings of IEEE DySPAN, Stockholm
  73.
    Dayan P, Niv Y (2008) Reinforcement learning: the good, the bad and the ugly. Curr Opin Neurobiol 18(1):1–12
  74.
    Naparstek O, Cohen K (2017) Deep multi-user reinforcement learning for distributed dynamic spectrum access. CoRR abs/1704.02613
  75.
    Ferreira PVR, Paffenroth R, Wyglinski AM, Hackett TM, Bilen SG, Reinhart RC, Mortensen DJ (2017) Multi-objective reinforcement learning-based deep neural networks for cognitive space communications. In: Proceedings of IEEE CCAA, Cleveland

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, University of Bologna, Bologna, Italy

Section editors and affiliations

  • Yue Gao
  1. School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK