Handbook of Cognitive Radio, pp. 1-38

# Reinforcement Learning-Based Spectrum Management for Cognitive Radio Networks: A Literature Review and Case Study


## Abstract

In cognitive radio (CR) networks, the cognition cycle, i.e., the ability of wireless transceivers to learn the optimal configuration meeting environmental and application requirements, is considered as important as the hardware components which enable the dynamic spectrum access (DSA) capabilities. To this purpose, several machine learning (ML) techniques have been applied to CR spectrum and network management issues, including spectrum sensing, spectrum selection, and routing. In this paper, we focus on reinforcement learning (RL), an online ML paradigm where an agent discovers the optimal sequence of actions required to perform a task via trial-and-error interactions with the environment. Our study provides both a survey and a proof of concept of RL applications in CR networking. As a survey, we discuss pros and cons of the RL framework compared to other ML techniques, and we provide an exhaustive review of the RL-CR literature, by considering a twofold perspective, i.e., an application-driven taxonomy and a learning methodology-driven taxonomy. As a proof of concept, we investigate the application of RL techniques on joint spectrum sensing and decision problems, by comparing different algorithms and learning strategies and by further analyzing the impact of information sharing techniques in purely cooperative or mixed cooperative/competitive tasks.

## Introduction

A cognitive radio (CR) can be defined as a wireless device that is able to autonomously control its configuration based on the environmental conditions and on the quality of service (QoS) requirements of the applications [1]. Since its original proposal in 1999 [2], the node architecture has been considered the core novelty of a CR device, being the fusion of advanced dynamic spectrum access (DSA) functionalities at the radio level and of intelligent decision-making provided by a cognition module (CM) at the software level. Through the DSA, a CR is able to observe the network environment and to dynamically adjust transmission parameters like the operative frequency, the modulation and coding scheme, or the power level. To this purpose, the dynamic reuse of vacant portions of the licensed spectrum, in overlay or underlay mode, has emerged as the prominent use-case of the CR technology: CR devices, also known as secondary users (SUs), aim to maximally exploit all the available spectrum frequencies, including both licensed and unlicensed ones, without affecting the performance of the frequency owners, also known as primary users (PUs) [1]. The research literature on channel sensing techniques, required to detect PU-free transmission opportunities in the frequency and time domains, is vast [3, 4], as is the number of proposed network architectures and standards regulating the operations of the SUs [5, 6]. On top of the DSA module, the CM leverages the perceptions and measurements gathered during the sensing phase for the decision-making process, i.e., to properly adjust the radio configuration and plan the network operations, by means of advanced learning and reasoning functionalities [7]. 
For this reason, a significant portion of the literature on CR networking is investigating the utilization of machine learning (ML) techniques [8] for the device and network configuration, optimization, and planning; the ML approaches adopted so far are extremely heterogeneous and include supervised learning techniques (e.g., neural networks and Bayesian classifiers), unsupervised learning techniques, and dynamic games [9, 10, 11].

In this paper, we focus on reinforcement learning (RL) [12, 13], a well-known ML paradigm where the agent learns the optimal sequence of actions in order to fulfill a specific task via trial-and-error interactions with a dynamic environment; at each action performed, the agent observes its current state and receives a numeric reward, which quantifies the effectiveness of the action. The agent behavior, also known as the *policy*, should choose actions that tend to increase the long-term sum of rewards [13]. The literature on RL dates back to the 1960s [12] and comprises several different techniques and variants [14, 15, 16, 17]. The online nature of the learning process fits well the architecture of a CR device: the DSA module provides context-awareness via explicit feedback and channel measurements, and based on such rewards, the RL-CM is able to learn the optimal state-action mapping. Differently from supervised learning [8], RL algorithms can work without assuming any prior knowledge of the environment and of the reward function [11]. At the same time, an RL agent continuously adjusts its current policy based on the interactions with the environment: hence, policy adaptiveness is implicitly addressed also in dynamic and nonstationary environments.

This property is particularly interesting in CR networking scenarios, which are dynamic by nature due to the mobility of the SU devices, the PU activity patterns, and the likely varying propagation and traffic load conditions, and constitutes another significant advantage compared to traditional optimization approaches. Thanks to these benefits, several recent works have demonstrated that RL techniques can be applied on spectrum management issues [18, 19, 20], including channel sensing, channel selection, or power control problems, as well as on many CR network management issues, including routing, cooperation control, and security [21, 22]. At the same time, the application of RL techniques in CR scenarios poses a number of technical challenges, like the impact of the exploration phase on the system performance [23, 24] and the convergence in distributed environments characterized by the presence of SUs that compete for shared resources (e.g., channel frequency) while cooperating on keeping the aggregated interference below a predefined QoS threshold [25, 26].

This paper investigates the application of RL techniques on CR networking by providing two kinds of scientific contributions, i.e., (i) a survey of the RL-CR-related literature, which can serve also as a tutorial for readers approaching the topic for the first time, and (ii) a proof of concept of RL techniques on novel CR use-cases. Regarding the survey/tutorial, after a brief presentation of the RL theory and of the main algorithms, we discuss advantages and drawbacks of the RL framework for CR networking, and we compare it against other ML approaches. We then provide an up-to-date and exhaustive review of the RL-CR-related literature through a twofold taxonomy. The first taxonomy is based on the CR application domains, focusing on spectrum management and network configuration issues (i.e., spectrum sensing, decision, power allocation, and routing); within each category, we further classify the studies according to the goal being addressed. The second taxonomy is learning methodology-driven, i.e., we review the literature according to specific RL modeling features which are orthogonal to the application domain, like the modeling of the environment and of the reward function. Regarding the proof of concept, we describe a novel application of RL techniques on joint channel sensing and decision problems (hence, combining two research issues which are treated separately in the survey): more specifically, we show how the SUs can autonomously learn the optimal channel allocation, as well as the optimal balance of sensing/transmitting actions on each channel, so that the secondary network performance is maximized, while the harmful interference to PU receivers is kept below a QoS threshold. 
We formulate the problem as an instance of a Markov decision process (MDP) [12, 13], and we test different algorithms (Q-learning and Sarsa) and learning models, on two different task goals: independent learning and collaborative agents on a fully cooperative task (e.g., PU-SU interference minimization) and distributed coordinating agents on a mixed cooperative/competitive task (e.g., SU-SU and PU-SU interference minimization). The experimental results show that RL-based solutions can greatly enhance the performance in dynamic CR environments compared to non-learning-based solutions; at the same time, they unveil the impact of RL parameter tuning, knowledge sharing techniques, and algorithm selection, hence paving the way to further research on the topic.

The rest of the paper is structured as follows. Section “Related Works” reviews the existing surveys addressing ML and RL applications in CR networks and points out the novelties of this paper. Section “Overview of Reinforcement Learning” provides an overview of the RL theory, by introducing a taxonomy of the existing techniques and by also summarizing the operations of the most popular RL algorithms. Advantages and drawbacks of RL-CR approaches are discussed in section “Reinforcement Learning in Cognitive Radio Scenarios: Pros and Cons”. Section “Reinforcement Learning in Cognitive Radio Scenarios: Applications-Driven Taxonomy” reviews the existing RL-CR studies according to an application-driven taxonomy. The existing works are further classified by means of a learning methodology-driven taxonomy in section “Reinforcement Learning in Cognitive Radio Scenarios: Learning Methodology-Driven Taxonomy”. The case study is presented in section “Case Study: RL-Based Joint Spectrum Sensing/Selection Scheme for CR Networks”, together with the RL formulation, proposed algorithms and performance evaluation results. Conclusions are drawn in section “Conclusions and Open Issues”.

## Related Works

The most comprehensive surveys investigating the applications of ML techniques on CR networking are probably [9, 10], and [11]. More specifically, [10] describes the existing applications of ML techniques on CR networking, considering both supervised and unsupervised learning techniques and including also the RL-based approaches. Moreover, the authors investigate the learning challenges in non-Markovian environments and discuss policy-gradient algorithms. An impressive review of model-free learning-based solutions in CR networks is presented in [11], where the existing works are grouped in three main categories, i.e.: (i) strategy-learning schemes based on single-agent systems, (ii) strategy-learning schemes based on loosely coupled multi-agent systems, and (iii) strategy-learning schemes in the context of games. In [9], the authors survey the ML-CR literature by considering an interesting distinction between learning aspects of cognition – which include RL and dynamic games – and reasoning aspects. These latter are in charge of applying inference on the acquired and the learned knowledge, hence enriching the current knowledge base; applications of policy-based reasoning to predict spectrum handover operations or to enhance spectrum opportunity detections are evaluated in a test-bed [9]. The strict relationship occurring between learning and reasoning in CR networks is also investigated in [7]. By focusing on the RL-CR literature, the authors of [20] demonstrate how the RL framework, and in particular the Q-routing algorithm, can be utilized as modeling tool in four different problems, regarding dynamic channel selection (DCS), DCS and route selection, DCS and congestion control, and packet scheduling in QoS environments. Similarly, the authors of [18] show how three different CR problems (routing, channel sensing, and decision) can be modeled via the Markov decision process (MDP) introduced by the RL framework. 
Applications, implementations, and open issues of RL techniques in CR networks are extensively discussed in [19], which is the work most similar to our paper. Our paper provides two additional contributions compared to [19]: (i) it provides an up-to-date review of the RL-CR literature from two different perspectives, i.e., a CR networking perspective and a learning perspective, and (ii) it evaluates gains and drawbacks of the RL framework on a realistic CR use-case, addressing joint spectrum sensing and selection.

## Overview of Reinforcement Learning

The RL framework models the interactions between the learning agent and the environment through a Markov decision process (MDP) [12, 13], i.e., a tuple <*S*, *A*, *R*, *ST*>, where:

- *S* is the (discrete) set of *available States*; let *s*_{t} denote the current state of the agent at time *t*.
- *A* is the (discrete) set of *Actions*; let *A*(*s*_{t}) denote the set of actions available in state *s*_{t}.
- \(R: S \times A \rightarrow \Re \) is the *Reward* function indicating the numeric reward received at each state/action pair; more specifically, let *r*_{t} indicate the reward received by the agent while being in state *s*_{t} and executing action *a*_{t} ∈ *A*(*s*_{t}).
- *ST*: *S* × *A* → *S* is the *State Transition* function, which indicates the next state *s*_{t+1} after executing action *a*_{t} ∈ *A*(*s*_{t}) from state *s*_{t}; in case of nondeterministic environments, the *ST* function becomes a probability distribution over the next states, i.e., *ST*: *S* × *A* × *S* → [0, 1].

The agent behavior is defined by a policy *π*: *S* → *A*, which indicates, for each state *s*_{t}, the proper action *a*_{t} to execute. Similarly to the state transition function, the policy function can also be modeled as a probability distribution over the set of actions and states, i.e., *π*: *S* × *A* → [0, 1]. The goal of the agent is to discover the optimal policy *π*^{∗} which maximizes a specific function of the received rewards over time. In the infinite-horizon discount model [13], the policy aims to maximize the long-run expected reward, while discounting the rewards received in the future, i.e.:

$$\pi^{*} = \operatorname*{argmax}_{\pi} E\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right] \qquad (1)$$

where 0 ≤ *γ* ≤ 1 is a factor discounting the future rewards. If *γ* = 0, the agent aims to maximize only the immediate rewards. In order to compute the optimal policy, several RL algorithms employ two additional data structures: the state-value function (*V*^{π}) and the state-action function (*Q*^{π}) [12, 13]. For each state *s* ∈ *S*, the state-value function *V*^{π}(*s*_{t}) represents the expected reward when following policy *π* from state *s*_{t}. The *V*^{π}(*s*_{t}) value can be computed as follows:

$$V^{\pi}(s_{t}) = E\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}\right] \qquad (2)$$

Similarly, the state-action function *Q*^{π}(*s*_{t}, *a*_{t}) represents the expected reward when the agent is in state *s*_{t}, executes action *a*_{t}, and then follows the policy *π*. More formally:

$$Q^{\pi}(s_{t}, a_{t}) = E\left[r_{t} + \gamma V^{\pi}(s_{t+1})\right] \qquad (3)$$
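As an illustration of the value functions defined above, the following Python sketch evaluates *V*^{π} by iterating the Bellman recursion and then derives *Q*^{π}; the toy deterministic MDP (states, rewards, and policy) is hypothetical and chosen purely for illustration:

```python
# Toy deterministic MDP (hypothetical, for illustration only).
# ST[s][a] -> next state, R[s][a] -> reward, pi[s] -> action chosen by policy pi.
ST = {0: {0: 1, 1: 2}, 1: {0: 2, 1: 0}, 2: {0: 2, 1: 2}}
R  = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}, 2: {0: 0.0, 1: 0.0}}
pi = {0: 1, 1: 0, 2: 0}
gamma = 0.9

# Iterative policy evaluation: V(s) <- R(s, pi(s)) + gamma * V(ST(s, pi(s))).
V = {s: 0.0 for s in ST}
for _ in range(200):
    V = {s: R[s][pi[s]] + gamma * V[ST[s][pi[s]]] for s in ST}

# State-action values in the deterministic case:
# Q(s, a) = R(s, a) + gamma * V(ST(s, a)).
Q = {s: {a: R[s][a] + gamma * V[ST[s][a]] for a in ST[s]} for s in ST}
```

In this toy example, state 2 is absorbing with zero reward, so the evaluation converges in a handful of iterations; for stochastic transitions, the backup would average over the *ST* distribution instead.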

### SARL Algorithms

Single-agent RL (SARL) algorithms address the case of one learning agent interacting with the environment. Several families of algorithms have been proposed to compute the optimal policy *π*^{∗}, including dynamic programming (DP), Monte Carlo-based, and temporal-difference (TD) learning algorithms. DP techniques assume a perfect knowledge of the environment, i.e., of the reward (*R*) and of the state transition (*ST*) functions; hence, the exact value of *V*^{π}(⋅) can be computed by solving Eq. 2. The DP algorithms alternate between a policy-evaluation phase, during which the value of the current policy *V*^{π}(*s*) is determined for each state *s* ∈ *S*, and a policy-improvement phase, where the current policy *π* is modified into *π′* so that *π′*(*s*) = *argmax*_{a ∈ A} *Q*^{π}(*s*, *a*) [12]. Monte Carlo methods do not assume the knowledge of the environment, but they are mainly used on episodic tasks [12]. Vice versa, TD methods implement an online, step-by-step learning process without assuming a model of the environmental dynamics. Among the several existing TD-based solutions, we cite the popular Sarsa and Q-learning algorithms [16, 17]: they both update the Q-table after each received reward, till converging to the optimal *Q*^{∗} values. More specifically, each time the agent chooses action *a*_{t} from state *s*_{t} (receiving reward *r*_{t}), and action *a*_{t+1} from the next state *s*_{t+1}, the Sarsa algorithm [17] updates the *Q*(*s*_{t}, *a*_{t}) entry as follows:

$$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha \left[ r_{t} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_{t}, a_{t}) \right] \qquad (4)$$

where *α* is a learning rate factor. The Q-learning algorithm [16] employs a slightly different update rule, since it is independent from the policy being followed (off-policy learning), i.e.:

$$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha \left[ r_{t} + \gamma \max_{a \in A(s_{t+1})} Q(s_{t+1}, a) - Q(s_{t}, a_{t}) \right] \qquad (5)$$

Both algorithms converge to the optimal *Q*^{∗} values, under the assumption that all state-action pairs are visited an infinite number of times, and under proper tuning of the *α* factor [16, 17]. This poses a challenging trade-off between exploration and exploitation actions, i.e.: (i) insufficient exploration might affect the convergence to the optimal *Q*^{∗} values, while (ii) excessive exploration might determine performance fluctuations caused by the selection of random actions. A well-known approach to balance exploration and exploitation actions is via the Boltzmann equation [12], which assigns a probability to each action in each state as a graded function of the estimated *Q*(*s*, *a*) value:

$$\pi(s, a) = \frac{e^{Q(s,a)/TE}}{\sum_{a' \in A(s)} e^{Q(s,a')/TE}} \qquad (6)$$

where *TE* > 0 is the temperature parameter and controls the exploration/exploitation phases. Indeed, high temperature values cause the actions to be all nearly equiprobable, while, if *TE* → 0, the greedy action *a*^{∗} associated to the highest *Q*(*s*, *a*^{∗}) value is always selected, for each state *s* ∈ *S*.
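The TD update rules and the Boltzmann selection described above are compact enough to fit in a few lines of code. The Python sketch below contrasts the on-policy (Sarsa) and off-policy (Q-learning) updates; the Q-table layout (a dict keyed by state-action pairs) is an implementation choice of ours, not prescribed by the chapter:

```python
import math
import random

def boltzmann_action(Q, s, actions, TE):
    """Boltzmann (softmax) exploration: sample an action with probability
    proportional to exp(Q(s, a) / TE); high TE -> near-uniform choice."""
    weights = [math.exp(Q[(s, a)] / TE) for a in actions]
    r, acc = random.random() * sum(weights), 0.0
    for a, w in zip(actions, weights):
        acc += w
        if r <= acc:
            return a
    return actions[-1]

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy TD update: the target uses the action actually selected
    in the next state by the current policy."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy TD update: the target uses the greedy action in the
    next state, regardless of the policy being followed."""
    best = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```

Both updates shift the stored estimate toward the TD target by a fraction *α*; only the bootstrap term differs, which is exactly the on-policy/off-policy distinction discussed above.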

### MARL Algorithms

The multi-agent RL (MARL) framework generalizes the MDP to the case of a multi-agent environment. Let *N* be the number of learning agents and *S*^{i} and *A*^{i} be the state and action sets for agent *i*. The state of the MARL at time *t*, \(\widehat {s_t}\), is then defined as the combination of the individual states of the agents, i.e., \(\widehat {s_t}=\{s_t^1, s_t^2, \ldots s_t^N\}\). Similarly, the system action \(\widehat {a_t}\) is defined as the combination of the individual actions performed by the agents, i.e., \(\widehat {a_t}=\{a_t^1, a_t^2, \ldots a_t^N\}\); based on \(\widehat {a_t}\) and \(\widehat {s_t}\), a vector of rewards is produced, i.e., \(\widehat {r_t}=\{r_t^1, r_t^2, \ldots r_t^N\}\). According to the way such rewards are computed, and to the interactions among the agents, [14, 15] further classify MARL techniques as *fully cooperative*, *fully competitive*, or *hybrid*. In the first case, all the agents receive the same reward, i.e., \(r_t^1=\ldots =r_t^N\), and the goal is to determine the optimal joint policy maximizing a common discounted return; although such a policy could also be determined via SARL techniques assuming that all the agents keep the full Q-table of \(\widehat {s}\) and \(\widehat {a}\) values, most of the MARL algorithms work by decomposing the Q-table and introducing indirect coordination mechanisms [27]. In fully competitive MARL frameworks, a min-max principle can be applied, for instance, when *N* = 2, \(r_t^1=+\zeta \) and \(r_t^2=-\zeta \) [14]. 
Finally, hybrid MARL techniques apply to problems which are neither fully cooperative nor fully competitive, where the reward function might assume a complex shape depending on the joint action being implemented by the agents; this is the case, for instance, of agents competing for a shared resource, like SUs determining the optimal channels where to transmit and taking into account the interference caused by other players (we further investigate such a use-case in section “Case Study: RL-Based Joint Spectrum Sensing/Selection Scheme for CR Networks”). Hybrid MARL frameworks usually employ distributed coordination techniques derived from game theory (GT). We do not further elaborate on MARL techniques; interested readers can refer to [14] and [15] for a detailed illustration.
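To make the fully cooperative case concrete, the sketch below shows *N* independent learners that each keep a local Q-table but receive the same shared reward for the joint action, a common decomposition of the full joint table; the two-SU channel-selection toy (states, reward shape) is a hypothetical example of ours, not taken from the chapter:

```python
# Hypothetical fully cooperative toy: N SUs pick channels; the shared reward
# r_t^1 = ... = r_t^N is positive only when no two SUs collide on a channel.
N, STATES, ACTIONS = 2, [0, 1], [0, 1]
local_Q = [{(s, a): 0.0 for s in STATES for a in ACTIONS} for _ in range(N)]

def shared_reward(joint_a):
    """Common reward for the joint action: +1 if all chosen channels differ."""
    return 1.0 if len(set(joint_a)) == len(joint_a) else -1.0

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Standard Q-learning update on an agent's local table."""
    best = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

# One joint step: every agent updates its own table with the common reward.
joint_s, joint_a = (0, 0), (0, 1)
r = shared_reward(joint_a)
for i in range(N):
    q_update(local_Q[i], joint_s[i], joint_a[i], r, joint_s[i])
```

The decomposition keeps N · |S| · |A| entries instead of the (|S| · |A|)^N entries of the full joint Q-table, at the cost of requiring indirect coordination among the learners.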

## Reinforcement Learning in Cognitive Radio Scenarios: Pros and Cons

Figure 3 classifies the RL-CR works surveyed in this paper according to the adopted ML technique: 23% of them are based on RL schemes, more than the supervised learning schemes but still less than the GT-based approaches. In any case, Fig. 3 shows that there is no single ML solution fitting all CR problems. This is because RL techniques provide clear advantages but also formidable drawbacks when applied on CR-related use-cases. About the advantages, RL techniques can be considered highly suitable for CR applications because of the following characteristics:

- 1. *Experience-based Learning*. In supervised learning, a cognitive agent must be instructed on how to perform a classification task by means of a knowledge base containing both positive and negative instances. In CR-related applications, building the knowledge base from real experiments can pose practical issues in terms of scalability and costs. Another issue pertains to the generalization of the learning process, i.e., to the problem of classifying novel instances which are considerably different from those occurring in the knowledge base. This aspect is particularly critical in CR environments, since the network performance is affected by a high number of parameters and by environmental conditions (like the PU activity model, the SU traffic load, the channel error rate, etc.); as a result, a transmitting policy learnt by a CR agent via supervised techniques might not be effective in a different network scenario, or even in the same scenario in the presence of dynamic changes of the environmental conditions. Vice versa, RL techniques do not require the creation of a knowledge base; rather, they leverage trial-and-error interactions with the environment. In addition, some model-free algorithms like Sarsa and Q-learning [11, 16, 17] do not assume an a priori knowledge of the environmental dynamics (i.e., of the reward and state transition functions); as a result, the same learning algorithm deployed on different network scenarios can automatically discover differentiated transmitting policies, without any need of adaptation or tuning of the RL algorithm.
- 2. *Context adaptiveness*. Through the concepts of rewards and Q-values, the RL framework provides effective building blocks for implementing adaptive, spectrum-aware solutions. Indeed, since an RL agent continuously evaluates its current policy and improves it, any change in the received reward might cause a policy switch, or it might trigger new exploration actions, hence leading to the discovery of better actions to perform in some states. Moreover, the presence of aggregated rewards can indirectly boost the context-awareness in another way. As already said, the performance of CR networks can be affected by multiple factors, whose interactions might be difficult to model analytically. Instead of addressing a single factor at a time, an RL agent can observe all the factors as a state, receive an aggregate feedback (e.g., the cost of each transmission), and optimize a general goal as a whole, e.g., the throughput [28].
- 3. *Reduced complexity*. In most cases, RL techniques provide a simple yet effective modeling approach [12]. Model-free RL algorithms like Q-learning or Sarsa require only the storing of the Q-table. The number of state-action values can be further reduced via function approximation techniques; an example related to CR spectrum management can be found in [29]. In addition, it is worth remarking that the update rule of the Q-learning or Sarsa algorithms can be implemented in a few lines of code. This feature makes RL techniques suitable also in resource-constrained environments, like CR-based sensor networks [30], where the wireless devices must face severe energy issues.

At the same time, RL techniques present the following drawbacks when deployed in CR scenarios:

- 1. *Continuous Discovery.* Properly balancing the exploitation/exploration phases is a unique challenge of the RL framework [23]. On the one hand, RL agents are required to perform random actions in order to explore the state-action space and then compute the optimal policy. In dynamic environments, the exploration phase cannot be ended after the boot phase; rather, it must be continuously performed over time. This is the case, for instance, of SUs aiming to learn the available spectrum opportunities in a multiband scenario; while transmitting on a PU-free channel, the SU should also keep track of the opportunities on other channels, so that a spectrum handoff can be quickly performed in case of PU appearance [31]. On the other hand, a random selection might translate into suboptimal actions being executed, e.g., into the selection of low-quality or PU-busy channels, and hence lead to temporary performance degradation. Permanent performance degradation can occur when the exploration phase has been too short or too long; hence, the optimal trade-off between exploration and exploitation can be complex to achieve, as investigated in section “Case Study: RL-Based Joint Spectrum Sensing/Selection Scheme for CR Networks”.
- 2. *Convergence Speed.* Many RL techniques (especially time-discounted methods [12]) guarantee convergence to the optimal policy only if each action is executed in each state an infinite number of times. This is clearly not realistic for most CR applications; moreover, the fact that environmental conditions can quickly change over time poses additional requirements on the speed of convergence. The issue is further exacerbated in MARL scenarios, where the optimal joint action must be determined, e.g., in spectrum selection or power adaptation problems [32, 33] where the SUs should maximize their own performance while collectively mitigating the interference to PUs. For these reasons, MARL-based algorithms are often enhanced with GT mechanisms which guarantee the emergence of a Nash equilibrium under specific assumptions [25, 26].

## Reinforcement Learning in Cognitive Radio Scenarios: Applications-Driven Taxonomy

### Spectrum Sensing

In CR, spectrum sensing techniques play the crucial role of identifying the available spectrum resources for the SUs [1]. As a result, most of the research is focused on advanced signal processing schemes aimed at achieving robust PU detection under different signal-to-noise ratio (SNR) conditions [3]. Besides this, the scheduling of sensing actions is also a crucial task affecting the performance of the SUs [34], mainly due to the fact that half-duplex radios cannot transmit on a channel while listening to it. The optimal sensing schedule can be determined via experiments and analytical models [4] or dynamically learnt via trial-and-error interactions with the environment [35]. Regarding the latter, existing RL-based sensing schemes can be further classified into *individual* or *cooperative* approaches. We discuss them separately in the following.

#### Individual Sensing Scheduling

In the basic formulation, the set of states *S* = {*s*_{1}, *s*_{2}, .., *s*_{|S|}} represents the available (licensed) resources. On each channel *s*, a SU can perform three actions:

- *a*_{1}: sense channel *s* and transmit in case the channel is found idle (*exploitation*).
- *a*_{2}: sense channel *s′* ≠ *s* (*exploration*).
- *a*_{3}: switch to channel *s′* ≠ *s* (*exploitation*).

For actions *a*_{1} and *a*_{2}, the reward is expressed by the number of PU-free subcarriers detected on the sensed channel; vice versa, for action *a*_{3} the reward is always equal to zero. The study in [37] extends such formulation, by taking into account the channel switching delay in the reward of action *a*_{3} and by decoupling the transmit action from the sensing action; for sensing actions, the reward is equal to 1 in case of PU detection, 0 otherwise. Vice versa, the reward of transmit actions is computed as the average number of MAC retransmissions for each successful data transmission. The simulation results in [37] show that the proposed RL-based scheme is able to dynamically adjust the sensing frequency according to the perceived PU activity on each channel. Similarly, in [38], the authors aim to balance transmission and sensing actions on each channel; a cost function *C*_{s}(*τ*) is decreased each time a sensing action is performed on channel *s* and this latter is found idle. When *C*_{s}(*τ*) is lower than a threshold *Γ*, the SU can perform a transmission attempt; vice versa, if *C*_{s}(*τ*) > *Γ*, the SU defers its attempt and keeps sensing the channel. When the channel is found occupied by a PU, the cost function *C*_{s}(*τ*) is reset to a maximum value. In [39], the problem of determining the optimal sequence of channels sensed by each SU is formulated through the RL framework; here, a state is defined as an ordered couple <*o*_{k}, *f*_{j}>, where *o*_{k} is the current position in the sensing order and *f*_{j} is the *k*-th channel sensed in the current slot. At each state <*o*_{k}, *f*_{j}>, the list of available actions includes all the channels (not visited yet) which could be sensed at the next position of the sensing order (*o*_{k+1}). The reward function for a specific sensing-order action takes into account the time spent sensing the channels and the transmission rate experienced by the SU on the selected channels [39].
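A minimal simulation of the basic three-action formulation above can be sketched as follows; the channel count, subcarrier count, and per-channel PU activity probabilities are assumptions of ours, purely for illustration:

```python
import random

# Hypothetical sketch of the basic sensing-scheduling MDP: states are licensed
# channels, and the reward of sensing actions is the number of PU-free
# subcarriers detected (zero for a pure channel switch).
N_CHANNELS, N_SUBCARRIERS = 4, 16
pu_busy_prob = [0.1, 0.5, 0.8, 0.3]   # assumed per-channel PU activity

def sense(channel):
    """Return the number of PU-free subcarriers detected on the channel."""
    return sum(1 for _ in range(N_SUBCARRIERS)
               if random.random() > pu_busy_prob[channel])

def step(state, action, target=None):
    """Execute one of the three actions; return (next_state, reward)."""
    if action == "a1":                 # sense current channel and exploit it
        return state, sense(state)
    if action == "a2":                 # explore another channel
        return state, sense(target)
    return target, 0                   # a3: switch channel, zero reward
```

Plugging `step` into a TD learner would let the SU learn, per channel, how often sensing pays off relative to switching, mirroring the adaptive sensing-frequency behavior reported in the surveyed works.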

#### Cooperative Sensing Scheduling

Sensing techniques can be prone to errors in the presence of shadowing or multipath fading conditions on the current licensed channel. For this reason, cooperative sensing techniques [3] aim to enhance the PU detection by aggregating channel measurements from multiple SUs and by averaging the gathered results. However, the network overhead might limit the cooperative gain: for instance, the transmission delay might be higher in the presence of cooperative sensing, since each SU should gather the measurements from other peers before taking a decision about the spectrum availability. For this reason, studies like [40, 41] and [42] employ the RL framework in order to determine the optimal set of cooperating neighbors for each SU; the goal is to maximize the PU detection accuracy while avoiding unnecessary measurement sharing among correlated SUs. In [40], the set of states *S* for SU *i* coincides with the list of its neighbors, plus one start and one end state. There is an action which allows the agent to move between any couple of states; the sequence of actions corresponds to the list of cooperative sensing neighbors, i.e., the neighbors to query in order to get channel measurements. The reward function combines the amount of correlation among the gathered sensing samples plus the total reporting delay. In [42], the authors investigate how to coordinate the sensing actions of a secondary network, in order to meet the optimal trade-off between two goals: (i) the maximum number of spectrum opportunities is detected, and (ii) the probability of missed detection on each channel is kept below a safety threshold. Such probability is estimated based on the number of SUs currently sensing the channel. Since the SUs cannot directly observe the PU state on each channel, the sensing problem is formulated via a partially observable MDP (POMDP).

### Power Allocation

In both underlay and overlay CR spectrum paradigms, the SUs should properly tune their transmitting power levels so that the probability of generating harmful interference to any active PU is minimized. Differently from spectrum sensing, which can be considered an individual or, in presence of knowledge sharing, a fully cooperative task, power allocation is a natively hybrid competitive/collaborative task, since the reward function, i.e., the aggregated interference perceived by the PU receivers, depends on the joint action performed by the SUs, i.e., on the selected transmitting power level of each SU. For this reason, power allocation can be easily modeled via a MARL framework (see section “Overview of Reinforcement Learning”). A straightforward approach to determine the optimal power allocation consists in storing the complete MARL Q-table for each state/action/learning agent and computing the optimal \(\widehat {a_t}\) through Eq. 4. This methodology is employed in [43], assuming a centralized CR network with a single learning agent, i.e., the cognitive base station, which is in charge of determining the optimal power level of each SU, based on the cumulative interference caused to the PUs. In distributed deployments, storing and updating the complete MARL Q-table at each SU might not be practical, especially when the number of learning agents (i.e., the SUs) increases. For this reason, most recent works employ decentralized MARL with two different approaches. In the first case (described in section “RL-Based Power Allocation Based on Information Sharing”), the SUs share rewards or rows of their Q-tables after each local action, so that the interference caused by the joint action \(\widehat {a_t}\) can be computed. 
In the other case (described in section “RL-Based Power Allocation Without Information Sharing”), each SU acts according to the local information only, but the secondary network still aims to achieve a global coordination, often expressed by the notion of Nash equilibrium (NE).
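To see why storing the complete MARL Q-table quickly becomes impractical, note that its size grows with the joint action space, i.e., exponentially in the number of learning agents. A minimal sketch (state and action counts are illustrative):

```python
def joint_q_table_entries(n_states, n_actions, n_agents):
    """Entries needed to store Q(s, a_1..a_N) over the joint action space:
    |S| * |A|^N, where N is the number of learning agents (SUs)."""
    return n_states * n_actions ** n_agents

# e.g., 10 power levels per SU: the table grows exponentially with the SUs
for n_sus in (1, 2, 5, 10):
    print(n_sus, joint_q_table_entries(n_states=4, n_actions=10, n_agents=n_sus))
```

With 10 SUs the toy table already exceeds 10^10 entries, which motivates the decentralized approaches discussed next.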

#### RL-Based Power Allocation Based on Information Sharing

In [44, 45], the SUs must keep the aggregated interference perceived by the primary receivers below a threshold (*SINR*_{Th}). The problem is modeled through the MDP defined as follows:

- The state set is defined as the set of couples \(\{I_t^i,d_t^i\}\), where \(I_t^i \in \{0;1\}\) is a binary indicator specifying whether the CR base station *i* is generating an aggregated interference above or below the *SINR*_{Th} threshold, and \(d_t^i\) is the approximated distance between *i* and the protection contour region of the primary system.
- The action set coincides with the discrete set of power levels which can be assigned to each CR base station.
- The reward function \(R(i)=(SINR_t^i - SINR_{Th})^2\) expresses the difference between the SINR value measured by SU *i* and the expected threshold.

The received knowledge can be weighted through expertness measures [46, 47]: the weight *W*_{i,j} is a measure of how much SU *i* relies on the knowledge produced by SU *j*. The scheme in [47] employs information sharing in order to speed up the individual learning process; however, there is no guarantee that the optimal joint action will be determined. In order to fulfill this second requirement, the authors of [48] propose a cooperative RL-based power allocation scheme aimed to control the aggregated interference generated by SU femtocells. The MDP model is similar to [44, 45]; however, each SU shares only a row of its Q-table. At each time-slot, a SU chooses the action *a*_{i} maximizing the summation of the Q-values considering the current states of all the *N* neighbors, i.e.:

$$\displaystyle \begin{aligned} \widehat{a}_{i} = \mathrm{argmax}_{a \in A} \sum_{j=1}^{N} Q_j(s^j_t, a) \end{aligned} $$
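The row-sharing selection rule of [48] can be sketched as follows: each neighbor contributes the Q-row of its current state, and the SU picks the power level maximizing the summed Q-values. The table sizes and values below are illustrative.

```python
def best_joint_aware_action(own_row, neighbor_rows):
    """Pick the power level a maximizing Q_i(s_i, a) + sum_j Q_j(s_j, a),
    where each row holds the Q-values of one SU in its *current* state."""
    n_actions = len(own_row)
    totals = [own_row[a] + sum(row[a] for row in neighbor_rows)
              for a in range(n_actions)]
    return max(range(n_actions), key=totals.__getitem__)

# toy example: 3 power levels, 2 neighbors sharing one Q-table row each
own = [0.2, 0.9, 0.1]
neighbors = [[0.5, 0.1, 0.0], [0.4, 0.3, 0.2]]
print(best_joint_aware_action(own, neighbors))  # → 1 (totals: 1.1, 1.3, 0.3)
```

Only one row per SU crosses the network at each slot, which is the key saving over exchanging complete Q-tables.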

#### RL-Based Power Allocation Without Information Sharing

In this family of solutions, each SU learns the optimal power level based on local observations only. The MDP is defined as follows:

- The state set is defined as the set of couples {*I*_{i}, *p*_{i}(*a*_{i})}, where *I*_{i} ∈ {0;1} is a binary indicator specifying whether the SINR of SU *i* is higher or lower than a predefined safety threshold, and *p*_{i}(*a*_{i}) denotes the current power level.
- The action set coincides with the discrete set of power levels which can be assigned to each SU.
- The reward function *R* is a proxy for the energy efficiency of the transmission attempt, i.e., of the average number of bits received per unit of energy consumption; if *I*_{i}=1, the reward is set to zero.

After observing the current state *s*_{t}, SU *i* updates its Q-table based on the local reward only. In addition, each SU maintains a conjecture regarding the behavior of the other players, which is updated after each round; this allows the secondary network to move toward a global coordination (expressed by the notion of NE) without explicit information exchange [32].

### Spectrum Selection

Dynamic channel selection (DCS) constitutes the most investigated RL application in the field of wireless networking [20, 49, 50, 51, 52, 53]. In overlay CR networks, each SU must select the proper channel where to transmit in order to fulfill two main requirements: (i) minimize the interference caused to PU receivers tuned on the same or adjacent spectrum bands (*G*_{0}) and (ii) maximize its own performance, by taking into account the channel contention and the MAC collisions caused by other SUs operating on the same band (*G*_{1}). Moreover, the SUs should continuously execute channel selection in order to adapt to dynamic changes of the PU activities, to the traffic loads generated by the SUs, and to varying propagation and channel state conditions. It is easy to notice that the RL framework fits well the requirements of such adaptive protocol design. *G*_{0} is usually addressed via the SARL techniques presented in section “SARL-Based DCS”. Vice versa, meeting both *G*_{0} and *G*_{1} requires some form of coordination among the SUs: for this reason, the DCS problem is modeled via MARL techniques enhanced with game theory concepts, so that a stable channel allocation is achieved (details are provided in section “MARL-Based DCS”). Another way of classifying the existing RL-DCS schemes proposed in the literature is by focusing on the learning agent, i.e., on where the RL framework is implemented. The solutions presented in sections “SARL-Based DCS” and “MARL-Based DCS” refer to a scenario where channel selection is performed by each SU, and the PUs are unaware of the presence of opportunistic users. Vice versa, in spectrum trading models, the PUs lease portions of their spectrum to the SUs, receiving in return a monetary revenue; the problem formulation through the RL framework allows determining the optimal portion of spectrum band which can be leased to the SUs without compromising the QoS requirements of the primary network. Details about RL-based spectrum trading schemes are provided in section “RL-Based Spectrum Trading”.

#### SARL-Based DCS

In SARL-based schemes, each SU learns its channel selection policy in isolation, and no explicit coordination regarding *G*_{1} is taken into account: a SU might keep adjusting its operating channel as a consequence of the channel selections performed by the other SUs. SARL-based DCS schemes can be further classified as *state-full* or *stateless* approaches. In the first case, the RL framework contains both actions and states, hence following the traditional structure discussed so far. Examples of state-full SARL-based DCS schemes are presented in [20] and [49]. More specifically, in [49] the authors propose an opportunistic spectrum model, in which each SU is associated to a home band (where it has the right to transmit), but it may also seek spectrum opportunities in the licensed bands (on condition of minimizing the interference caused to the licensed users). The DCS problem is modeled through the following MDP:

- The state set \(S=\{s^i_0,s^i_1,\ldots ,s^i_M\}\) coincides with the list of available channels; \(s^i_0\) corresponds to the home channel of the SU, while \(s^i_1,\ldots ,s^i_M\) are the licensed channels.
- The action set *A* = {*a*_{0}, *a*_{1}, …, *a*_{P}} indicates the output of the channel selection process; *a*_{0} is the action of transmitting on the home channel, while each action *a*_{i}, *i* > 0 performs an exploration, i.e., the SU will transmit on the *M* licensed frequencies by following a specific channel sequence.
- The reward *R* is a function of the communication quality level, which can be determined via link metrics (e.g., SNR or packet success rate).

In stateless approaches, the RL framework is reduced to the sole action set *A*, which often coincides with the list of available channels; executing action *a*_{i} corresponds to switching to frequency *f*_{i}, sensing it, and transmitting in case no PU activity is detected. In [51], the Q-learning update rule of Eq. 4 is simplified to:

$$\displaystyle \begin{aligned} Q_{t+1}(a_i)= (1-\alpha) \cdot Q_{t}(a_i) + \alpha \cdot r_t \end{aligned} $$(10)

where *α* is the learning rate and *r*_{t} is the reward obtained by executing action *a*_{i} at time-slot *t*. In order to avoid oscillations in the learning process, sequential exploration is employed, i.e., only a single SU can undergo exploration within a neighborhood. In [30], the authors propose three different RL-based DCS schemes, all based on the update rule of Eq. 10 but adopting three different formulations of the reward function, i.e.: the transmission success rate in each epoch (named Q-learning+ scheme), the SINR metric (named Q-Noise scheme), and the SINR plus the historical behavior of the SUs (named Q-Noise+). A similar approach is also followed in [54], where the SUs aim to learn the optimal channel selection probability and the amount of PU activity on each channel. It is also worth noting that stateless RL frameworks can be considered instances of the multiarmed bandit (MAB) problem [55]. Several MAB-based DCS algorithms have been proposed in the literature. We cite, among others, the study in [56], where the authors apply two popular MAB schemes, named the UCB and the WD techniques, to the DCS problem in CR scenarios, assuming error-free sensing and that the temporal occupation of each channel follows a Bernoulli distribution. The goal of the learning process is hence to learn the PU occupation probability of each channel, limiting the cumulative regret over time. The MAB framework of [56] is extended in [57] by considering cooperation techniques, aimed to improve the sensing accuracy, and coordination techniques, aimed to mitigate the impact of secondary interference.
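The stateless scheme described above can be sketched as follows, assuming a binary reward (1 on a PU-free slot, 0 otherwise) and an ε-greedy selection rule; the vacancy probabilities are illustrative, and this is a sketch of the general idea rather than the exact algorithm of [51].

```python
import random

def run_stateless_dcs(vacancy, steps=4000, alpha=0.1, eps=0.1, seed=0):
    """Stateless Q-learning DCS: one Q-value per channel, updated with
    Q(a) <- (1-alpha)*Q(a) + alpha*r; epsilon-greedy channel selection."""
    rng = random.Random(seed)
    q = [0.0] * len(vacancy)
    for _ in range(steps):
        if rng.random() < eps:                      # explore a random channel
            a = rng.randrange(len(q))
        else:                                       # exploit the best estimate
            a = max(range(len(q)), key=q.__getitem__)
        r = 1.0 if rng.random() < vacancy[a] else 0.0  # PU-free slot -> reward 1
        q[a] = (1 - alpha) * q[a] + alpha * r          # stateless update (Eq. 10)
    return q

q = run_stateless_dcs([0.2, 0.9, 0.5])  # channel 1 is the least occupied
print(max(range(3), key=q.__getitem__))
```

After enough slots, the Q-values approach the per-channel vacancy probabilities, so the greedy channel coincides with the least occupied one.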

#### MARL-Based DCS

In MARL-based schemes, the SUs coordinate in order to meet both the *G*_{0} and *G*_{1} requirements. In [52], this is performed via a payoff propagation mechanism, i.e., each SU *i* maintains – in addition to the *Q*-table – a *μ*-table with size |*Γ*(*i*) × *A*|, where *Γ*(*i*) denotes its set of neighbors and *A* is the set of actions, which coincides with the channel list. Each time SU *i* plays action *a*_{k} (i.e., switches to channel *a*_{k}), it transmits a payoff message including its \(Q^i_{t+1}(a_k)\) value, while all the other SUs *j* ∈ *Γ*(*i*) store such value in their *μ*-table. When selecting the next channel \(\widehat {a}_{t+1}\), SU *i* takes into account both the local Q-table and the payoff table, i.e.:

$$\displaystyle \begin{aligned} \widehat{a}_{t+1} = \mathrm{argmax}_{a \in A} \Big[ Q^i_{t+1}(a) + \sum_{j \in \Gamma(i)} \mu_j(a) \Big] \end{aligned} $$

Other MARL-DCS schemes rely on learning automata: each SU updates its channel selection probability *p*_{t+1}(*k*) at each transmission attempt on channel *k* according to a linear reward-inaction model, i.e.:

$$\displaystyle \begin{aligned} p_{t+1} = p_{t} + b \cdot r_t \cdot (e_k - p_{t}) \end{aligned} $$

where 0 < *b* < 1 is the learning step, *r*_{t} is a function of the SINR metric perceived by the SU receiver, and *e*_{k} is the unit vector [25].
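The linear reward-inaction step can be sketched as a probability vector moving toward the unit vector *e*_{k} proportionally to the received reward; the step size `b` and the initial probabilities are illustrative.

```python
def reward_inaction_update(p, k, r, b=0.1):
    """Linear reward-inaction step: move the channel-selection probability
    vector p toward the unit vector e_k, proportionally to reward r in [0,1].
    The update preserves sum(p) == 1 since e_k and p both sum to 1."""
    return [p_i + b * r * ((1.0 if i == k else 0.0) - p_i)
            for i, p_i in enumerate(p)]

p = [0.25, 0.25, 0.25, 0.25]
p = reward_inaction_update(p, k=2, r=1.0)  # successful attempt on channel 2
print([round(x, 3) for x in p])            # [0.225, 0.225, 0.325, 0.225]
```

With reward 0 (the "inaction" case) the vector is left unchanged, which is exactly what distinguishes this model from reward-penalty variants.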

#### RL-Based Spectrum Trading

In spectrum trading schemes, the RL framework is implemented at the primary network side, in order to learn the optimal policy for admitting the incoming SU requests. The problem can be modeled through the following MDP:

- The state set *S* coincides with the number of SU traffic classes; the value of *s*_{i} is the number of accepted SU requests belonging to class *i*.
- The action set *A* = {0, 1} includes only two choices, corresponding to the options of accepting or refusing a new incoming request.
- The reward function *R* = *P* − *C* combines the expected monetary profit (*P*) that should be paid by the SU with the cost *C*, which is proportional to the number of already leased channels.

### Spectrum-Aware Routing

In most RL-based routing solutions, the route selection problem is modeled through the following MDP:

- The state set *S* coincides with the set of SU nodes *N*_{SU} in the network.
- The action set *A*_{i} is defined for each node *i* ∈ *N*_{SU}; more specifically, \(a_{j}^{(s,d)} \in A_{i}\) denotes the action of forwarding data toward next-hop *j*, where *s* and *d* are, respectively, the source and destination communication end-points.
- The reward \(R(i,a_{j}^{(s,d)})\) is a network metric reflecting the effectiveness for node *i* of using *j* as next-hop node toward the destination *d*.

In the original Q-routing algorithm [63], each node *i* maintains a *table* of *Q*-entries for each destination *d*; the entry *Q*_{i}(*j*, *d*) is the *expected delivery time* toward destination *d* when using next-hop node *j*. After forwarding a packet via node *j* toward destination *d*, node *i* updates its *Q*-table as follows [63]:

$$\displaystyle \begin{aligned} Q_i(j,d) \leftarrow Q_i(j,d) + \eta \cdot \left( q_i + \delta + \min_{z} Q_j(z,d) - Q_i(j,d) \right) \end{aligned} $$

where 0 < *η* ≤ 1 is the learning rate, *q*_{i} is the time spent by the packet in the queue of node *i*, *δ* is the transmission delay on the *i*−*j* link, and min_{z} *Q*_{j}(*z*, *d*) is the best delivery time at node *j* for destination *d*. The same learning framework as Q-routing has also been adopted by several CR routing protocols, like [21, 65] and [66], although properly adapting the reward function to the CR scenario. In [65], the reward \(R(i,a_{j}^{(s,d)})\) is the per-link delay, which also takes into account the retransmissions caused by SU-PU collisions. In [21], the reward function is an estimation of the average channel available time, i.e., the average OFF period of the PUs interfering over the bottleneck link along the route from *j* to *d*. In addition, the authors of [21] investigate the performance of the proposed RL routing protocol in a real test-bed environment using USRP platforms; the experimental results demonstrate that the RL scheme provides better results than a greedy approach in terms of end-to-end metrics (i.e., throughput and packet delivery ratio). A multi-objective Q-routing scheme for CR networks is discussed in [66]; more specifically, the proposed algorithm aims to minimize the packet loss rate under a desired constraint on the transmission delay. The multi-objective optimization is implemented by employing two rewards for each successful transmission (e.g., loss rate and delay) and by storing two separate Q-values at each node. The authors of [28] propose two RL-based spectrum-aware routing protocols for CR networks. Here, the *Q*_{i}(*j*, *d*) value denotes the number of available PU-free channels on the route from SU *i* to SU *d* via SU *j*; the SU *j* providing the highest Q-value is the preferred next-hop candidate. The Q-values are updated after each successful transmission, using a dual RL algorithm. In [67], the RL framework is used in order to properly tune the transmitting parameters of the popular AODV routing protocol. A different RL formulation for CR routing is proposed in [68], considering both the delay minimization requirement of each SU-SU flow and the interference minimization requirement of each SU-PU link and assuming no cooperation occurs among the SUs. The MDP is defined as follows:

- The state set of SU *i* is defined as the tuple <*η*_{i}(*t*), *λ*_{1}(*t*), *λ*_{2}(*t*), …, *λ*_{|PU|}(*t*)>, where *η*_{i}(*t*) is the packet arrival rate of SU *i* and *λ*_{x}(*t*) is the packet transmission rate of PU *x* at time *t*.
- The action set of SU *i* is \(A=\{a_0,NH_i^1,NH_i^2,\ldots,NH_i^k\}\), which includes the no-forwarding action *a*_{0} and the transmissions toward the next-hop nodes \(NH_i^j\).
- The reward function is equal to the delay experienced by packets flowing from SU *i* to the destination node, in case the interference caused to the PUs is kept below a safety threshold; it is set to a large value in case the no-forwarding action is selected or in case the interference caused to the PUs is higher than the threshold.
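The Q-routing update discussed in this section can be sketched as follows; the node identifiers, the delay values, and the learning rate `eta` are illustrative.

```python
def q_routing_update(q_i, j, d, queue_delay, link_delay, q_j, eta=0.5):
    """Q-routing step: after forwarding a packet to next-hop j, node i moves
    Q_i(j, d) toward (q_i + delta + min_z Q_j(z, d)), i.e., its queueing delay
    plus the link delay plus j's best estimated delivery time toward d."""
    best_from_j = min(q_j[d].values())            # min_z Q_j(z, d)
    target = queue_delay + link_delay + best_from_j
    q_i[d][j] += eta * (target - q_i[d][j])
    return q_i[d][j]

# toy tables: Q[dest][next_hop] = expected delivery time (ms)
q_i = {"d": {"j": 10.0}}
q_j = {"d": {"z1": 4.0, "z2": 6.0}}
new_val = q_routing_update(q_i, "j", "d", queue_delay=1.0, link_delay=2.0, q_j=q_j)
print(new_val)  # 8.5: moves from 10.0 halfway toward 1 + 2 + 4 = 7.0
```

The CR variants cited above keep this update structure and replace the delay target with CR-specific rewards (e.g., PU OFF times in [21]).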

## Reinforcement Learning in Cognitive Radio Scenarios: Learning Methodology-Driven Taxonomy

- *State representation*. The state of the SUs is often modeled through a single discrete variable or a combination of discrete variables. However, the reviewed RL-CR studies differ on whether the state variables are *fully observable* by the SU or only *partially observable*. In the first case, the state variable is expressed by parameters which are internal to the SU or by network conditions which can be measured by the DSA without perception errors. This is the case, for instance, of the MDP model proposed in [44], where each state takes into account both internal (i.e., the current distance from the PU) and network metrics (i.e., the aggregated interference caused by the secondary network). Vice versa, a minority of the cited studies takes into account the impact of perception errors on the network observation: we cite, for instance, the MDPs proposed in [41] and [42], where the SU state is the belief that a given frequency is vacant, hence subject to the accuracy of the sensing scheme.
- *Model representation*. Almost all the proposed RL-CR solutions employ *model-free* strategies, with very few exceptions [68, 69]: the agent does not keep any representation/estimation of the state transition and reward functions (the *T* and *R* functions in section “Overview of Reinforcement Learning”); rather, it updates the Q-table after each immediate reward through the popular Q-learning or Sarsa algorithms. This choice can be justified since, in several use-cases like DSA problems, the reward values are associated to network metrics (e.g., the actual throughput or the SNR) which are stochastic by nature and whose trends are hard to predict without full knowledge of the network and channel conditions; moreover, both the state transition and the reward functions can dynamically vary in nonstationary environments due, e.g., to SU or PU mobility. For this reason, most of the works prefer to adjust the policy as a blind consequence of the received reward, instead of attempting to unveil the rules behind it. Some foundational results on this topic are provided in [70], where the authors investigate the relationship between the learning capabilities of the SUs in RL-DCS applications and the complexity of the PU activity pattern, measured through the Ziv-Lempel complexity metric. The experimental results demonstrate that, for specific levels of Ziv-Lempel complexity, the PU spectrum occupancy pattern can be learnt effectively by the SUs, hence justifying the utilization of model-based solutions.
- *Reward representation*. The modeling of the reward function clearly depends on the specific CR use-case. However, we can distinguish between two main approaches: *absolute* representation and *communication-aware* representation. In the first case, the reward is a scalar value, which can assume positive or negative values in order to encourage good actions or penalize bad ones, but it is not related to any network metric. This is the case, for instance, of the RL-DCS scheme proposed in [29], where different rewards are introduced according to the outcome of each SU-SU transmission (i.e., successful, failed due to CR-PU interference, failed due to CR-CR interference, failed due to channel errors). Vice versa, in communication-aware schemes, the reward takes into account node-related (e.g., the energy efficiency in [47]), link-related (e.g., the SNR in [43]), or network-related (e.g., the throughput in [20]) metrics. The clear advantage of this second approach is that the Q-table will converge over time to the actual system performance for the selected metric; at the same time, this might introduce additional protocol complexity, especially in presence of aggregated or cumulative metrics (e.g., the end-to-end path delay in Q-routing [63]).
- *Action selection*. Strategies for action selection play a crucial role, since they are in charge of balancing the exploration/exploitation phases, which in turn affect the performance of the RL-based solutions. Two main strategies have been employed in the RL-CR literature reviewed so far: the *Boltzmann rule*, which is based on Eq. 6 and relies on the temperature parameter *TE* in order to balance the exploration/exploitation phases, and the *ε*-*greedy rule*, which performs a random action with probability *ε*, while it selects the current optimal action with probability 1 − *ε*. Both strategies might guarantee adaptiveness to nonstationary environments; however, the way the *TE* and *ε* parameters are set and discounted over time is barely addressed, except for [23, 24]. In more detail, [24] proposes an interesting self-evaluation mechanism which is added to a basic RL-DSA framework: each time the SU receives a predefined number of negative rewards in exploitation mode, it assumes that there has been a change in the environment and reacts by forcing an aggressive channel exploration phase.
- *Knowledge sharing*. In several CR use-cases modeled through a MARL, the SUs can share learnt information in order to speed up the exploration phase or to implement distributed coordination mechanisms. From the analysis presented in section “Reinforcement Learning in Cognitive Radio Scenarios: Applications-Driven Taxonomy”, we can further classify the existing MARL-CR works into three major families: (i) *no-sharing*, (ii) *reward-based*, and (iii) *Q-table-based*. The first category includes all works where the SUs update their Q-tables independently and without receiving any feedback from the other peers, although the instantaneous reward might depend on the joint action executed by the secondary network (e.g., the throughput in [30]). We include in this group also centralized approaches, where the global Q-table is managed by a network coordinator (e.g., the cognitive base station in [43]), and solutions where each SU keeps conjectures about the future behavior of the other SUs [32, 68]. The second category includes approaches like the docitive [44, 45] or payoff propagation paradigms [52], where the SUs share the immediate rewards or rows of the Q-table. The received data are then merged with the local data, by using expertness measures controlling the knowledge transfer [46, 47] or action selection methods for achieving distributed coordination [48]. The impact of knowledge sharing on MARL-DCS problems is further investigated in section “Case Study: RL-Based Joint Spectrum Sensing/Selection Scheme for CR Networks”.
- *Evaluation method*. The performance of RL-based solutions can be investigated through *simulation* studies, *testbeds*, or *analytical* models. The first two methods allow understanding the network performance gain introduced by RL techniques compared to non-learning approaches: to the best of our knowledge, [21, 71] and [72] are the only experimental studies in the literature. More specifically, [21] investigates the ability of an RL-enhanced routing protocol to select PU-free routes in a network environment consisting of ten USRP SU nodes, while [71] and [72] implement an RL-based DCS algorithm, respectively, over GNU Radio and USRP N210 platforms, and evaluate the way CR devices are able to learn the PU spectrum occupancy patterns. Both [21] and [71] confirm the effectiveness of RL-based solutions compared to state-of-the-art (non-ML-based) approaches. Analytical studies like [58] investigate the convergence of the proposed RL algorithms to the optimal solution. Such theoretical results can be considered highly relevant from a pure scientific perspective but less practical in real-world network deployments, since the convergence is asymptotic and does not account for the impact of the exploration phase on the short-term system performance.
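The two action-selection rules discussed under *Action selection* can be sketched as follows; the Q-values are toy inputs and the temperature `te` plays the role of *TE*.

```python
import math
import random

def epsilon_greedy(q, eps, rng):
    """Random action with probability eps, greedy action otherwise."""
    if rng.random() < eps:
        return rng.randrange(len(q))
    return max(range(len(q)), key=q.__getitem__)

def boltzmann(q, te, rng):
    """Sample an action with probability proportional to exp(Q/TE):
    high te -> near-uniform exploration, low te -> near-greedy exploitation."""
    weights = [math.exp(v / te) for v in q]
    return rng.choices(range(len(q)), weights=weights)[0]

rng = random.Random(42)
q = [1.0, 2.0, 0.5]
print(epsilon_greedy(q, eps=0.1, rng=rng), boltzmann(q, te=0.5, rng=rng))
```

Discounting `te` (or `eps`) over time moves either rule from exploration toward exploitation, which is the schedule the surveyed works rarely specify.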

## Case Study: RL-Based Joint Spectrum Sensing/Selection Scheme for CR Networks

In this section, we describe an application of RL techniques to CR networks, in order to highlight the gains and drawbacks of different RL algorithms and also to investigate the impact of the learning parameters on the system performance. By referring to the taxonomy presented in section “Reinforcement Learning in Cognitive Radio Scenarios: Applications-Driven Taxonomy”, we consider here a joint spectrum sensing/selection (JSS) problem, in which a SU must learn the optimal channel on which to transmit among the available frequencies and also the optimal balance between sensing and transmit actions on each channel. In section “System Model”, we introduce the system model and the problem goals. The problem is formulated by using the RL framework in section “RL-Based Problem Formulation”. Then, we evaluate the performance of the RL-based solutions by neglecting the impact of secondary interference (section “Analysis I: SU-PU Interference Only”). This assumption is removed in section “Analysis II: SU-PU and SU-SU Interference”.

### System Model

We consider a CR network composed of *N* couples of SUs operating within the same sensing domain. Each SU is equipped with a DSA transceiver, able to switch over *K* frequencies of the licensed band and over a common control channel (CCC) implemented in the unlicensed band. Each couple *i* is formed of one SU transmitter (\(SU_{i}^{tx}\)) and one SU receiver (\(SU_{i}^{rx}\)). Data packets are transmitted over a licensed channel, while the signaling traffic is transmitted over the CCC. On each frequency *f*_{j}, there is an active PU which transmits according to an exponential ON/OFF distribution with parameters <*α*_{j}, *β*_{j}>. Hence, frequency *j* is vacant with steady-state probability equal to \(\frac {\alpha _j}{\alpha _j+ \beta _j}\), while it is occupied by the PU with probability equal to \(\frac {\beta _j}{\alpha _j+ \beta _j}\). In addition, we model the packet error rate (PER) on each channel; let *φ*_{j} be the PER of channel *f*_{j}. Each \(SU_{i}^{tx}\) can implement three different time-slots:

- *Sensing* slot, i.e., \(SU_{i}^{tx}\) senses the frequency to which it is currently tuned, in order to determine the PU presence. The sensing length is equal to *t*_{slot}. We assume a default energy-detection sensing scheme [3]: let *p*_{D} indicate the probability of correct detection and 1 − *p*_{D} the probability of sensing errors (including both false-positive and false-negative instances).
- *Transmit* slot, i.e., \(SU_{i}^{tx}\) attempts transmitting exactly one packet to \(SU_{i}^{rx}\) by using a CSMA MAC scheme. In case the MAC ACK frame is not received, \(SU_{i}^{tx}\) retransmits the packet up to a maximum number of attempts equal to *MAX*_*ATTEMPTS*; afterward, the packet is discarded.
- *Switch* slot, i.e., \(SU_{i}^{tx}\) switches to a different licensed frequency and communicates the new channel to \(SU_{i}^{rx}\) on the CCC. Let *t*_{switch} represent the time overhead required for the handover.
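As a sanity check of the channel model above, the following sketch simulates an exponential ON/OFF PU and measures the empirical vacancy. It assumes the convention that *α*_{j} and *β*_{j} are the rates (inverse mean durations) of the ON and OFF periods, so that the steady-state vacancy equals *α*_{j}∕(*α*_{j} + *β*_{j}); the parameter values are illustrative.

```python
import random

def simulate_pu_vacancy(alpha, beta, cycles=50000, seed=7):
    """Alternate exponential ON/OFF periods (ON ~ Exp(alpha), OFF ~ Exp(beta),
    i.e., mean ON time 1/alpha and mean OFF time 1/beta) and return the
    fraction of time the channel is vacant; it approaches alpha/(alpha+beta)."""
    rng = random.Random(seed)
    on_time = sum(rng.expovariate(alpha) for _ in range(cycles))
    off_time = sum(rng.expovariate(beta) for _ in range(cycles))
    return off_time / (on_time + off_time)

# illustrative parameters: mean ON = 1/3 s, mean OFF = 1 s -> vacancy ~ 0.75
print(round(simulate_pu_vacancy(alpha=3.0, beta=1.0), 2))
```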

We denote with \(\tau _i^k\) the *k*-th slot of SU *i*, where \(\tau _i^k \in \{\)SENSE, TRANSMIT, SWITCH\(\}\), and with \(T_i=\{\tau _i^0, \tau _i^1, \ldots \}\) the slot schedule of SU *i*. Each SU *i* can decide its own schedule *T*_{i}, subject to these constraints: (i) if \(\tau _i^k\)=SENSE and the channel is found busy, then \(\tau _i^{k+1}\)=SENSE or \(\tau _i^{k+1}\)=SWITCH, i.e., the SU can keep sensing or switch to a different channel, but it cannot perform a transmission; and (ii) if \(\tau _i^k\)=SWITCH, then \(\tau _i^{k+1}\)=SENSE, i.e., the SU must sense the new channel in order to discover its availability. Similarly, let *NTX*_{i} be the total number of transmissions performed by SU *i* (including the retransmissions). We denote with *STX*_{i}(*l*) the outcome of the *l*-th transmission (with 0 ≤ *l* < *NTX*_{i}) performed by SU *i*. Based on the channel conditions and on the SU and PU activities, the *STX*_{i}(*l*) variable can assume one of these four values: (i) *STX*_{i}(*l*)=OK if the transmission has been acknowledged by \(SU_{i}^{rx}\); (ii) *STX*_{i}(*l*)=FAIL-PU-COLLISION if the transmission has failed due to a collision with an active PU (i.e., the PU is ON during the SU transmission); (iii) *STX*_{i}(*l*)=FAIL-SU-COLLISION if the transmission has failed due to a collision with other SU transmissions on the same channel; and (iv) *STX*_{i}(*l*)=FAIL-CHERROR if the transmission has failed due to channel errors.

The JSS problem can be formulated as the problem of determining the optimal schedule *T*_{i} of each SU *i*, 0 ≤ *i* < *N*, so that the total number of successful transmissions is maximized, while the probability to interfere with the PUs is kept below a predefined threshold (*ψ*). More formally:

(**JSS Problem**) Determine the optimal schedule *T*_{i} ∀*i*, 0 ≤ *i* < *N*, so that:

- \(\sum _{0 \leq i < N, 0 \leq l < NTX(i)} I(STX_i(l)=\mathtt {OK})\) is maximized;
- \(\frac {\sum _{0 \leq i < N, 0 \leq l < NTX(i)} I(STX_i(l)=\mathtt {FAIL-PU-COLLISION})}{\sum _{0 \leq i < N}{NTX(i)}} < \psi \),

where *I*(⋅) is the indicator function.
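Given a log of *STX* outcomes, the two JSS objectives can be computed directly; the sample log below is illustrative.

```python
def jss_metrics(outcomes):
    """Compute the two JSS objectives from the per-transmission outcomes:
    the total successes (to maximize) and the PU interference rate
    (to keep below the threshold psi)."""
    total = len(outcomes)
    ok = sum(1 for o in outcomes if o == "OK")
    pu_coll = sum(1 for o in outcomes if o == "FAIL-PU-COLLISION")
    return ok, (pu_coll / total if total else 0.0)

log = ["OK", "OK", "FAIL-CHERROR", "FAIL-PU-COLLISION", "OK",
       "FAIL-SU-COLLISION", "OK", "OK"]
successes, pu_rate = jss_metrics(log)
print(successes, pu_rate)  # 5 successes, interference rate 1/8 = 0.125
```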

### RL-Based Problem Formulation

We formulate the JSS problem of each SU through the MDP defined below (the case *K*=2 is the simplest instance). More in detail:

- The set of states *S* is the set of couples <*f*_{j}, {IDLE, BUSY, UNKNOWN}>, where the first field is the frequency *f*_{j} ∈ *K* and the second field is the estimated availability of frequency *f*_{j}, based on the output of the sensing action.
- The set of actions *A* = {SENSE, TRANSMIT, SWITCH} coincides with the slot types previously introduced.
- The reward function *R*: *S* × *A* → [0 : 1] is defined in different ways according to the action implemented by \(SU_{i}^{tx}\). More specifically, *R*(<*f*_{j}, ⋅ >, SENSE) is set to 1 if channel *f*_{j} is found BUSY, 0 otherwise. In case of a transmit action, *R*(<*f*_{j}, IDLE >, TRANSMIT) is set as follows:$$\displaystyle \begin{aligned} R(<f_j, \mathtt{IDLE}>,\mathtt{TRANSMIT})= 1 - \frac{\#Retransmissions}{\mathtt{MAX\_ATTEMPTS}} \end{aligned} $$(14)where *#Retransmissions* denotes the number of retransmissions performed. Hence, the reward is set to 1 if the packet is acknowledged without any retransmission. Vice versa, it is set to 0 if the packet is discarded because the maximum number of retransmission attempts has been reached. Finally, in case of a channel switch, the reward *R*(<*f*_{j}, UNKNOWN >, SWITCH) is set to zero.
- The transition function *T*: *S* × *A* × *S* → [0 : 1] is defined as follows, i.e.:$$\displaystyle \begin{aligned} \begin{array}{rcl}&\displaystyle T(<f_j, \cdot>,\mathtt{SENSE},<f_j, \mathtt{IDLE}>)= \frac{\alpha_j}{\alpha_j+ \beta_j} \\ &\displaystyle T(<f_j, \cdot>,\mathtt{SENSE},<f_j, \mathtt{BUSY}>)= \frac{\beta_j}{\alpha_j+ \beta_j} \\ &\displaystyle T(<f_j, \mathtt{IDLE}>,\mathtt{TRANSMIT},<f_j, \mathtt{IDLE}>)=1\\ &\displaystyle T(<f_j, \cdot>,\mathtt{SWITCH},<f_k, \mathtt{UNKNOWN}>)=1 \\ &\displaystyle T(<f_j, \mathtt{UNKNOWN}>,\mathtt{SENSE},<f_j, \mathtt{IDLE}>)=\frac{\alpha_j}{\alpha_j+ \beta_j} \\ &\displaystyle T(<f_j, \mathtt{UNKNOWN}>,\mathtt{SENSE},<f_j, \mathtt{BUSY}>)=\frac{\beta_j}{\alpha_j+ \beta_j} \end{array} \end{aligned} $$For all the other input values, the transition function is equal to 0. In the equations above, we neglect the impact of channel sensing errors, and we assume that the next channel *f*_{k} has already been determined. In any case, the *T* matrix is interesting only from the theoretical side, since in practice we assume that the SUs do not know its values.

- **Q-learning based**: each \(SU_{i}^{tx}\) stores a Q-value for each state/action couple and updates it after each TRANSMIT or SENSE action through Eq. 4. Moreover, at the end of slot *k*, \(SU_{i}^{tx}\) decides the next action through a probabilistic scheme. The probability of the TRANSMIT or SENSE actions is set through Eq. 6, while the probability of a SWITCH action is computed as follows:$$\displaystyle \begin{aligned} \begin{array}{rcl} p(<f_j,\cdot>,\mathtt{SWITCH})&\displaystyle =&\displaystyle \mathrm{max} \{ \mathrm{max}_{0 \leq v < K, v \neq j} Q(<f_v, \mathtt{IDLE}>, \mathtt{TRANSMIT})\\ &\displaystyle -&\displaystyle Q(<f_j,\mathtt{IDLE}>, \mathtt{TRANSMIT}) , \theta \} \end{array} \end{aligned} $$(15)Here, the first term of the *max* operator denotes the maximum gain achievable when switching to a channel different from the current one (*f*_{j}), while the parameter 0 ≤ *θ* ≤ 1 indicates the probability of spectrum exploration. In case a SWITCH action is implemented, another probabilistic step is executed in order to select the channel: with probability *θ*, a random channel is selected in the range {0…*K* − 1}; otherwise, the best channel is selected (the one equal to argmax_{0≤v<K,v ≠ j} *Q*(<*f*_{v}, IDLE >, TRANSMIT) − *Q*(<*f*_{j}, IDLE >, TRANSMIT)). The values of the temperature *TE* (see Eq. 6) and of *θ* are set to large initial values and then progressively discounted at each slot in order to ensure convergence, but they cannot decrease below predefined minimum values *TE*_{min} and *θ*_{min}. We investigate the impact of the initial temperature value *TE* in section “Analysis I: SU-PU Interference Only”.
- **Sarsa based**: the scheme works similarly to the Q-learning one, except for the update rule of the TRANSMIT and SENSE actions, which is based on Eq. 5.
- **Information Dissemination Q-learning based** (IDQ-Learning): the scheme works similarly to the Q-learning one; in addition, each \(SU_{i}^{tx}\) shares the information about its state, action, and received reward at each slot. All the \(SU_{j}^{tx}\), *j* ≠ *i*, update their Q-values consequently, as if the action had been performed locally.
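The SWITCH probability of Eq. 15 can be sketched as follows; the Q-values are toy inputs and `theta` is the exploration floor *θ*.

```python
def switch_probability(q_transmit, j, theta):
    """Eq. 15 sketch: p(SWITCH) is the maximum between the best achievable
    Q-gain over the other channels and the exploration floor theta."""
    best_other = max(v for i, v in enumerate(q_transmit) if i != j)
    return max(best_other - q_transmit[j], theta)

q = [0.4, 0.9, 0.6]  # toy Q(<f_v, IDLE>, TRANSMIT) values, one per channel
print(switch_probability(q, j=0, theta=0.1))  # 0.9 - 0.4 = 0.5
print(switch_probability(q, j=1, theta=0.1))  # no gain: falls back to theta
```

On the best channel the gain term is negative, so the SU still switches with (at least) probability *θ*, which keeps the exploration alive.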

### Analysis I: SU-PU Interference Only

We modeled the CR network scenario and the RL-based algorithms through the NS2-CRAHN simulator described in [34]. Unless stated otherwise, we considered a scenario composed of 20 SU couples (i.e., *N*=20) and *K*= 6 licensed channels. The other parameters are *p*_{D}=10%, *MAX*_*ATTEMPTS*=7, *TE*_{ini}=50, *TE*_{min}=5, *θ*_{ini} = 80*%*, and *θ*_{min} = 10*%*. Each of the six channels exhibits different PU activity levels (PUL) and PER values, as reported in the table below.

We consider a constant bit rate (CBR) application; each \(SU_{i}^{tx}\) generates a new packet destined to \(SU_{i}^{rx}\) every 0.005 seconds. The packet length is 1000 bytes.

In this analysis, we assume that each SU does not interfere with other SUs tuned to the same channel. Hence, the goal of the learning algorithm is to identify vacant spectrum opportunities over the *K* channels. We compare the performance of the three RL-based schemes described in section “RL-Based Problem Formulation” with those of a non-learning scheme, named *Sequential* in the following. The protocol operations of the sequential scheme are straightforward: each SU senses the channel before any transmission attempt; in case the current channel *f*_{j} is detected as busy, the SU switches to channel *f*_{(j+1)%K}; otherwise, it transmits one packet and then senses the channel again.
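A minimal sketch of the sequential baseline follows, assuming perfect sensing and illustrative vacancy/PER vectors (not the simulation parameters of this section).

```python
import random

def sequential_scheme(vacancy, per, steps=10000, seed=3):
    """Non-learning baseline: sense the current channel before every
    transmission attempt; if busy, hop to channel (j+1) % K, otherwise
    transmit one packet and sense again. Sensing errors are ignored."""
    rng = random.Random(seed)
    k, j, delivered = len(vacancy), 0, 0
    for _ in range(steps):
        if rng.random() < vacancy[j]:           # channel sensed idle
            if rng.random() > per[j]:           # packet survives channel errors
                delivered += 1
        else:                                   # busy: hop to the next channel
            j = (j + 1) % k
    return delivered

d = sequential_scheme(vacancy=[0.3, 0.8, 0.5], per=[0.1, 0.05, 0.2])
print(d)
```

Because the hop order is fixed, the baseline keeps revisiting poor channels instead of concentrating on the best one, which is the gap the RL schemes exploit.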

Figure 8a shows the network throughput of the four schemes for an increasing number of SU couples (*N*) on the *x*-axis. It is easy to notice that all the RL-based algorithms greatly outperform the sequential scheme. No significant differences can be appreciated between Q-learning and Sarsa. Vice versa, IDQ-learning provides the highest throughput, and the gain produced by the cooperation becomes more evident when increasing the number of involved SUs. This result can be justified as follows: (i) the RL-based schemes estimate the quality of each channel and then concentrate most of the SU transmissions on channels 5, 2, and 4, characterized by favorable PUL and PER values, and (ii) on these channels, the RL-based schemes reduce the amount of sensing actions while still guaranteeing satisfactory PU detection (see next results). In addition, compared to Q-learning and Sarsa, IDQ-learning guarantees better exploration and quicker convergence of all the SUs to the optimal state-action policy (which is equal for all the SUs). This is made evident in Fig. 8b, which shows the throughput over time for a network scenario with *N* = 20. Until second 100, both Q-learning and Sarsa perform poorly because they are still exploring, owing to the high values of the *TE* and *θ* parameters. After second 100, the throughput of both schemes sharply increases because they exploit the learnt policy more aggressively. In IDQ-learning, the impact of random actions is greatly mitigated since the exploration phase is shorter: at each round, each SU can receive *N* different rewards and hence discounts the *TE* and *θ* parameters more quickly. At the same time, the exploration phase is more effective, since all the SUs converge to the same policy guaranteeing the highest throughput. Figure 8c confirms the trend of Fig. 8a by showing the packet delivery ratio (PDR) of the four schemes for different values of *N*. The PDR of the sequential, Q-learning, and Sarsa schemes is not affected by *N* since – in this analysis – we neglect the impact of SU channel contention. The PDR of IDQ-learning increases with *N*, again due to the positive impact of the SU cooperation.
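The exploration behavior described above can be illustrated with a softmax (Boltzmann) action-selection sketch, in which the temperature *TE* is discounted once per received reward; the decay factor and Q-values below are illustrative assumptions, not the chapter's actual settings:

```python
import math
import random

def softmax_action(q_values, te):
    """Boltzmann action selection: high te -> near-uniform exploration,
    low te -> greedy exploitation of the learnt Q-values."""
    m = max(q_values)  # subtract max for numerical stability
    weights = [math.exp((q - m) / te) for q in q_values]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for action, w in enumerate(weights):
        acc += w
        if r < acc:
            return action
    return len(q_values) - 1

TE_MIN, DECAY = 5.0, 0.99  # illustrative values

def decay_temperature(te, n_rewards=1):
    """Each received reward discounts TE once: cooperative SUs receiving
    N rewards per round leave the exploration phase N times faster."""
    return max(TE_MIN, te * DECAY ** n_rewards)
```

With *N* = 20 rewards per round, the temperature reaches its minimum roughly 20 times sooner than with a single local reward, which is why the exploration phase of IDQ-learning is shorter.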

In Fig. 9a, on the *y*-axis, we show the PU interference probability, defined as the ratio of SU transmissions ending in state FAIL-PU-COLLISION over the total number of transmissions performed by the SUs. The Q-learning and Sarsa schemes guarantee a value which is comparable with the performance of the sequential scheme and in any case lower than 2%. The IDQ-learning exhibits a counterintuitive behavior: the risk of interference with PUs even reduces when increasing the number of potential interferers (i.e., the SUs), again thanks to the reward dissemination mechanisms, through which all the SUs converge to the optimal channel sequence and to the optimal balance between SENSE and TRANSMIT actions on each channel. To this purpose, Fig. 9b shows the average frequency rate of each action (different color bars) on each channel (on the *x*-axis), experienced by IDQ-learning (*N* = 20). It is easy to notice that our learning scheme (i) concentrates most of the transmissions on channels 5 and 2 (which are the most favorable in terms of PUL and PER values) and (ii) significantly reduces the frequency of sensing actions on channel 5, while it maximizes sensing on channels 0 and 3 (characterized by high PU activity). Hence, the graph confirms the ability of the IDQ-learning scheme to learn the optimal sequence of spectrum opportunities, and the amount of sensing on each channel, without knowing in advance the PER and PUL values. Finally, in Fig. 9c we investigate the impact of the initial temperature value (*TE*_{ini}) on the throughput (on the *y*1 axis) and on the number of channel switches (on the *y*2 axis). Again, we evaluate the IDQ-learning scheme with *N* = 20. We can notice that there is an optimal *TE*_{ini} value (equal to 10 in our case) maximizing the throughput: when *TE*_{ini} < 10, the exploration phase is too short and hence the optimal policy cannot be discovered; vice versa, when *TE*_{ini} ≫ 10, the impact of suboptimal actions during exploration becomes significant. On the other side, the number of channel switches increases proportionally with *TE*_{ini}.

### Analysis II: SU-PU and SU-SU Interference

**Distributed Q-Learning** (DistQ-learning): the protocol works similarly to the Q-learning scheme described in section “RL-Based Problem Formulation”, with two significant differences. First, each time \(SU_{i}^{tx}\) performs a TRANSMIT action on a given channel, it computes a local reward \(r_i^L=1 - \frac {\#Retransmissions}{\mathtt {MAX\_ATTEMPTS}}\) and shares it with all the other SUs. By averaging the received \(r_j^L\) values, *j*≠*i*, each \(SU_{i}^{tx}\) computes the average network reward \(r^G=\frac {\sum _{0\leq i < N} r_i^L}{N}\), which is a proxy for the network throughput. Second, once the *r*^{G} value is computed, each \(SU_{i}^{tx}\) updates the Q-table for the TRANSMIT action on channel *f*_{j} by following the FMQ rule [27], i.e.:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} Q(<f_j, \mathtt{IDLE}>, \mathtt{TRANSMIT})&\displaystyle =&\displaystyle Q(<f_j, \mathtt{IDLE}>, \mathtt{TRANSMIT})\\ &\displaystyle +&\displaystyle \frac{C^i_{\mathrm{max}}(r^G_{\mathrm{max}}, f_j)}{C^i(f_j)} \cdot r^G_{\mathrm{max}}(f_j) \end{array} \end{aligned} $$(16)

where \(r^G_{\mathrm {max}}(f_j)\) is the maximum global reward observed when \(SU_{i}^{tx}\) is tuned to channel *f*_{j}, \(C^i_{\mathrm {max}}(r^G_{\mathrm {max}}, f_j)\) is the number of times such a value has been observed, and *C*^{i}(*f*_{j}) is the total number of transmission attempts on frequency *f*_{j}. As a result, each SU pushes its policy toward channels where an optimal network reward *r*^{G} is achieved, although no SU keeps track of the global MARL Q-table, nor does it make conjectures about the opponents’ behaviors; the *r*^{G} value indirectly reflects the optimality of the joint action performed by the other SUs, and based on it each SU adjusts its own policy.
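The reward sharing and the FMQ-style update of Eq. 16 can be sketched as follows (a minimal Python sketch of the per-SU bookkeeping; class and variable names are ours, not the simulator's):

```python
from collections import defaultdict

def local_reward(retransmissions, max_attempts=7):
    """r_i^L = 1 - #Retransmissions / MAX_ATTEMPTS."""
    return 1.0 - retransmissions / max_attempts

def network_reward(local_rewards):
    """r^G: average of the locally shared rewards, a proxy for throughput."""
    return sum(local_rewards) / len(local_rewards)

class FMQLearner:
    """Per-SU state for the FMQ-style TRANSMIT update (Eq. 16): the Q-value
    is biased toward the best global reward r^G_max seen on each channel,
    weighted by how often that best reward was observed."""
    def __init__(self):
        self.q = defaultdict(float)         # Q(<f_j, IDLE>, TRANSMIT)
        self.attempts = defaultdict(int)    # C^i(f_j)
        self.best_rg = defaultdict(float)   # r^G_max(f_j)
        self.best_count = defaultdict(int)  # C^i_max(r^G_max, f_j)

    def update_transmit(self, f_j, r_global):
        self.attempts[f_j] += 1
        if r_global > self.best_rg[f_j]:
            self.best_rg[f_j], self.best_count[f_j] = r_global, 1
        elif r_global == self.best_rg[f_j]:
            self.best_count[f_j] += 1
        # Eq. 16: Q += (C^i_max / C^i) * r^G_max
        self.q[f_j] += (self.best_count[f_j] / self.attempts[f_j]) \
                       * self.best_rg[f_j]
```

Note that each SU needs only its own counters and the shared scalar *r*^{G}: no joint Q-table over the other agents' actions is ever built, which is what keeps the scheme scalable.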

In this second analysis, we additionally model the SU-SU contention on each channel and consider a larger spectrum pool (*K* = 9 instead of *K* = 6); the channels 7-8-9 have the same PUL-PER profiles of channels 3-4-5 (see Table 1). Figure 10a shows the network throughput when varying the number of transmitting SUs (*N*). The throughput values are lower than in Fig. 8a and also decrease with *N* as a consequence of the SU contention on each channel. We can notice that the DistQ-learning scheme provides significantly better performance than the sequential scheme, but also than the IDQ-learning scheme. With the latter, all the SUs attempt to discover the same policy, i.e., they transmit on the same channels and balance TRANSMIT and SENSE actions in the same way. Vice versa, the DistQ-learning scheme aims at achieving implicit coordination among SUs through Eq. 16; the SUs learn differentiated policies – at least regarding the Q-value of the TRANSMIT action on each channel – so that the maximum, network-wide reward can be achieved. This is also visible in Fig. 10b, which shows the network throughput on each of the *K* channels, for the three different algorithms and *N* = 20. The IDQ-learning scheme concentrates most of the SU transmissions on channels 8 and 4 (which are the most favorable to SUs in terms of PUL/PER profiles), clearly increasing the contention level on those frequencies and hence the risk of packet losses due to SU-SU collisions. Vice versa, the DistQ-learning scheme achieves a better distribution of the SUs over the available spectrum opportunities, which also translates into enhancements in terms of PDR, as depicted in Fig. 10c.

PUL/PER profiles of the *K* = 6 licensed channels

Channel index | PUL level | PER level | PUL profile | PER value |
---|---|---|---|---|
0 | High | Medium | < 10,2> | 50% |
1 | Medium | Medium | < 5,5> | 50% |
2 | Low | Medium | < 2,10> | 50% |
3 | High | Low | < 10,2> | 10% |
4 | Medium | Low | < 5,5> | 10% |
5 | Low | Low | < 2,10> | 10% |
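Under the common assumption that each PUL profile ⟨a,b⟩ in the table denotes the mean busy/idle durations of an exponential ON/OFF activity model (an interpretation on our part; the text only states that exponential occupancy models are widely used), a channel's PU occupancy trace could be generated as:

```python
import random

def pu_trace(mean_on, mean_off, horizon):
    """Generate (state, duration) pairs for an exponential ON/OFF PU model.
    E.g., channel 0 of the table would use mean_on=10, mean_off=2 (high PUL).
    Starting in the OFF state is an arbitrary choice."""
    trace, t, busy = [], 0.0, False
    while t < horizon:
        mean = mean_on if busy else mean_off
        dur = random.expovariate(1.0 / mean)  # exponential holding time
        trace.append(("ON" if busy else "OFF", dur))
        t += dur
        busy = not busy
    return trace
```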

## Conclusions and Open Issues

We conclude by highlighting some open research issues in the RL-CR domain.

- *Accurate performance evaluation in real-world CR network scenarios*. Despite a few experimental works [68, 71, 72], the evaluation of RL-based solutions has mainly been conducted through simulation studies, in which the spectrum occupancy pattern is modeled by using well-known probability distributions (like the exponential [4] or the Bernoulli [56] distributions). However, spectrum bands might exhibit different complexity of the PU occupancy pattern, based on the signal waveform and regulations of the licensed users transmitting on those frequencies [70]. Hence, differentiated RL learning algorithms (e.g., considering model-free or model-based approaches, MDP or POMDP frameworks) can be tested and deployed on different frequencies, also based on the predictability of the spectrum availability. Additional works based on real-world spectrum traces are required in order to address this issue.
- *Analysis of RL techniques for CR applications with strict QoS network requirements*. Several multimedia applications pose strict QoS requirements (e.g., the maximum packet drop rate or jitter for video-streaming services) that must be continuously met by the network, in order not to affect the users' experience negatively. In RL-CR solutions, the SUs must continuously balance the exploitation and exploration phases during the system lifetime: when an SU selects random, possibly suboptimal actions, the QoS requirements of the CR multimedia applications are not guaranteed. Proper techniques that provide effective state-action exploration without causing detectable performance degradation for CR applications with strict QoS requirements have not been designed so far.
- *Enhancement of the learning framework*. Most of the reviewed works in the RL-CR literature provide an accurate modeling of the CR network scenario, including the operations of the main actors (PUs and SUs); the same level of complexity cannot be found on the learning side, since in most cases the RL framework consists of a straightforward application of model-free algorithms (mostly Q-learning or Sarsa). However, RL theory is vast and is not limited to such results [12]; moreover, it is continuously extended by novel contributions coming from an active research community [73]. CR networking can benefit from the novel RL architectures introduced so far: we cite, among others, the utilization of deep RL techniques (i.e., RL frameworks enhanced with artificial neural networks for state-action representation) for better coordination in distributed scenarios [74] or for more efficient exploration [75].

## References

- 1.Akyildiz IF, Lee WY, Vuran MC, Mohanty S (2006) NeXt generation/dynamic spectrum access/cognitive radio wireless networks: a survey. Comput Netw J 50(1):2127–2159
- 2.Mitola J (2000) Cognitive radio: an integrated agent architecture for software defined radio. PhD Dissertation, KTH Stockholm
- 3.Yucek T, Arslan H (2009) A survey of spectrum sensing algorithms for cognitive radio applications. IEEE Commun Surv Tutor 11(1):116–130
- 4.Lee WY, Akyildiz I (2008) Optimal spectrum sensing framework for cognitive radio networks. IEEE Trans Wirel Commun 7(10):3845–3857
- 5.Sherman M, Mody AN, Martinez R, Rodriguez C, Reddy R (2008) IEEE standards supporting cognitive radio and networks, dynamic spectrum access, and coexistence. IEEE Commun Mag 46(7):72–79
- 6.Flores AB, Guerra RE, Knightly EW (2013) IEEE 802.11af: a standard for TV white space spectrum sharing. IEEE Commun Mag 51(10):92–100
- 7.Clancy C, Hecker J, Stuntebeck E, O'Shea T (2007) Applications of machine learning to cognitive radio networks. IEEE Wirel Commun 14(4):47–52
- 8.Mitchell T (1997) Machine learning. McGraw Hill, New York
- 9.Gavrilovska L, Atanasovski V, Macaluso I, DaSilva L (2013) Learning and reasoning in cognitive radio networks. IEEE Commun Surv Tutor 15(4):1761–1777
- 10.Bkassiny M, Li Y, Jayaweera SK (2013) A survey on machine-learning techniques in cognitive radios. IEEE Commun Surv Tutor 15(3):1136–1159
- 11.Wang W, Kwasinski A, Niyato D, Han Z (2016) A survey on applications of model-free strategy learning in cognitive wireless networks. IEEE Commun Surv Tutor 18(3):1717–1757
- 12.Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
- 13.Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell Res 4(1):237–285
- 14.Busoniu L, Babuska R, De Schutter B (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst Man Cybern 38(2):156–171
- 15.Busoniu L, Babuska R, De Schutter B (2006) Multi-agent reinforcement learning: a survey. In: Proceedings of IEEE ICARCV, Singapore
- 16.Watkins CJ, Dayan P (1992) Technical note: Q-learning. Mach Learn 8(1):279–292
- 17.Rummery GA, Niranjan M (1994) Online Q-learning using connectionist systems. Technical report
- 18.Di Felice M, Chowdhury KR, Wu C, Bononi L, Meleis W (2010) Learning-based spectrum selection in cognitive radio ad hoc networks. In: Proceedings of IEEE/IFIP WWIC, Lulea
- 19.Yau KLA, Komisarczuk P, Teal PD (2012) Reinforcement learning for context awareness and intelligence in wireless networks: review, new features and open issues. J Netw Comput Appl 35(1):235–267
- 20.Yau KLA, Komisarczuk P, Teal PD (2010) Applications of reinforcement learning to cognitive radio networks. In: Proceedings of IEEE ICC, Capetown
- 21.Raza Syed A, Alvin Yau KL, Qadir J, Mohamad H, Ramli N, Loong Keoh S (2016) Route selection for multi-hop cognitive radio networks using reinforcement learning: an experimental study. IEEE Access 4:6304–6324
- 22.Vucevic N, Akyildiz IF, Romero JP (2010) Cooperation reliability based on reinforcement learning for cognitive radio networks. In: Proceedings of IEEE SDR, Boston
- 23.Jiang T, Grace D, Mitchell PD (2011) Efficient exploration in reinforcement learning-based cognitive radio spectrum sharing. IET Commun 5(10):1309–1317
- 24.Ozekin E, Demirci FC, Alagoz F (2013) Self-evaluating reinforcement learning based spectrum management for cognitive ad hoc networks. In: Proceedings of IEEE ICOIN, Bangkok
- 25.Macaluso I, DaSilva L, Doyle L (2012) Learning Nash equilibria in distributed channel selection for frequency-agile radios. In: Proceedings of IEEE ECAI, Montpellier
- 26.Lall S, Sadhu AK, Konar A, Mallik KK, Ghosh S (2016) Multi-agent reinforcement learning for stochastic power management in cognitive radio network. In: Proceedings of IEEE Microcom, Durgapur
- 27.Kapetanakis S, Kudenko D (2002) Reinforcement learning of coordination in cooperative multi-agent systems. In: Proceedings of AAAI, Menlo Park
- 28.Wahab B, Yang Y, Fan Z, Sooriyabandara M (2009) Reinforcement learning based spectrum-aware routing in multi-hop cognitive radio networks. In: Proceedings of IEEE CROWNCOM, Hannover
- 29.Chowdhury K, Wu C, Di Felice M, Meleis W (2010) Spectrum management of cognitive radio using multi-agent reinforcement learning. In: Proceedings of IEEE AAMAS, Toronto
- 30.Faganello LR, Kunst R, Both CB (2013) Improving reinforcement learning algorithms for dynamic spectrum allocation in cognitive sensor networks. In: Proceedings of IEEE WCNC, Shanghai
- 31.Wu Y, Hu F, Kumar S, Zhu Y, Talari A, Rahnavard N, Matyjas JD (2014) A learning-based QoE-driven spectrum handoff scheme for multimedia transmissions over cognitive radio networks. IEEE J Sel Areas Commun 32(11):2134–2148
- 32.Chen X, Zhao Z, Zhang H (2013) Stochastic power adaptation with multiagent reinforcement learning for cognitive wireless mesh networks. IEEE Trans Mob Comput 12(11):2155–2166
- 33.Zhou P, Chang Y, Copeland JA (2010) Learning through reinforcement for repeated power control game in cognitive radio networks. In: Proceedings of IEEE Globecom, Miami
- 34.Di Felice M, Chowdhury K, Kim W, Kassler A, Bononi L (2011) End-to-end protocols for cognitive radio ad hoc networks: an evaluation study. Perform Eval (Elsevier) 68(9):859–875
- 35.Reddy YB (2008) Detecting primary signals for efficient utilization of spectrum using Q-learning. In: Proceedings of IEEE ITNG, Las Vegas
- 36.Berthold U, Fu F, Van Der Schaar M, Jondral FK (2008) Detection of spectral resources in cognitive radios using reinforcement learning. In: Proceedings of IEEE DySPAN, pp 1–5
- 37.Di Felice M, Chowdhury KR, Kassler A, Bononi L (2011) Adaptive sensing scheduling and spectrum selection in cognitive wireless mesh networks. In: Proceedings of IEEE Flex-BWAN, Maui
- 38.Arunthavanathan S, Kandeepan S, Evans RJ (2013) Reinforcement learning based secondary user transmissions in cognitive radio networks. In: Proceedings of IEEE Globecom, Atlanta
- 39.Mendes AC, Augusto CHP, da Silva MWR, Guedes RM, de Rezende JF (2011) Channel sensing order for cognitive radio networks using reinforcement learning. In: Proceedings of IEEE LCN, Bonn
- 40.Lo BF, Akyildiz IF (2010) Reinforcement learning-based cooperative sensing in cognitive radio ad hoc networks. In: Proceedings of IEEE PIMRC, Istanbul
- 41.Lunden J, Kulkarni SR, Koivunen V, Poor HV (2011) Exploiting spatial diversity in multiagent reinforcement learning based spectrum sensing. In: Proceedings of IEEE CAMSAP, San Juan
- 42.Lunden J, Kulkarni SR, Koivunen V, Poor HV (2013) Multiagent reinforcement learning based spectrum sensing policies for cognitive radio networks. IEEE J Sel Top Signal Process 7(5):858–868
- 43.Jao Y, Feng Z (2010) Centralized channel and power allocation for cognitive radio network: a Q-learning solution. In: Proceedings of IEEE FNMS, Florence
- 44.Galindo-Serrano A, Giupponi L, Blasco P, Dohler M (2010) Learning from experts in cognitive radio networks: the docitive paradigm. In: Proceedings of IEEE CROWNCOM, Cannes
- 45.Galindo-Serrano A, Giupponi L (2010) Distributed Q-learning for aggregated interference control in cognitive radio networks. IEEE Trans Veh Tech 59(4):1823–1834
- 46.Chowdhury KR, Di Felice M, Doost-Mohammady R, Meleis W, Bononi L (2011) Cooperation and communication in cognitive radio networks based on TV spectrum experiments. In: Proceedings of IEEE WoWMoM, Lucca
- 47.Emre M, Gur G, Bayhan S, Alagoz F (2015) CooperativeQ: energy-efficient channel access based on cooperative reinforcement learning. In: Proceedings of IEEE ICCW, London
- 48.Saad H, Mohamed A, ElBatt T (2012) Distributed cooperative Q-learning for power allocation in cognitive femtocell networks. In: Proceedings of IEEE VTC-Fall, Quebec City
- 49.Venkatraman P, Hamdaoui B, Guizani M (2010) Opportunistic bandwidth sharing through reinforcement learning. IEEE Trans Veh Tech 59(6):3148–3153
- 50.Bernardo F, Augusti R, Perez-Romero J, Sallent O (2010) Distributed spectrum management based on reinforcement learning. In: Proceedings of IEEE CROWNCOM, Hannover
- 51.Yau KLA, Komisarczuk P, Teal PD (2010) Context-awareness and intelligence in distributed cognitive radio networks: a reinforcement learning approach. In: Proceedings of IEEE AusCTW, Canberra
- 52.Yau KLA, Komisarczuk P, Teal PD (2010) Enhancing network performance in distributed cognitive radio networks using single-agent and multi-agent reinforcement learning. In: Proceedings of IEEE LCN, Denver
- 53.Yau KLA, Komisarczuk P, Teal PD (2010) Achieving context awareness and intelligence in distributed cognitive radio networks: a payoff propagation approach. In: Proceedings of IEEE WAINA, Singapore
- 54.Kakalou I, Papadimitriou GI, Nicopoliditis P, Sarigiannidis PG, Obaidat MS (2015) A reinforcement learning-based cognitive MAC protocol. In: Proceedings of IEEE ICC, London
- 55.Agrawal R (1995) Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Adv Appl Prob 27(1):1054–1078
- 56.Robert C, Moy C, Wang CX (2014) Reinforcement learning approaches and evaluation criteria for opportunistic spectrum access. In: Proceedings of IEEE ICC, Sydney
- 57.Jouini W, Di Felice M, Bononi L, Moy C (2012) Coordination and collaboration in secondary networks: a multi-armed bandit based framework. Technical report. Available at: https://arxiv.org/abs/1204.3005
- 58.Li H (2010) Multi-agent Q-learning for competitive spectrum access in cognitive radio systems. In: Proceedings of IEEE SDR, Boston
- 59.Alsarhan A, Agarwal A (2010) Resource adaptations for revenue optimization in cognitive mesh network using reinforcement learning. In: Proceedings of IEEE GLOBECOM, Miami
- 60.Teng Y, Zhang Y, Niu F, Dai C, Song M (2010) Reinforcement learning based auction algorithm for dynamic spectrum access in cognitive radio networks. In: Proceedings of IEEE VTC Fall, Ottawa
- 61.Cesana M, Cuomo F, Ekici E (2011) Routing in cognitive radio networks: challenges and solutions. Ad Hoc Netw (Elsevier) 9(3):228–248
- 62.Chowdhury KR, Di Felice M (2009) SEARCH: a routing protocol for mobile cognitive radio ad-hoc networks. Comput Commun (Elsevier) 32(18):1983–1997
- 63.Boyan JA, Littman ML (1994) Packet routing in dynamically changing networks: a reinforcement learning approach. Adv Neural Inform Process Syst 7(1):671–678
- 64.Chetret D, Tham C, Wong L (2004) Reinforcement learning and CMAC-based adaptive routing for MANETs. In: Proceedings of IEEE ICON, Singapore
- 65.Al-Rawi AHA, Alvin Yau KL, Mohamad H, Ramli N, Hashim W (2014) A reinforcement learning-based routing scheme for cognitive radio ad hoc networks. In: Proceedings of IEEE WMNC, Vilamoura
- 66.Zheng K, Li H, Qiu RC, Gong S (2012) Multi-objective reinforcement learning based routing in cognitive radio networks: walking in a random maze. In: Proceedings of IEEE ICNC, Maui
- 67.Safdar T, Hasbulah HB, Rehan M (2015) Effect of reinforcement learning on routing of cognitive radio ad hoc networks. In: Proceedings of IEEE ISMSC, Ipon
- 68.Pourpeighambar B, Dehghan M, Sabaei M (2017) Non-cooperative reinforcement learning based routing in cognitive radio networks. Comput Commun (Elsevier) 106(1):11–23
- 69.Dowling J, Curran E, Cunningham R, Cahill V (2005) Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing. IEEE Trans Syst Man Cybern 35(3):360–372
- 70.Macaluso I, Finn D, Ozgul B, DaSilva LA (2013) Complexity of spectrum activity and benefits of reinforcement learning for dynamic channel selection. IEEE J Sel Areas Commun 31(11):2237–2246
- 71.Ren Y, Dmochowski P, Komisarczuk P (2010) Analysis and implementation of reinforcement learning on a GNU radio cognitive radio platform. In: Proceedings of IEEE CROWNCOM, Cannes
- 72.Moy C, Nafkha A, Naoues M (2015) Reinforcement learning demonstrator for opportunistic spectrum access on real radio signals. In: Proceedings of IEEE DySPAN, Stockholm
- 73.Dayan P, Niv Y (2008) Reinforcement learning: the good, the bad and the ugly. Curr Opin Neurobiol 18(1):1–12
- 74.Naparstek O, Cohen K (2017) Deep multi-user reinforcement learning for distributed dynamic spectrum access. CoRR abs/1704.02613
- 75.Ferreira PVR, Paffenroth R, Wyglinski AM, Hackett TM, Bilen SG, Reinhart RC, Mortensen DJ (2017) Multi-objective reinforcement learning-based deep neural networks for cognitive space communications. In: Proceedings of IEEE CCAA, Cleveland