Q-Learning-Based Dynamic Spectrum Access in Cognitive Industrial Internet of Things

In recent years, the Industrial Internet of Things (IIoT) has attracted growing attention from both academia and industry. Meanwhile, when traditional wireless sensor networks are applied to complex industrial fields with high requirements for real-time performance and robustness, how to design an efficient and practical cross-layer transmission mechanism needs to be fully investigated. In this paper, we propose a Q-learning-based dynamic spectrum access method for IIoT that introduces cognitive self-learning to address the difficulty of distributed and ordered self-access for unlicensed terminals. We first devise a simplified MAC access protocol for unlicensed users sharing a single available channel. Then, a Q-learning-based multi-channel access scheme is proposed for unlicensed users migrating to other lower cells. The channel with the largest Q value is preferred for selection. Every mobile terminal stores and updates its own channel list because of the distributed network mode and imperfect sensing ability. Numerical results are provided to evaluate the performance of the proposed method for dynamic spectrum access in IIoT. The proposed method outperforms traditional simplified access methods without self-learning capability in terms of channel usage rate and conflict probability.


Introduction
In the context of Industry 4.0, the Industrial Internet of Things (IIoT) provides a new driving force for the development of highly efficient, low-energy, flexible and smart factories by introducing sensing capability, cloud computing, intelligent robotics and wireless sensor networks into the modern industrial environment [1][2][3][4]. An inevitable tendency is emerging toward global mobile networks that combine artificial intelligence, automation, warehousing systems and production facilities in the shape of Cyber-Physical Systems as well as cognitive IIoT [5][6][7].
In the process of continuously sensing the industrial field, exchanging control information, self-learning and adapting to dynamic networks, and deciding and performing transmission strategies, plenty of challenges need to be addressed. Many techniques, including intelligent algorithms, deep learning and cognitive radio, have been applied to enhance the robustness, accuracy and efficiency of the IIoT [8][9][10][11]. In particular, many technical solutions that have been adopted in wireless sensor networks can be adapted to the IIoT environment after being revised according to the new characteristics of IIoT [12][13][14][15][16][17]. In [12], a wireless-sensor-network-based safe navigation scheme for micro flying robots in the IIoT was proposed to detect static and dynamic obstacles in indoor environments. In [13], a three-stage multi-view stacking ensemble machine learning model based on hierarchical time series feature extraction was designed to resolve the anomaly detection problem in IIoT. In [14], a multi-level DDoS mitigation framework was devised to defend against DDoS attacks in IIoT, which includes the edge computing level, fog computing level and cloud computing level. In [15], the authors developed an IIoT-based solution to ensure a real-time connection between products and assembly lines. The proposed dynamic cycle time setting method considers the varying complexity of the product on the basis of the real-time information offered by sensor nodes and indoor positioning systems.
Due to the high standards for reliability and robustness of measurement systems in the industrial field, the network structure of common wireless sensor networks should be refined to fit IIoT [18][19][20]. Too many tiers in IIoT increase the complexity of protocol management and hardware design, yet too few tiers restrict the network's flexibility and application areas. Besides, with the increase of network nodes deployed in the IIoT, how to improve network efficiency and system capacity still needs to be investigated in depth. In [21], the authors presented a three-factor user authentication protocol for wireless sensor networks to overcome the weaknesses of other traditional protocols. The proposed protocol is robust and energy efficient for IoT applications. In [22], to solve the security challenges, the authors exploited consortium blockchain technology to propose a secure energy trading system denoted as energy blockchain. Besides, a credit-based payment scheme was designed to support fast and frequent energy trading. In [23], the authors proposed a practical authorization framework based on annotated metadata for securing IIoT objects. The method supports multi-dimensional and large-scale data processing with a flexible and efficient authorization model to meet new security requirements for IIoT. In [24], the study designed a resilient section selection mechanism of power fingerprinting applied to device load recognition, so as to determine the transmission time and select the power fingerprinting section to be resiliently transferred. Furthermore, in [25], the authors used an energy-efficient architecture for IIoT, which involves a sense entities domain where huge amounts of energy are consumed by a tremendous number of nodes. Besides, many techniques applied in other networks have been referenced for solving the relevant problems in IIoT [21][26][27][28][29][30][31][32].
In this paper, we propose a Q-learning-based dynamic spectrum access strategy for IIoT to improve spectrum efficiency and reduce access conflicts. In IIoT, with the increase of system nodes and network complexity, how to devise an efficient MAC protocol and network structure adapted to the new characteristics of IIoT becomes significant. We consider a multitiered heterogeneous IIoT with lower mesh networks in which a number of small cells share spectrum to enhance spectrum efficiency. Furthermore, we assume that the spectrum sensing ability of the sensor nodes in IIoT is not perfect and that all the lower nodes operate in distributed mode, so the self-learning function should be exploited to dynamically access the shared channels. We first design a self-learning-based MAC protocol for the lower-tier sensor users in IIoT. Then, for the case where many unlicensed users compete for access to a limited number of channels, a Q-learning-based spectrum access method is proposed. In the process of channel selection, unlicensed users choose the channel with the largest Q value by using Q-learning.
The main contributions of this paper can be highlighted as follows.
- A Q-learning-based self-learning method is introduced to dynamic spectrum access in IIoT, taking into account the complex multitiered structure of the industrial network field.
- A distributed dynamic spectrum access strategy is proposed to decrease conflict probability in mesh-network-based IIoT.
- Numerical results are provided to verify the performance of our proposal. Comparison tests are performed to present the conflict probability and channel usage.
The remainder of this paper is organized as follows. We introduce the system model for dynamic spectrum access in IIoT in Section 2. Section 3 gives the details of our Q-learning-based method. Numerical results are supplied to analyze the performance of the spectrum access strategy in Section 4. Finally, we conclude this paper in Section 5.

System model
According to the characteristics of the Industrial Internet of Things (IIoT), we adopt a two-tier architecture in this paper, as shown in Fig. 1. Based on the specific situation of the control object in the industrial field and the relation between different kinds of industrial devices, the wireless sensor nodes installed in these devices need to be appropriately arranged. In this case, we consider that the sensor nodes in the lower tier form several mesh networks and the cluster heads in the upper tier form mesh or star topology networks. In the lower-tier mesh networks, sensor nodes within one mesh network communicate with the corresponding cluster head in a single hop, and the cluster heads connect with each other under a wireless local area network (WLAN) protocol. Generally, sensor nodes communicating with other nodes within the same mesh network do not need to send any information to the corresponding cluster head. The periodic data transmission of detection tasks in the industrial field is assumed to occur mainly within the same cluster. When transmission across different mesh networks is required, the cluster head relays the signal through the upper-tier WLAN protocol to another cluster.
Since most of the transmission tasks in the industrial field focus on data acquisition and exchange within each cluster, we suppose the sensor nodes in one cluster constitute a small cell. However, as the number of cells grows, a spectrum sharing mechanism is required to save bandwidth and improve spectrum efficiency. Thus, sensor cells located far from each other can share the same band to perform spectrum reuse. In this case, when sensor nodes in the IIoT environment move across various lower cells, a proper dynamic spectrum access scheme should be devised to avoid severe inter-network interference.
In the process of designing a dynamic spectrum access strategy for this heterogeneous IIoT, the following three characteristics should be taken into account.
1. The spectrum sensing ability of sensor nodes is limited, which means a node cannot accurately identify whether the current spectrum occupancy is caused by licensed devices within the same cell or by other devices migrating from adjacent cells.
2. The sensor nodes in the lower tier form a mesh network, which means they work in a distributed mode.
3. The sensor nodes cannot gather all the required information from a central controller due to the distributed communication mode.
In this situation, we exploit the intelligence of sensor nodes in IIoT and, based on the Q-learning algorithm and a cognitive MAC protocol with memory, propose a distributed multi-channel dynamic spectrum access strategy.
For dynamic spectrum access in the heterogeneous mesh networks of IIoT, due to the distributed structure, we assume that the spectrum sensing capability of the sensor nodes is not perfect and that the nodes cannot obtain all the essential information about other nodes from a cluster head acting as a central controller. Thus, we assume the sensor nodes are intelligent devices with self-learning ability that adapt to the dynamic spectrum environment and select a proper channel to access.
In our system model, a node working in its own cell is regarded as a licensed user, and a node migrating to another mesh cell is regarded as an unlicensed user. Hence, the unlicensed users should dynamically access the spectrum in this IIoT. During this process, the strategy of licensed users is that whenever they have a packet to transmit, they initiate their transmission immediately without any consideration of the unlicensed sensor nodes.
The spectrum access strategy of unlicensed terminals is to adopt a slot-memorized MAC protocol which assigns a transmission probability to every potential status of one slot. Thus, we have the function $f : y_s \to [0, 1]$, where $y_s$ denotes the status set, expressed as $y_s = \{idle, busy, success, failure\}$. An unlicensed user whose status was $y \in y_s$ in the previous slot transmits data with probability $f(y)$ in the present slot.
We take the non-invasive protocol and fairness definition into account when designing the MAC protocol.
Non-invasive protocol: If $f(busy) = 0$, then the spectrum access is non-invasive. If an unlicensed terminal complies with the non-invasive protocol, it waits in the slot that follows a busy slot. Hence, the non-invasive protocol ensures that a licensed user, once it succeeds in setting up a transmission, is not disturbed by unlicensed users.
Fairness: Define a fairness level $\theta \in (0, 1]$. Suppose no licensed transmission is present. Once an unlicensed user succeeds in spectrum access, the probability that this user transmits successfully again in the next slot is denoted as $p_{success}$, which is determined by $f(success)$, the transmission probability after a successful slot, $f(busy)$, the transmission probability after a busy slot, and $N$, the slot number. Then, the average number of continuous transmissions for the unlicensed user is
$$\sum_{k=1}^{\infty} k \, p_{success}^{k-1} (1 - p_{success}) = \frac{1}{1 - p_{success}}. \quad (2)$$
When licensed users do not have data to transmit, the average number of an unlicensed user's continuous transmissions is required to be $1/\theta$, so we define the fairness level $\theta$ as
$$\theta = 1 - p_{success}. \quad (3)$$
As the fairness level decreases, an unlicensed user has the opportunity to use the current channel for a longer time after it performs a successful transmission, which makes other unlicensed users wait longer to access this channel.
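The link between the repeat-success probability and the fairness level can be illustrated with a short worked example. The block below is a sketch under stated assumptions, not a formula taken from the text: it assumes that successive slots are independent trials and, for the explicit form of $p_{success}$, that $M$ unlicensed users contend for the channel and that every other user observed a busy slot while one user transmitted. The symbol $M$ and this explicit form are introduced here only for illustration.

```latex
% K: number of consecutive successful transmissions of one unlicensed user
\begin{align}
  % assumed explicit form with M contending unlicensed users (illustrative)
  p_{success} &= f(success)\,\bigl(1 - f(busy)\bigr)^{M-1} \\
  % geometric run length: the run ends with probability 1 - p_{success}
  \mathbb{E}[K] &= \frac{1}{1 - p_{success}} \\
  % non-invasive case f(busy) = 0, so p_{success} = f(success)
  f(busy) = 0,\; f(success) = 0.75
    &\;\Rightarrow\; p_{success} = 0.75,\quad \mathbb{E}[K] = 4,\quad \theta = 0.25
\end{align}
```

In this example a user that has just succeeded keeps the channel for four slots on average, which matches an average run length of $1/\theta$ with $\theta = 0.25$.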
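The cognitive MAC protocol with memory function adopts the non-invasive mode, which gives priority to the licensed users. The fairness level of a spectrum access protocol is expressed by Eq. 3. Under the non-invasive protocol, the other unlicensed users observe a busy slot while one user transmits successfully and therefore wait in the next slot, so the successful user transmits successfully again with probability $f(success)$. Combining this with Eq. 3, we have
$$f(success) = 1 - \theta. \quad (4)$$
The other factors $f(idle)$ and $f(failure)$ are denoted by $q$ and $r$, respectively. The MAC protocol with memory function at fairness level $\theta$ can then be written as
$$f(idle) = q, \quad f(busy) = 0, \quad f(success) = 1 - \theta, \quad f(failure) = r. \quad (5)$$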
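The memory-based protocol described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the non-invasive setting $f(busy) = 0$ and the relation $f(success) = 1 - \theta$ that follows from the fairness definition. The function names `make_protocol` and `decide_transmit` are hypothetical and do not come from the paper.

```python
import random

def make_protocol(theta, q, r):
    """Build f: previous-slot status -> transmission probability.

    Assumes the non-invasive, theta-fair form: f(busy) = 0 and
    f(success) = 1 - theta, with f(idle) = q and f(failure) = r.
    """
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must lie in (0, 1]")
    if not (0.0 <= q <= 1.0 and 0.0 <= r <= 1.0):
        raise ValueError("q and r must be probabilities")
    return {"idle": q, "busy": 0.0, "success": 1.0 - theta, "failure": r}

def decide_transmit(protocol, previous_status, rng=random):
    """Transmit in the present slot with probability f(previous_status)."""
    return rng.random() < protocol[previous_status]

def mean_run_length(protocol):
    """Average number of consecutive transmissions with no licensed traffic.

    With f(busy) = 0 the other users wait, so the run continues with
    probability f(success) and has a geometric length.
    """
    return 1.0 / (1.0 - protocol["success"])
```

For example, `make_protocol(0.25, 0.5, 0.3)` never transmits after a busy slot and yields an average run of four consecutive transmissions, that is, $1/\theta$.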

Q-learning-based dynamic spectrum access
The cognitive MAC protocol with memory function can overcome the limitation of spectrum sensing in the physical layer and increase channel utilization efficiency. However, when the numbers of channels and users increase significantly, how to perform more efficient spectrum access in IIoT based on the MAC protocol and the users' self-learning ability still needs to be solved. To describe dynamic spectrum access in IIoT in detail, we propose a Q-learning-based access algorithm whose mechanism is presented in Fig. 2.
In this paper, our proposal focuses on the multi-channel environment and fully takes the sensor nodes' intelligence into account by using a Q-learning-based method. Then, combined with the MAC protocol with memory function, sensor nodes can dynamically access the idle channels in IIoT in spectrum sharing mode.
It should be noted that unlicensed users (the users migrating to other lower cells) need a way to cope with the frequent appearance of licensed users (the users working in their own cells). In this work, we introduce a new index to judge whether licensed users have emerged. This index is the number of BUSY statuses observed by unlicensed user $i$ on channel $j$, denoted by $b_{ij}$.
At the beginning of every slot, unlicensed users first analyze all the currently available channels and judge whether the selected channel's BUSY number exceeds the given threshold $th_B$. The threshold $th_B$ provides a tolerance level that determines how many BUSY slots an unlicensed user waits through before evading when licensed users emerge. The analytical process can be presented as follows.
If the number of BUSY statuses does not exceed the threshold $th_B$, the unlicensed user considers that there is no licensed user on the current channel, and spectrum access is allowed for the unlicensed user. Meanwhile, if the previous slot's status is BUSY, the sensor node updates the BUSY number of this channel as $b_{ij} = b_{ij} + 1$.
If the current channel's BUSY number exceeds the threshold $th_B$, the user determines that a licensed user is very likely to be present on this channel. The unlicensed user should then evade to avoid severe interference and enter the channel selection process again.
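The BUSY-count test described above can be sketched as follows. This is a minimal illustration for a single unlicensed user $i$, assuming that the counter is kept per channel and is never reset (the text does not say when, or whether, it is cleared). The class name `BusyTracker` and its methods are hypothetical.

```python
from collections import defaultdict

class BusyTracker:
    """Per-channel BUSY counters b_ij for one unlicensed user i."""

    def __init__(self, th_b):
        self.th_b = th_b                 # tolerance threshold th_B
        self.counts = defaultdict(int)   # channel j -> b_ij

    def record(self, channel, previous_status):
        """Apply b_ij = b_ij + 1 when the previous slot was BUSY."""
        if previous_status == "busy":
            self.counts[channel] += 1

    def should_evade(self, channel):
        """True when b_ij exceeds th_B, i.e. a licensed user is likely present."""
        return self.counts[channel] > self.th_b
```

With `th_b = 2`, a channel is still considered usable after two BUSY slots and is abandoned after the third, at which point the user returns to channel selection.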
The algorithm consists of two main parts: channel selection and channel access. In this paper, the spectrum selection process is based on the Q-learning algorithm, and unlicensed users receive the maximum delayed reward from the optimal channel selection strategy. The principle of our channel selection strategy is that unlicensed users improve their channel efficiency by learning which channels have the most successful access experience.
Each unlicensed user computes its Q value according to its own success experience, then predicts the reward through the Q value. The Q value denotes the value of a status or an action. The Q function is written as $Q(s, a)$, the reward received by the unlicensed user at status $s$ with action $a$. The Q value is then updated from the learning rate $\alpha$, the reward $r(i, j)$, the discount factor $\gamma$ and the value estimate $V(s)[t+1]$, where $Q_i(s, a_j)$ denotes the Q value function attained when user $i$ adopts the corresponding action (selecting channel $j$). Besides, $a_j \in A$, where $A$ denotes the set of actions by which unlicensed users affect the spectrum environment. For the available channels to be selected by unlicensed users, $a_j$ denotes that the user chooses channel $j$; $r(i, j)$ denotes the reward function in the environment after unlicensed user $i$ selects channel $j$; $\gamma$ ($0 \le \gamma \le 1$) is the discount factor representing the importance of the anticipated future reward relative to the current reward; $\alpha$ ($0 \le \alpha \le 1$) is the learning rate; and $V(s)[t+1]$ is the estimated value of the next status.
Reward function R: The reward value should reflect the learning objective of the proposed algorithm. The target of this paper is to select the channel with the most success experience; therefore, the reward function R is related to whether the unlicensed user succeeds in accessing the current channel.
Then, when user $i$ accesses channel $j$, the reward function $r(i, j)$ is given by
$$r(i, j) = \begin{cases} 1, & \text{succeed in accessing;} \\ 0, & \text{wait due to BUSY status;} \\ -1, & \text{unsuccessful.} \end{cases} \quad (9)$$
For the access process, the slot structure of unlicensed users is given in Fig. 3. In one slot, the operation of an unlicensed user is as follows: it first judges whether it needs to reselect the channel, then proceeds to the access decision and judges whether to send data according to our strategy. If the decision is to transmit data, it performs the transmission and collects ACK feedback, and it continues sensing in every following slot.
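The channel-selection step can be illustrated with a short sketch. It uses the three-valued reward from Eq. (9) together with the standard tabular Q-learning update, which is assumed here and may differ in detail from the paper's own update rule. It also assumes a single status $s$, so that the next-status value is the largest Q value over all channels. The function names and the default values of `alpha` and `gamma` are hypothetical.

```python
def reward(outcome):
    """r(i, j): +1 for a successful access, 0 for waiting on BUSY, -1 otherwise."""
    return {"success": 1.0, "busy": 0.0, "failure": -1.0}[outcome]

def update_q(q_values, channel, r, alpha=0.1, gamma=0.9):
    """Standard update: Q <- (1 - alpha) * Q + alpha * (r + gamma * V).

    V is taken as max over the user's own Q values (single-status assumption).
    """
    v_next = max(q_values)
    q_values[channel] = (1.0 - alpha) * q_values[channel] + alpha * (r + gamma * v_next)
    return q_values[channel]

def select_channel(q_values):
    """Pick the channel with the largest Q value (first one on ties)."""
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Starting from `q = [0.0, 0.0, 0.0]`, one successful access on channel 1 raises its Q value above the others, so `select_channel(q)` then returns 1; a later failure on that channel lowers the value again and another channel can take over.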
After unlicensed users decide to access the current channel through self-learning, the transmission is started with probability $f(y)$ based on the previous slot's channel status $y \in y_s$, $y_s = \{idle, busy, success, failure\}$.
During data transmission, users learn whether the transmission is successful by checking the ACK feedback. If the ACK is received, the transmission is considered successful; if not, it is a failure. Users judge whether the channel can be accessed in every time slot by spectrum sensing.
When the information collection is completed, unlicensed users will analyze the channel's status as follows.
- Idle: The sensing result shows that no user is accessing the channel and no data is being transmitted.
On the other hand, once a licensed user transmits successfully, its transmission in the following slots is not disturbed by unlicensed users. Therefore, during an 'on' cycle, conflicts emerge only before the first successful transmission of the licensed user. The average number of conflicts suffered by a licensed user in an 'on' cycle is denoted by $T_{col}$, which is independent of its service time length $T_{poc}$. If $q = 0$, which means $f(idle) = 0$, then only idle slots emerge during the 'off' cycle and we have $T_{col} = 0$. Otherwise, if $q > 0$ and $r = 1$, $T_{col} = +\infty$ when unlicensed users do not evade in case of conflict.
The main routine of our Q-Learning-based spectrum access can be given in Fig. 4.
The detailed process can be presented as follows.
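One way to read the per-slot routine described in this section is the following sketch for a single unlicensed user. It is an illustration under stated assumptions, not the authors' algorithm listing: the class `UnlicensedUser`, its method names and the default values of `alpha` and `gamma` are hypothetical, a single status is assumed for the Q table, idle slots are assumed to leave the Q values unchanged, and a user is assumed to start from the idle status after switching channels.

```python
import random

REWARD = {"success": 1.0, "busy": 0.0, "failure": -1.0}

class UnlicensedUser:
    def __init__(self, n_channels, theta, q, r, th_b,
                 alpha=0.1, gamma=0.9, rng=None):
        # Memory-based MAC protocol: f(busy) = 0, f(success) = 1 - theta
        self.f = {"idle": q, "busy": 0.0, "success": 1.0 - theta, "failure": r}
        self.q_values = [0.0] * n_channels    # one Q value per channel
        self.busy_count = [0] * n_channels    # b_ij for this user
        self.th_b = th_b
        self.alpha = alpha
        self.gamma = gamma
        self.rng = rng if rng is not None else random.Random()
        self.channel = 0
        self.status = "idle"

    def choose_channel(self):
        """Reselect only when the current channel's BUSY count exceeds th_B."""
        if self.busy_count[self.channel] > self.th_b:
            allowed = [j for j, b in enumerate(self.busy_count) if b <= self.th_b]
            if allowed:
                # Largest Q value among channels still below the threshold
                self.channel = max(allowed, key=lambda j: self.q_values[j])
                self.status = "idle"  # assumption: no memory on the new channel
        return self.channel

    def wants_to_transmit(self):
        """Transmit with probability f(y), y being the previous slot's status."""
        return self.rng.random() < self.f[self.status]

    def observe(self, transmitted, ack=False, sensed_busy=False):
        """Classify the slot, update b_ij and the Q value, return the status."""
        if transmitted:
            self.status = "success" if ack else "failure"
        else:
            self.status = "busy" if sensed_busy else "idle"
        if self.status == "busy":
            self.busy_count[self.channel] += 1
        if self.status in REWARD:
            j = self.channel
            target = REWARD[self.status] + self.gamma * max(self.q_values)
            self.q_values[j] += self.alpha * (target - self.q_values[j])
        return self.status
```

In a simulation loop, each slot would call `choose_channel()`, then `wants_to_transmit()`, and finally `observe(...)` with the ACK and sensing results supplied by the channel model, which is left out of this sketch.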

Numerical results
In this section, we carry out simulations on the Matlab platform to verify the performance of our dynamic spectrum access method in terms of channel usage rate and conflict probability under various parameters. In the tests, we consider 50 sensor nodes randomly distributed in 8 small mesh cells of the industrial field. Each cell has its own cluster head, and 20 licensed channels are allocated to the mesh cells. If the sensor nodes are located in their own cell, they can access the channels as licensed users. Otherwise, when they migrate to other cells and wish to use another cluster head, they act as unlicensed users. In Figs. 5 and 6, we give the channel usage rate for different slot numbers and different values of $th_B$, the threshold number of slots that unlicensed users should wait. When $th_B$ is fixed, the slot number is set to 5000. As shown in Fig. 5, with the increase of slot number, the channel usage rate becomes steady but does not converge. A higher slot threshold means more waiting time for unlicensed users. Too long a waiting time leads to a relatively low channel usage rate and a decrease in system transmission capacity. The unlicensed users need to ensure that no licensed users are present on the target channel within the given threshold time. If the channel is still idle after $th_B$ slots, unlicensed users access the channel.
Besides, we give comparisons in Figs. 7 and 8 to present the performance of our proposed method. In Figs. 7 and 8, the Q-learning method refers to our proposal. SDSA denotes the simplified dynamic spectrum access scheme, which enables the memory function for each sensor node. When the unlicensed users wish to access a channel, they recall and update their channel list to ascertain the situation. However, they do not have the self-learning ability. Aloha denotes that the unlicensed users use Aloha to communicate with each other before they begin to access a channel. All the methods mentioned above assume that the sensor nodes do not have perfect sensing capability and are organized in distributed mode in IIoT. The figures show that our scheme has a steady channel usage rate that outperforms the traditional SDSA and Aloha methods. Even though our proposal's complexity is relatively high, it can be easily realized, especially with the rapid development of mobile computing and cognitive science.

Conclusions
In this paper, we propose a Q-learning-based dynamic spectrum access method for IIoT that takes into account the characteristics of heterogeneous wireless sensor networks to enhance spectrum efficiency and reduce access conflict probability. The main contribution of this work is that we introduce a self-learning method to address the situation where the sensor nodes' sensing ability is imperfect and a distributed network mode is applied. Specifically, we devise a self-learning-based MAC protocol to assist unlicensed users in accessing spectrum in IIoT when only a single idle channel is available. Besides, for the case of multiple channels, we propose a Q-learning-based access algorithm in which the unlicensed users select the idle channel with the largest Q value through self-learning. The specific algorithm routine has been given. Numerical results show that the proposed algorithm achieves better channel access performance than the traditional simplified self-access protocol and the Aloha method.

Fig. 2 Mechanism of Q-Learning-based spectrum access algorithm

Fig. 6 Channel usage rate with different $th_B$

Fig. 7 Channel usage rate with various access schemes