An actor-critic reinforcement learning-based resource management in mobile edge computing systems

Original Article · International Journal of Machine Learning and Cybernetics

Abstract

As an effective tool, reinforcement learning (RL) has attracted great attention in the wireless communication field. In this paper, we investigate the offloading decision and resource allocation problem in mobile edge computing (MEC) systems based on RL methods. Different from the existing literature, our research focuses on improving mobile operators' revenue by maximizing the amount of offloaded tasks while decreasing energy expenditure and time delays. Considering the dynamic characteristics of the wireless environment, the above problem is modeled as a Markov decision process (MDP). Since the action space of the MDP consists of multidimensional continuous variables mixed with discrete variables, traditional RL algorithms are inapplicable. Therefore, an actor-critic (AC) algorithm with eligibility traces is proposed to solve the problem. The actor part introduces a parameterized normal distribution to generate the probabilities of continuous stochastic actions, and the critic part employs a linear approximator to estimate the value of states, based on which the actor part updates the policy parameters in the direction of performance improvement. Furthermore, an advantage function is designed to reduce the variance of the learning process. Simulation results indicate that the proposed algorithm finds the best strategy for maximizing the amount of tasks executed by the MEC server while decreasing energy consumption and time delays.
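To make the update structure concrete, the following is a minimal sketch of an advantage actor-critic update with eligibility traces in the spirit of the algorithm described above: a Gaussian (parameterized normal) actor for a continuous action and a linear critic. Everything here is illustrative; the feature map, toy environment, and hyperparameters are placeholders rather than the paper's system, and the discrete part of the paper's mixed action space is omitted.

```python
import numpy as np

# Minimal sketch of a one-step actor-critic update with eligibility traces:
# a Gaussian actor for a continuous action, a linear critic, and the TD error
# as the advantage estimate. All names and values here are hypothetical.

rng = np.random.default_rng(0)

STATE_DIM = 8                      # hypothetical feature dimension
GAMMA, LAMBDA = 0.95, 0.9          # discount and trace-decay factors
ALPHA_W, ALPHA_TH = 0.05, 0.005    # critic and actor step sizes

w = np.zeros(STATE_DIM)            # linear critic: V(s) = w @ phi(s)
th_mu = np.zeros(STATE_DIM)        # policy-mean parameters
th_sig = np.zeros(STATE_DIM)       # policy log-std parameters
z_w = np.zeros(STATE_DIM)          # eligibility traces
z_mu = np.zeros(STATE_DIM)
z_sig = np.zeros(STATE_DIM)

def phi(s):
    # State features; a real MEC system would encode task buffers,
    # channel gains, CPU frequencies, and so on.
    return s

def toy_env(s, a):
    # Stand-in dynamics and reward; the paper's MDP (offloading decisions,
    # power and frequency allocation) is replaced by a 1-D tracking toy.
    s_next = np.roll(s, 1)
    s_next[0] = np.tanh(a)
    return s_next, -(a - s.mean()) ** 2

s = rng.normal(size=STATE_DIM)
for t in range(1000):
    f = phi(s)
    mu = th_mu @ f
    sig = np.exp(np.clip(th_sig @ f, -3.0, 2.0))   # keep std positive, bounded
    a = rng.normal(mu, sig)                        # stochastic continuous action

    s_next, r = toy_env(s, a)

    # TD error, used directly as the advantage estimate to reduce variance.
    delta = r + GAMMA * (w @ phi(s_next)) - w @ f

    # Accumulate traces: the critic trace uses the state features, the actor
    # traces use the score function (gradient of the Gaussian log-likelihood).
    z_w = GAMMA * LAMBDA * z_w + f
    z_mu = GAMMA * LAMBDA * z_mu + (a - mu) / sig ** 2 * f
    z_sig = GAMMA * LAMBDA * z_sig + ((a - mu) ** 2 / sig ** 2 - 1.0) * f

    # Critic moves toward the TD target; actor ascends the policy gradient.
    w += ALPHA_W * delta * z_w
    th_mu += ALPHA_TH * delta * z_mu
    th_sig += ALPHA_TH * delta * z_sig

    s = s_next
```

Using the TD error as the advantage estimate is what keeps the variance of the policy-gradient update in check, as the abstract notes.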

References

  1. Al-Shuwaili A, Simeone O (2017) Energy-efficient resource allocation for mobile edge computing-based augmented reality applications. IEEE Wirel Commun Lett 6(3):398–401

  2. Barto AG, Sutton RS, Anderson CW (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans Syst Man Cybern SMC-13(5):834–846

  3. Burd TD, Brodersen RW (1996) Processor design for portable systems. J VLSI Signal Process Syst 13(3):203–221

  4. Dinh TQ, Tang J, La QD, Quek TQS (2017) Offloading in mobile edge computing: task allocation and computational frequency scaling. IEEE Trans Commun 65(8):3571–3584

  5. Ge C, Wang N, Foster G, Wilson M (2017) Toward QoE-assured 4K video-on-demand delivery through mobile edge virtualization with adaptive prefetching. IEEE Trans Multimed 19(10):2222–2237

  6. Gong J, Zhou S, Zhou Z, Niu Z (2017) Policy optimization for content push via energy harvesting small cells in heterogeneous networks. IEEE Trans Wirel Commun 16(2):717–729

  7. He Y, Zhao N, Yin H (2018) Integrated networking, caching, and computing for connected vehicles: a deep reinforcement learning approach. IEEE Trans Veh Technol 67(1):44–55

  8. Kim Y, Kwak J, Chong S (2015) Dual-side dynamic controls for cost minimization in mobile cloud computing systems. In: 2015 13th international symposium on modeling and optimization in mobile, ad hoc, and wireless networks (WiOpt), pp 443–450

  9. Kim Y, Kwak J, Chong S (2018) Dual-side optimization for cost-delay tradeoff in mobile edge computing. IEEE Trans Veh Technol 67(2):1765–1781

  10. Konda VR, Tsitsiklis JN (2003) On actor-critic algorithms. SIAM J Control Optim 42(4):1143–1166

  11. Kwak J, Kim Y, Lee J, Chong S (2015) DREAM: dynamic resource and task allocation for energy minimization in mobile cloud systems. IEEE J Sel Areas Commun 33(12):2510–2523

  12. Lakshminarayana S, Quek TQS, Poor HV (2014) Cooperation and storage tradeoffs in power grids with renewable energy resources. IEEE J Sel Areas Commun 32(7):1386–1397

  13. Lee G, Saad W, Bennis M, Mehbodniya A, Adachi F (2017) Online ski rental for on/off scheduling of energy harvesting base stations. IEEE Trans Wirel Commun 16(5):2976–2990

  14. Liang C, He Y, Yu FR, Zhao N (2017) Enhancing QoE-aware wireless edge caching with software-defined wireless networks. IEEE Trans Wirel Commun 16(10):6912–6925

  15. Mach P, Becvar Z (2017) Mobile edge computing: a survey on architecture and computation offloading. IEEE Commun Surv Tutor 19(3):1628–1656

  16. Mao Y, Zhang J, Letaief KB (2016) Dynamic computation offloading for mobile-edge computing with energy harvesting devices. IEEE J Sel Areas Commun 34(12):3590–3605

  17. Mao Y, Zhang J, Song SH, Letaief KB (2017) Stochastic joint radio and computational resource management for multi-user mobile-edge computing systems. IEEE Trans Wirel Commun 16(9):5994–6009

  18. Miao G, Himayat N, Li GY (2010) Energy-efficient link adaptation in frequency-selective channels. IEEE Trans Commun 58(2):545–554

  19. Rummery GA, Niranjan M (1994) On-line Q-learning using connectionist systems. Technical report, University of Cambridge, Cambridge

  20. Sanguanpuak T, Guruacharya S, Rajatheva N, Bennis M, Latva-Aho M (2017) Multi-operator spectrum sharing for small cell networks: a matching game perspective. IEEE Trans Wirel Commun 16(6):3761–3774

  21. Sardellitti S, Scutari G, Barbarossa S (2015) Joint optimization of radio and computational resources for multicell mobile-edge computing. IEEE Trans Signal Inf Process Over Netw 1(2):89–103

  22. Suto K, Nishiyama H, Kato N (2017) Postdisaster user location maneuvering method for improving the QoE guaranteed service time in energy harvesting small cell networks. IEEE Trans Veh Technol 66(10):9410–9420

  23. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge

  24. Sutton RS, Barto AG (2017) Reinforcement learning: an introduction. MIT Press, Cambridge

  25. Sutton RS, McAllester D, Singh S, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. In: Advances in neural information processing systems 12, pp 1057–1063

  26. Vogeleer KD, Memmi G, Jouvelot P, Coelho F (2013) The energy/frequency convexity rule: modeling and experimental validation on mobile devices. In: Proceedings of the international conference on parallel processing and applied mathematics (PPAM), Warsaw, Poland, pp 793–803

  27. Wang C, Liang C, Yu FR, Chen Q, Tang L (2017) Computation offloading and resource allocation in wireless cellular networks with mobile edge computing. IEEE Trans Wirel Commun 16(8):4924–4938

  28. Wang C, Yu FR, Liang C, Chen Q, Tang L (2017) Joint computation offloading and interference management in wireless cellular networks with mobile edge computing. IEEE Trans Veh Technol 66(8):7432–7445

  29. Watkins CJCH, Dayan P (1992) Q-learning. Mach Learn 8:279–292

  30. Wei Y, Yu FR, Song M, Han Z (2019) Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor-critic deep reinforcement learning. IEEE Internet Things J 6(2):2061–2073

  31. Wolfstetter E (1999) Topics in microeconomics: industrial organization, auctions, and incentives. Cambridge University Press, Cambridge

  32. Yang L, Cao J, Yuan Y, Li T, Han A, Chan A (2013) A framework for partitioning and execution of data stream applications in mobile cloud computing. ACM SIGMETRICS Perform Eval Rev 40(4):23–32

  33. You C, Huang K, Chae H, Kim BH (2017) Energy-efficient resource allocation for mobile-edge computation offloading. IEEE Trans Wirel Commun 16(3):1397–1411. https://doi.org/10.1109/TWC.2016.2633522

  34. Zhang Z, Wang R, Yu FR, Fu F, Yan Q (2019) QoS-aware transcoding for live streaming in edge-clouds aided HetNets: an enhanced actor-critic approach. IEEE Trans Veh Technol 68(11):11295–11308

  35. Zhang Z, Yu FR, Fu F, Yan Q, Wang Z (2018) Joint offloading and resource allocation in mobile edge computing systems: an actor-critic approach. In: 2018 IEEE global communications conference (GLOBECOM), pp 1–6

  36. Zhao P, Tian H, Qin C, Nie G (2017) Energy-saving offloading by jointly allocating radio and computational resources for mobile edge computing. IEEE Access 5:11255–11268

Author information

Corresponding author

Correspondence to Zhicai Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1

In the following, we show that the solution obtained by the actor-critic algorithm is globally optimal. First, we discuss the quasi-concavity of the utility function in two cases.

Case 1: \({T_n}(t) + {c_n}(t){\varLambda _n}(t) > D_n^{\max }(\varDelta t) = \varDelta t{f_n}(t)L_e^{ - 1}\). In this case, we have \({D_n}(t) = D_n^{\max }(\varDelta t)\) and \({H_n}(t) = {T_n}(t) + {c_n}(t){\varLambda _n}(t) - D_n^{\max }(\varDelta t)\). The utility function is rewritten as

$$\begin{aligned} {\varOmega _n}(t) = \varTheta _n {(t)} + {\varUpsilon _n}(t)+{\varXi _n(t)}, \end{aligned}$$
(45)

where \(\varTheta _n {(t)}= ({\rho _n} - {\upsilon _n}\xi f_n^2(t){L_e} - {\omega _n}){D_n}(t)\), \({\varUpsilon _n}(t) = -{{\upsilon _n}D_n^{dl}(t)(1 - {b_n}(t))}[{p_n}(t) + p_n^{cir}]/{R_n}(t)\), and \({\varXi _n}(t) = -{\omega _n}\{ {T_n}(t) + {c_n}(t){\varLambda _n}(t)\}\).

Definition 1

A function \(\mathcal F\), which maps from a convex set \(\mathrm I\) of real n-dimensional vectors to a real number, is called strictly quasi-concave if, for any \(x_1, x_2 \in \mathrm{I}\) with \({x_1} \ne {x_2}\), \(\mathcal F(\lambda {x_1} + (1 - \lambda ){x_2}) > \min \{ \mathcal F({x_1}),\mathcal F({x_2})\}\) for all \(\lambda \in (0,1)\) [31].

Based on Proposition C.9 in [31], \({\varUpsilon _n}(t)\) is strictly quasi-concave if and only if the upper contour sets

$$\begin{aligned} UC({\varUpsilon _n}(t),y) = \{ p_n(t) \in \mathrm I|{\varUpsilon _n}(t)\ge y\}, \end{aligned}$$
(46)

are convex for all \(y\in \mathbb {R}\). If \(y \ge 0\), the upper contour sets are empty; if \(y<0\), (46) is equivalent to \(UC({\varUpsilon _n}(t),y) = \{ {p_n}(t) > 0|y{R_n}(t) + {\upsilon _n}D_n^{dl}(t)(1 - {b_n}(t))[{p_n}(t) + p_n^{cir}] < 0\}\). Since \(R_n(t)\) is concave in \({p_n}(t)\) and \(y < 0\), the function \({\mathcal F^*} = y{R_n}(t) + {\upsilon _n}D_n^{dl}(t)(1 - {b_n}(t))[{p_n}(t) + p_n^{cir}]\) is convex in \({p_n}(t)\); hence \(UC({\varUpsilon _n}(t),y)\), as a sublevel set of a convex function, is convex. Based on the above analysis, \({\varUpsilon _n}(t)\) is strictly quasi-concave. Obviously, \(\varTheta _n {(t)}\) and \({\varXi _n}(t)\) are strictly concave. Therefore, \(\varOmega _n (t)\) is strictly quasi-concave.
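As a sanity check on this argument, the following snippet numerically verifies that every sampled upper contour set of \({\varUpsilon _n}(t)\) is an interval (hence convex) in \(p_n(t)\). A Shannon-type rate \(R_n(t) = W\log_2(1 + p_n(t)g/N_0)\) is assumed here, and all constants are illustrative placeholders rather than the paper's values.

```python
import numpy as np

# Illustrative constants only; the paper's parameter values are not used here.
W, g, N0 = 1e6, 1e-3, 1e-9          # bandwidth, channel gain, noise power
ups, D_dl, b, p_cir = 1.0, 1e5, 0.0, 0.1

def R(p):
    # A Shannon-type rate is assumed; the paper's exact R_n(t) may differ.
    return W * np.log2(1.0 + p * g / N0)

def upsilon(p):
    # Upsilon_n(t) as defined after Eq. (45), viewed as a function of p_n(t).
    return -ups * D_dl * (1.0 - b) * (p + p_cir) / R(p)

p = np.linspace(1e-3, 2.0, 20000)
vals = upsilon(p)

# A one-variable function is quasi-concave iff every upper contour set
# {p : Upsilon(p) >= y} is an interval; check this on a grid of levels y.
for y in np.linspace(vals.min(), vals.max(), 50, endpoint=False):
    idx = np.flatnonzero(vals >= y)
    assert idx.size == 0 or np.all(np.diff(idx) == 1), "contour set is not an interval"
print("all sampled upper contour sets are intervals, consistent with quasi-concavity")
```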

Case 2: \({T_n}(t) + {c_n}(t){\varLambda _n}(t) \le D_n^{\max }(\varDelta t)\). In this case, we have \({D_n}(t) = {T_n}(t) + {c_n}(t){\varLambda _n}(t)\) and \({H_n}(t) = 0\), which means that no tasks are left in the buffer at the tth time slot. The utility function is rewritten as

$$\begin{aligned} {\varOmega _n}(t) = {{\widehat{\varTheta }}}_n {(t)} + {{{\widehat{\varUpsilon }}} _n}(t), \end{aligned}$$
(47)

where \({{{\widehat{\varTheta }}} _n}(t) = {({\rho _n} - {\upsilon _n}\xi f_n^2(t){L_e})}D_n(t)\) and \({{{\widehat{\varUpsilon }}} _n}(t) = {\varUpsilon _n}(t)\). Similarly, it is easy to prove that \(\varOmega _n (t)\) is strictly quasi-concave.

According to Theorem 2, the policy-gradient-based actor-critic RL algorithm of this study is convergent. The actor part utilizes a parameterized Gaussian distribution to generate stochastic actions, and a locally optimal policy can be obtained by updating the parameters with the stochastic gradient method [24]. Since the utility function \(\varOmega _n (t)\) is strictly quasi-concave, as analyzed above, a local optimum of a strictly quasi-concave function coincides with the global optimum [18, 31].
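For completeness, the standard argument behind that last step: if \(x^\star\) were only a local maximizer of a strictly quasi-concave \(\mathcal F\) while some \(\hat{x}\) satisfied \(\mathcal F(\hat{x}) > \mathcal F(x^\star)\), then Definition 1 would give, for every \(\lambda \in (0,1)\),

$$\begin{aligned} \mathcal F(\lambda {x^\star } + (1 - \lambda )\hat{x}) > \min \{ \mathcal F({x^\star }),\mathcal F(\hat{x})\} = \mathcal F({x^\star }), \end{aligned}$$

and letting \(\lambda \rightarrow 1\) produces points arbitrarily close to \(x^\star\) with strictly larger utility, contradicting local optimality.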

Therefore, the solution obtained by the RL algorithm is globally optimal.

About this article

Cite this article

Fu, F., Zhang, Z., Yu, F.R. et al. An actor-critic reinforcement learning-based resource management in mobile edge computing systems. Int. J. Mach. Learn. & Cyber. 11, 1875–1889 (2020). https://doi.org/10.1007/s13042-020-01077-8
