
Pareto Deterministic Policy Gradients and Its Application in 6G Networks

Chapter in: Fundamentals of 6G Communications and Networking

Part of the book series: Signals and Communication Technology (SCT)


Abstract

In this chapter, we introduce a reinforcement learning (RL)-based approach to jointly optimize cell load balance and network throughput as a potential AI/ML-based use case for sixth-generation (6G) cellular systems, where inter-cell handover and massive MIMO antenna tilting are configured as the RL policy to learn. Our rationale for using RL is to circumvent the challenges of analytically modeling user mobility and network dynamics. We integrate vector rewards into multiple value networks and take RL actions via a separate policy network. We name this method Pareto deterministic policy gradients (PDPG). It is an actor-critic, model-free, deterministic policy algorithm that handles the coupled objectives with two merits: (1) it solves the optimization by leveraging the degrees of freedom of the vector reward, as opposed to choosing a handcrafted scalar reward; and (2) cross-validation over multiple policies can be significantly reduced. To be self-contained, an ideal static optimization-based brute-force search solver is included as the benchmark method. The comparison shows that the RL approach performs as well as this ideal strategy, even though the RL agent is constrained to limited environment observations and a lower action frequency, whereas the benchmark has full access to the user mobility.
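The core architectural idea of the abstract, one value network (critic) per reward component feeding a single deterministic actor, can be illustrated with a short sketch. The PyTorch snippet below is a minimal illustration under assumed state/action dimensions and an unweighted sum of the critics in the actor update; it omits the target networks, replay buffer, and exploration noise of a full DDPG-style agent and is not the chapter's exact PDPG implementation.

```python
# Minimal multi-critic / single-actor sketch in the spirit of PDPG (PyTorch).
# Dimensions, layer sizes, and the unweighted combination of critics are
# illustrative assumptions, not the chapter's exact algorithm.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_OBJ = 32, 8, 2   # assumed sizes (e.g., per-cell loads; CIO/tilt actions)

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

actor = mlp(STATE_DIM, ACTION_DIM)                                # deterministic policy mu(s)
critics = [mlp(STATE_DIM + ACTION_DIM, 1) for _ in range(N_OBJ)]  # one value network per reward component
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]

def update(s, a, r_vec, s_next, gamma=0.99):
    """One gradient step on a batch of transitions with an N_OBJ-dimensional vector reward."""
    with torch.no_grad():
        a_next = actor(s_next)
        targets = [r_vec[:, i:i + 1] + gamma * critics[i](torch.cat([s_next, a_next], dim=-1))
                   for i in range(N_OBJ)]
    # Critic updates: each value network regresses its own reward component.
    for i in range(N_OBJ):
        loss = ((critics[i](torch.cat([s, a], dim=-1)) - targets[i]) ** 2).mean()
        critic_opts[i].zero_grad(); loss.backward(); critic_opts[i].step()
    # Actor update: ascend the combined critic values (unweighted sum here).
    a_pi = actor(s)
    actor_loss = -sum(c(torch.cat([s, a_pi], dim=-1)).mean() for c in critics)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example usage with a random batch of 64 transitions:
B = 64
update(torch.randn(B, STATE_DIM), torch.randn(B, ACTION_DIM),
       torch.randn(B, N_OBJ), torch.randn(B, STATE_DIM))
```

Keeping one critic per objective preserves the vector reward, so different trade-offs between load balance and throughput can be explored without retraining against a newly handcrafted scalar reward.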



Author information

Correspondence to Zhou Zhou.

A Comparison: A Static Formulation

We now consider a static formulation of the joint optimization of load balancing and throughput maximization. To do so, we directly drop the expectation and the time constraint in (11) and set \(\gamma = 0\), which gives the following formulation:

$$\displaystyle \begin{aligned} {} \begin{aligned} &\max_{{\boldsymbol I}(t), {\boldsymbol b}(t) } \sum_n R_n(t) + \sum_n F_n(t)\\ &s.t. \quad \boldsymbol{I}(t)=\{I_{n, k}(t): I_{n,k}(t) \in \{0, 1\}, n \in N, k \in K\}\\ &\qquad {\boldsymbol b}(t) = \{b_{n}(t):b_{n}(t) \in \{\theta_0, \theta_1, \cdots, \theta_{M-1} \}, n \in N\}.\\ \end{aligned} \end{aligned} $$
(19)

Comparing this formulation to (11), we note the following differences and practical limitations:

  • Solving the static problem requires perfect knowledge of all \({p_{n,k}(b_n(t))}\) at every sampling time, which imposes a large user-feedback overhead.

  • Due to the integer constraints, the complexity is very high. Moreover, the BSs must solve for \({\boldsymbol I}(t)\) and \({\boldsymbol b}(t)\) in every time slot.

  • The formulation treats the user association as an optimization variable. However, the user association cannot be directly translated into the CIO values of A3 events, so it is not compatible with the handover operations of current cellular systems.

Therefore, we use the above formulation only to evaluate our RL algorithm in the simulations. Ideally, the static formulation yields the optimal solution at every time instant and can thus serve as an upper bound for the RL algorithm.

Algorithm 1 Relaxed brute-force for a small λ

A.1 Heuristic Brute-Force Solvers

Note that (19) can be equivalently written as

$$\displaystyle \begin{aligned} {} \begin{aligned} &\max_{{\boldsymbol I}(t), {\boldsymbol b}(t) } \sum_n F_n({\boldsymbol I}(t), {\boldsymbol b}(t))\\ &s.t. \quad \boldsymbol{I}(t)=\{I_{n, k}(t): I_{n,k}(t) \in \{0, 1\}, n \in N, k \in K\}\\ &\qquad {\boldsymbol b}(t) = \{b_{n}(t):b_{n}(t) \in \{\theta_0, \theta_1, \cdots, \theta_{M-1} \}, n \in N\}\\ &\qquad R_n({\boldsymbol I}(t), {\boldsymbol b}(t)) > \phi, n \in N \end{aligned} \end{aligned} $$
(20)

where \(\phi \) is a parameter introduced to avoid trivial solutions, such as all users being disconnected from the BSs (in which case all cell loads are zero). Since we do not have an analytical expression for the objective, brute-force search is our primary approach. However, the number of user association combinations is prohibitively large. To narrow the search region, we consider a heuristic approach to approximate

$$\displaystyle \begin{aligned} \max_{{\boldsymbol I}(t) } \sum_n F_n({\boldsymbol I}(t), {\boldsymbol b}(t)) \end{aligned} $$
(21)

for a given \({\boldsymbol b}(t)\). We determine the user association in a round-robin manner: each BS is allocated an equal number of users, and the users associated with each BS are chosen based on a ranking of their rates. Intuitively, this is a heuristic way to balance the throughput \(R_n({\boldsymbol I}(t), {\boldsymbol b}(t))\) across cells. The procedure is summarized in Algorithm 1.
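As a rough illustration of this round-robin rule, the sketch below assigns users given an assumed rate matrix rate[n, k] (the achievable rate of user k at BS n under the current tilt vector \({\boldsymbol b}(t)\)); the function name and the exact ordering and tie-breaking are illustrative and do not reproduce Algorithm 1 verbatim.

```python
# Equal-share association heuristic: each BS receives roughly K/N users,
# picking its best remaining user by rate in round-robin order over the BSs.
import numpy as np

def equal_share_association(rate: np.ndarray) -> np.ndarray:
    """Return I[n, k] in {0, 1} assigning each user to exactly one BS."""
    N, K = rate.shape
    I = np.zeros((N, K), dtype=int)
    unassigned = set(range(K))
    quota = int(np.ceil(K / N))          # equal share of users per BS
    for _ in range(quota):
        for n in range(N):               # round-robin over BSs
            if not unassigned:
                break
            # BS n takes its highest-rate user among those still unassigned
            k = max(unassigned, key=lambda u: rate[n, u])
            I[n, k] = 1
            unassigned.remove(k)
    return I
```

An outer loop would then enumerate candidate tilt vectors \({\boldsymbol b}(t)\), apply this association for each candidate, and keep the best-scoring pair, which is the "relaxed" brute-force search over tilts only.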

Algorithm 2 Relaxed brute-force for a fair λ

Alternatively, the inner loop for the user association assignment can be viewed as solving

$$\displaystyle \begin{aligned} \max_{{\boldsymbol I}(t) } \sum_n R_n({\boldsymbol I}(t), {\boldsymbol b}(t)) \end{aligned} $$
(22)

A heuristic approach is to assign each user to the BS with the maximum transmission power. This association strategy may break the cell load balance but avoids link failures. We can therefore combine the two heuristic association strategies to obtain Algorithm 2. In particular, the two strategies are mixed through a random binary decision whose threshold is proportional to the weight ratio between the two objectives, i.e., \({\lambda \over {1+\lambda }}\).
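One possible reading of this mixing step is sketched below. Here power[n, k] is assumed to be the power user k receives from BS n, the coin is flipped independently per user, and users drawn with probability \(\lambda/(1+\lambda)\) follow the strongest-BS rule while the rest are handled by equal_share_association() from the previous sketch; the per-user granularity and which rule gets which side of the coin are illustrative assumptions rather than the chapter's exact specification.

```python
# Mixed association: strongest-BS rule vs. equal-share rule, chosen per user
# by a biased coin with threshold lambda / (1 + lambda).
import numpy as np

def mixed_association(rate, power, lam, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    N, K = rate.shape
    I = np.zeros((N, K), dtype=int)
    p_strongest = lam / (1.0 + lam)                  # decision threshold from the weight ratio
    use_strongest = rng.random(K) < p_strongest      # one coin flip per user
    for k in np.flatnonzero(use_strongest):          # throughput-oriented rule
        I[np.argmax(power[:, k]), k] = 1
    balanced = np.flatnonzero(~use_strongest)        # load-balance-oriented rule
    if balanced.size:
        I[:, balanced] = equal_share_association(rate[:, balanced])
    return I

# Example: 4 BSs, 20 users, lambda = 0.5
rng = np.random.default_rng(0)
rate = rng.random((4, 20)); power = rng.random((4, 20))
I = mixed_association(rate, power, lam=0.5, rng=rng)
```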


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Zhou, Z., Xin, Y., Chen, H., Zhang, C., Liu, L., Yang, K. (2024). Pareto Deterministic Policy Gradients and Its Application in 6G Networks. In: Lin, X., Zhang, J., Liu, Y., Kim, J. (eds) Fundamentals of 6G Communications and Networking. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-031-37920-8_23


  • DOI: https://doi.org/10.1007/978-3-031-37920-8_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-37919-2

  • Online ISBN: 978-3-031-37920-8

  • eBook Packages: Engineering, Engineering (R0)
