
Q-learning Reward Propagation Method for Reducing the Transmission Power of Sensor Nodes in Wireless Sensor Networks



In wireless sensor networks (WSNs), sensor nodes and sink nodes communicate among themselves to collect and transmit data. Because of the volume of data transmitted, it is important to minimize the power consumed by this communication. Q-learning can be applied to find the optimal path between two nodes. However, Q-learning suffers from long learning times, meaning that the learning process must be conducted in advance for it to be applicable to WSNs. Many studies have proposed methods to decrease the learning time by reducing the state space or by updating more Q-values per update step. However, reducing the size of the Q-learning state space can prevent the optimum action from being executed, because the correct state may no longer be available. Other methods utilize additional information from a teacher, control the time flow of Q-learning by considering the number of updates for each Q-value, or use a prioritized queue in the update procedure. Such methods are not well suited to the real-world environment and complexity of WSNs. A more suitable approach is to update the Q-values iteratively, and a combination of such updating methods may further reduce the learning time. This paper proposes a reward propagation method (RPM), i.e., a method that integrates several updating algorithms to propagate the reward of the goal state to more Q-values, thereby reducing the learning time required for Q-learning. By updating not only the Q-value of the last visited state and executed action, but also the Q-values of unvisited states and unexecuted actions, the learning time can be reduced considerably. This method integrates the following three Q-value updating algorithms. First, we incorporate the concept of \(\text{ Q }(\lambda )\)-learning, which iteratively propagates the terminal reward back to the Q-values of the states visited before the terminal reward was received.
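The \(\text{ Q }(\lambda )\) idea described above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: the constants, the `update_with_traces` function, and the replacing-trace variant are our own assumptions.

```python
ALPHA, GAMMA, LAM = 0.5, 0.9, 0.8  # learning rate, discount, trace decay (assumed values)

def update_with_traces(q, trace, s, a, r, s_next, actions):
    """One Q(lambda)-style step on tabular dictionaries.

    The TD error is applied to every state-action pair that still carries
    eligibility, not only the most recently visited one, so a terminal
    reward propagates back along the whole visited path in one sweep.
    """
    # Greedy bootstrap from the successor state (zero at a terminal state).
    best_next = max(q.get((s_next, b), 0.0) for b in actions) if s_next is not None else 0.0
    delta = r + GAMMA * best_next - q.get((s, a), 0.0)
    trace[(s, a)] = 1.0                       # replacing eligibility trace
    for key, e in trace.items():
        q[key] = q.get(key, 0.0) + ALPHA * delta * e
        trace[key] = GAMMA * LAM * e          # decay eligibility each step
```

After a terminal reward, every eligible pair receives a share of the update proportional to its decayed trace, which is exactly why the reward reaches earlier states without extra episodes.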
If the terminal reward were also propagated to the Q-values of unvisited states, the learning time could be reduced further. Second, the concept of reward propagation is expanded. An earlier one-step reward update method updates the Q-values only of states from which the terminal reward can be received by executing a single action. If the terminal reward is instead propagated to states from which the reward can only be reached by executing more than one action, more Q-values can be updated. Third, we apply a form of fuzzy Q-learning with eligibility, which updates the Q-values of unexecuted actions. Although this method is not used directly, its concept of updating the Q-values of unexecuted actions is applied. To investigate how much the proposed method reduces the learning time, we compare it with conventional Q-learning. Given that the optimal-path problem in WSNs can be remodeled as a reinforcement learning problem, we use a hunter–prey capture game involving two agents in a grid environment. RPM, playing the role of the prey, learns to receive the terminal reward and escape from the hunter. The hunter randomly selects one of two movement policies and executes one action based on the selected policy. To measure the difference between the success rates of conventional Q-learning and RPM, an identical environment and identical parameters are used for both methods. We conduct three experiments: the first compares RPM with conventional Q-learning, the second tests the scalability of RPM, and the third evaluates the performance of RPM in a differently configured environment. The RPM results are compared with those of conventional Q-learning in terms of the success rate of receiving the terminal reward, which provides a measure of the difference in required learning times. The greatest reduction in learning time is obtained with a \(10 \times 10\) grid, no obstacles, and 3,000 learning episodes.
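The expanded reward propagation described above, seeding the Q-values of states (visited or not) that lie within a few actions of the goal, might be sketched as follows on a grid. The function name `propagate_reward`, the `scope` parameter, and the breadth-first seeding with `max` are hypothetical details for illustration, not the paper's exact algorithm.

```python
GAMMA = 0.9  # discount factor (assumed)
# Grid actions as (dx, dy) displacements.
ACTIONS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}

def propagate_reward(q, goal, reward, scope, grid_size):
    """Seed Q-values of all states within `scope` actions of `goal`.

    Expanding outward one action per iteration, set the Q-value of every
    (state, action) pair that moves one step closer to the goal, discounting
    the terminal reward once per extra step. States need not have been
    visited for their Q-values to be updated.
    """
    frontier = {goal}
    value = reward
    for _ in range(scope):
        value *= GAMMA                      # one more action away: discount once
        next_frontier = set()
        for (gx, gy) in frontier:
            for action, (dx, dy) in ACTIONS.items():
                sx, sy = gx - dx, gy - dy   # state from which `action` reaches (gx, gy)
                if 0 <= sx < grid_size and 0 <= sy < grid_size:
                    key = ((sx, sy), action)
                    q[key] = max(q.get(key, 0.0), value)  # keep the best estimate
                    next_frontier.add((sx, sy))
        frontier = next_frontier
    return q
```

Widening `scope` updates exponentially more Q-values per terminal reward, which mirrors the trade-off noted in the scalability discussion: more propagation shortens learning but raises the per-episode computation.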
In these episodes, the success rate of RPM is 232 % higher than that of conventional Q-learning. We perform two experiments to verify the scalability of RPM by changing the size of the environment. In a \(12 \times 12\) grid environment, RPM exhibits a maximum success rate 176 % higher than that of conventional Q-learning. However, as the environment grows, the effect of propagating the terminal reward diminishes, so the improvement in the success rate over conventional Q-learning is smaller than in the \(10 \times 10\) grid environment. In a \(14 \times 14\) grid environment, the relative effect of RPM declines further, giving a maximum success rate around 138 % higher than that of conventional Q-learning. The scalability experiments show that enlarging the environment without widening the scope of terminal reward propagation lowers the success rate of RPM; nevertheless, RPM still outperforms conventional Q-learning. An improvement in the peak success rate of between 138 % and 232 % can greatly reduce the learning time in difficult environments. Because the computational cost of terminal reward propagation grows exponentially with its scope, the learning time can easily be reduced further by widening the scope whenever this cost is not a critical issue. Finally, we compare the success rates when obstacles are deployed within the \(10 \times 10\) environment. The obstacles naturally degrade the performance of both RPM and conventional Q-learning, but the proposed method still outperforms conventional Q-learning by about 20 % to 59 %. Our experimental results show that the peak success rate of the proposed method is consistently superior to that of conventional Q-learning.
The size of the environment and the number of obstacles both affect the improvement of RPM over conventional Q-learning: the improvement is proportional to the number of additional Q-values updated. As the environment grows, the proportion of all Q-values that receive extra updates decreases, and the effect of RPM diminishes accordingly. Likewise, as the number of obstacles increases, fewer Q-values are updated, which also reduces the effect of RPM. Because RPM shortens the learning time, Q-learning becomes applicable to diverse fields in which the learning-time problem has hindered its use in WSNs. Although many studies have addressed the learning-time problem of Q-learning, the demand for shorter learning times remains urgent. Given that reducing learning time by updating more Q-values is an active line of research, we expect RPM to be extended further by incorporating additional update concepts.






This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (2011-0011266).

Author information

Correspondence to Kyungeun Cho.


Cite this article

Sung, Y., Ahn, E. & Cho, K. Q-learning Reward Propagation Method for Reducing the Transmission Power of Sensor Nodes in Wireless Sensor Networks. Wireless Pers Commun 73, 257–273 (2013). https://doi.org/10.1007/s11277-013-1235-4



  • Wireless Sensor Network
  • Transmission Power Problem
  • Q-learning
  • Reward