
A novel estimator based learning automata algorithm

Published in: Applied Intelligence

Abstract

Reinforcement learning is one of the subjects of Artificial Intelligence, and learning automata have been considered among the most powerful tools in this research area. In the evolution of learning automata, the rate of convergence is the primary goal when designing a learning algorithm. In this paper, we propose a deterministic estimator based learning automaton (LA) in which the estimate of each action is the upper bound of a confidence interval, rather than the Maximum Likelihood Estimate (MLE) widely used in existing estimator LA schemes. The philosophy is to assign more confidence to actions that have been selected only a few times, so that the automaton is encouraged to explore the uncertain actions. Once all the actions have been fully explored, the automaton behaves just like the Generalized Pursuit Algorithm. A refined analysis is presented to show the 𝜖-optimality of the proposed algorithm. Extensive simulations demonstrate that the presented LA converges faster than any deterministic estimator learning automaton reported to date. Moreover, we extend our algorithm to the stochastic estimator schemes; the extended LA achieves a significant performance improvement over the current state-of-the-art learning automata algorithm, especially in complex and confusing environments.
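
The core of the scheme, as the abstract describes it, is to replace the MLE of each action's reward probability with the upper bound of a confidence interval, so that rarely sampled actions look optimistic and get explored. As a minimal sketch of that idea (in Python, using a Hoeffding-style bound of our own choosing rather than the paper's exact interval), a per-action estimate could be computed as follows:

```python
import math

def optimistic_estimate(reward_count, selection_count, total_selections):
    """Upper-confidence-bound style estimate of an action's reward probability.

    A rarely selected action gets a wide interval and hence a high estimate,
    which encourages exploration; once an action has been selected many times
    the bonus vanishes and the estimate approaches the empirical mean (the MLE).
    The Hoeffding-style bonus below is illustrative only, not the exact
    confidence interval derived in the paper.
    """
    if selection_count == 0:
        return 1.0  # untried actions are maximally optimistic
    mle = reward_count / selection_count
    bonus = math.sqrt(2.0 * math.log(total_selections) / selection_count)
    return min(1.0, mle + bonus)
```

Once every action has been sampled often enough, the bonus term becomes negligible and the estimates reduce to the MLEs, which is consistent with the abstract's remark that the automaton then behaves like the Generalized Pursuit Algorithm.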


Notes

  1. A larger value of γ indicates that the environment is more complex and confusing, and vice versa. However, as mentioned in Section 2.3, it is hard to obtain a proper γ in practical applications. It is equally unreasonable to assume such prior knowledge is available in the simulations, so the comparison is somewhat unfair. Our algorithm is extended in the next section to allow a fair comparison with SE_RI.


Acknowledgments

This work is funded by the National Science Foundation of China (61271316), the 973 Program of China (2010CB731403, 2010CB731406, 2013CB329603, 2013CB329605), the Key Laboratory for Shanghai Integrated Information Security Management Technology Research, and the Chinese National Engineering Laboratory for Information Content Analysis Technology.

Author information

Correspondence to Shenghong Li.

Appendix

A Generation Of The New Environments

Environments E 100−1 and E 100−2 have 100 actions each, and environment E 200 has 200 actions. The actions' reward probabilities are randomly generated under certain restrictions.

In E 100−1, the reward probabilities of all actions are drawn from a uniform distribution on [0, 0.7], after which the 14th action's reward probability is set to 0.8. In the same way, the reward probabilities of E 100−2 are drawn from a uniform distribution on [0, 0.75], and the 14th action's reward probability is then set to 0.8. The reward probabilities of E 200 are drawn from a uniform distribution on [0, 0.6], and the 99th action's reward probability is then set to 0.8.
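
As a concrete illustration of this generation procedure (a minimal sketch; the use of Python/NumPy, the function name, and the 0-based indexing are our own assumptions, not from the paper):

```python
import numpy as np

def generate_environment(n_actions, upper, best_index, best_reward=0.8, seed=None):
    """Draw reward probabilities uniformly from [0, upper], then force one
    designated action to the highest reward probability so that the optimal
    action is unique."""
    rng = np.random.default_rng(seed)
    probs = rng.uniform(0.0, upper, size=n_actions)
    probs[best_index] = best_reward
    return probs

# The three environments described above; indices are 0-based here,
# so the "14th action" is index 13 and the "99th action" is index 98.
E_100_1 = generate_environment(100, 0.70, best_index=13)
E_100_2 = generate_environment(100, 0.75, best_index=13)
E_200   = generate_environment(200, 0.60, best_index=98)
```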

Note that a randomly generated environment may end up with two identical highest reward probabilities, or with two highest reward probabilities that are very close to each other. In such cases, the environment is either indistinguishable or excessively complex. To prevent this, we set a single action's reward probability higher than all of the randomly generated ones, as described in the previous paragraph. This guarantees a sufficiently large gap between the two highest reward probabilities, so that the optimal action is unique and clearly identifiable.

B Complexity Validation Of The Generated Environments

Next we examine the generated environments to make sure they are complex and confusing enough. The actions’ reward probabilities of the three generated environments are listed as follows:

E 100−1 = {0.6660, 0.6358, 0.6223, 0.6732, 0.3667, 0.2940, 0.3470, 0.5296, 0.5429, 0.2126, 0.2234, 0.4816, 0.5310, 0.8000, 0.6443, 0.1880, 0.6366, 0.1041, 0.5364, 0.2806, 0.1249, 0.2224, 0.1024, 0.2570, 0.3813, 0.0943, 0.3239, 0.6444, 0.0286, 0.2413, 0.4858, 0.0315, 0.1729, 0.4617, 0.2190, 0.4679, 0.4060, 0.6011, 0.3666, 0.5703, 0.4128, 0.1013, 0.1030, 0.1659, 0.1388, 0.0878, 0.4212, 0.0811, 0.4730, 0.2900, 0.6361, 0.5439, 0.5791, 0.2008, 0.0478, 0.4847, 0.1652, 0.1375, 0.4146, 0.1701, 0.5988, 0.3715, 0.0421, 0.1232, 0.5310, 0.5912, 0.3500, 0.1727, 0.1364, 0.3071, 0.3690, 0.5637, 0.4605, 0.2530, 0.2204, 0.5793, 0.5839, 0.3589, 0.3086, 0.2634, 0.4841, 0.4797, 0.4425, 0.4909, 0.4329, 0.0540, 0.4848, 0.5616, 0.6947, 0.4466, 0.3668, 0.1101, 0.4345, 0.5792, 0.3192, 0.5665, 0.0412, 0.2009, 0.3693, 0.0834}

E 100−2 = {0.0447, 0.5115, 0.0318, 0.0536, 0.3912, 0.0725, 0.6136, 0.6132, 0.5418, 0.1124, 0.4947, 0.3889, 0.7297, 0.8000, 0.6002, 0.3403, 0.3243, 0.6190, 0.0626, 0.0999, 0.1300, 0.2932, 0.6235, 0.6025, 0.0454, 0.2994, 0.3952, 0.3126, 0.4926, 0.4710, 0.2190, 0.3237, 0.0116, 0.7380, 0.1254, 0.0797, 0.2793, 0.1486, 0.3673, 0.2546, 0.7137, 0.6902, 0.0395, 0.5534, 0.2018, 0.3171, 0.4109, 0.7071, 0.3133, 0.7373, 0.2261, 0.5258, 0.4998, 0.4043, 0.5236, 0.4999, 0.1336, 0.0960, 0.7493, 0.1283, 0.0245, 0.4209, 0.6614, 0.5019, 0.1428, 0.2767, 0.3455, 0.7362, 0.1173, 0.6416, 0.4836, 0.2822, 0.1432, 0.3212, 0.3615, 0.0905, 0.4421, 0.1696, 0.2885, 0.4372, 0.1889, 0.2178, 0.4628, 0.1990, 0.6183, 0.7370, 0.5477, 0.2579, 0.4381, 0.0808, 0.6797, 0.6597, 0.6133, 0.1955, 0.4458, 0.0169, 0.3189, 0.2345, 0.1211, 0.1341}

E 200 = {0.2537, 0.0565, 0.3591, 0.2826, 0.4176, 0.4199, 0.3831, 0.0202, 0.0413, 0.1918, 0.3185, 0.3927, 0.2446, 0.4920, 0.4310, 0.5812, 0.3188, 0.1951, 0.0634, 0.3666, 0.4673, 0.2541, 0.0545, 0.1599, 0.0922, 0.1686, 0.2641, 0.3163, 0.2745, 0.5252, 0.3108, 0.5662, 0.3826, 0.5746, 0.1444, 0.4057, 0.1734, 0.4031, 0.4171, 0.0408, 0.1529, 0.1344, 0.4007, 0.5066, 0.2067, 0.4683, 0.4052, 0.0040, 0.3613, 0.2321, 0.5496, 0.0007, 0.2775, 0.2546, 0.2765, 0.4621, 0.1935, 0.4708, 0.2828, 0.0215, 0.1055, 0.4331, 0.2841, 0.0916, 0.2047, 0.3644, 0.1150, 0.4431, 0.1457, 0.5505, 0.1614, 0.4593, 0.1132, 0.1725, 0.0547, 0.3457, 0.4100, 0.3280, 0.2554, 0.3867, 0.3886, 0.4074, 0.3815, 0.5671, 0.1254, 0.4256, 0.1417, 0.0716, 0.3644, 0.2701, 0.2752, 0.3972, 0.4622, 0.2101, 0.3972, 0.2497, 0.5052, 0.4998, 0.8000, 0.3681, 0.3493, 0.3244, 0.5220, 0.1589, 0.1908, 0.0715, 0.5639, 0.3873, 0.2877, 0.3836, 0.3268, 0.3884, 0.3263, 0.4326, 0.3135, 0.5962, 0.1312, 0.0635, 0.0658, 0.0382, 0.2427, 0.2690, 0.2195, 0.4581, 0.3767, 0.4632, 0.5597, 0.5836, 0.1152, 0.0833, 0.4178, 0.0563, 0.3152, 0.3182, 0.5167, 0.2909, 0.2361, 0.4029, 0.4448, 0.3120, 0.2086, 0.0900, 0.3517, 0.1573, 0.0267, 0.4530, 0.1457, 0.2654, 0.4127, 0.2155, 0.4418, 0.2368, 0.4100, 0.4224, 0.2654, 0.0117, 0.1985, 0.2546, 0.1622, 0.1182, 0.4930, 0.2580, 0.5327, 0.2347, 0.4615, 0.2381, 0.4851, 0.4530, 0.2264, 0.1296, 0.4742, 0.5696, 0.1965, 0.4028, 0.2632, 0.5001, 0.4613, 0.1004, 0.5172, 0.5939, 0.3087, 0.5306, 0.3528, 0.0929, 0.1199, 0.2442, 0.4492, 0.4954, 0.4740, 0.1911, 0.3204, 0.0540, 0.0670, 0.0818, 0.4072, 0.2971, 0.1138, 0.2970, 0.0886, 0.0330}

The differences between the two largest reward probabilities of each environment are 0.1053, 0.0507 and 0.2038, respectively.

The three newly generated environments are considered “more complex and confusing” for the following reasons. First, the number of available actions is far larger than in the five initial environments, i.e., the automaton has more options to choose from, which makes the environment more complex in nature. Second, there are many distracters (actions whose reward probabilities are close to the second-highest probability, the second-highest included) that make the environment more confusing. For example, in E 100−1 the optimal action is action 14, with a reward probability of 0.8, while the second-best action (action 89) has a reward probability of 0.6947. Beyond that, two more actions, action 1 and action 4, have reward probabilities rather close to that of action 89, namely 0.6660 and 0.6732, respectively, as illustrated in Fig. 4. In this case, the automaton must sample all the actions sufficiently to distinguish the optimal action from the distracters.
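
These two properties are straightforward to check programmatically. The sketch below (our own illustration; the 0.05 margin used to count distracters is arbitrary and not taken from the paper) computes the gap between the two largest reward probabilities and the number of near-second-best actions:

```python
def complexity_summary(probs, distracter_margin=0.05):
    """Return the gap between the two largest reward probabilities and the
    number of distracters, i.e. actions whose reward probability lies within
    `distracter_margin` of the second-best one (the second-best included)."""
    ranked = sorted(probs, reverse=True)
    best, second = ranked[0], ranked[1]
    gap = best - second
    distracters = sum(1 for p in probs if second - distracter_margin <= p <= second)
    return gap, distracters
```

Applied to the listed environments, the gaps are exactly the 0.1053, 0.0507 and 0.2038 reported above.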

Fig. 4 Reward probability distribution of the three generated environments

As a result, these environments are described as “more complex and confusing” environments.

Cite this article

Ge, H., Jiang, W., Li, S. et al. A novel estimator based learning automata algorithm. Appl Intell 42, 262–275 (2015). https://doi.org/10.1007/s10489-014-0594-1
