Abstract
Reinforcement learning is a branch of Artificial Intelligence, and learning automata are among the most powerful tools in this research area. In the evolution of learning automata, the rate of convergence has been the primary goal of algorithm design. In this paper, we propose a deterministic-estimator-based learning automaton (LA) in which the estimate of each action is the upper bound of a confidence interval, rather than the Maximum Likelihood Estimate (MLE) that is widely used in existing estimator LA schemes. The philosophy is to assign more confidence to actions that have been selected only a few times, so that the automaton is encouraged to explore uncertain actions. Once all actions have been fully explored, the automaton behaves just like the Generalized Pursuit Algorithm. A refined analysis is presented to establish the 𝜖-optimality of the proposed algorithm. Extensive simulations demonstrate that the presented LA converges faster than any deterministic-estimator learning automaton reported to date. Moreover, we extend our algorithm to stochastic estimator schemes. The extended LA achieves a significant performance improvement over the current state-of-the-art learning automata, especially in complex and confusing environments.
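The abstract's core idea, scoring each action by the upper bound of a confidence interval instead of its MLE, can be sketched as follows. This is a minimal illustration only: the exact confidence interval used in the paper is not reproduced here, and a UCB1-style bound (mean plus an exploration bonus) is substituted as an assumption; the function name `ucb_estimate` is ours.

```python
import math

def ucb_estimate(successes, pulls, total_pulls):
    """Upper confidence bound on an action's reward probability.

    Illustrative only: the paper's exact interval is not reproduced;
    a UCB1-style bonus term is used here as a stand-in.
    """
    if pulls == 0:
        return 1.0  # unexplored actions get maximal confidence
    mle = successes / pulls  # the MLE used by classical estimator LA
    bonus = math.sqrt(2.0 * math.log(total_pulls) / pulls)
    return min(1.0, mle + bonus)

# A rarely sampled action gets an estimate above its MLE,
# encouraging the automaton to explore it.
assert ucb_estimate(1, 2, 100) > 0.5
```

As the sample count of an action grows, the bonus term shrinks and the estimate approaches the MLE, so the automaton gradually reverts to pursuit-style behaviour.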
Notes
A larger value of γ indicates a more complex and confusing environment, and vice versa. However, as mentioned in Section 2.3, it is hard to obtain a proper γ in practical applications. It is equally unreasonable to assume this prior knowledge in the simulations, so the comparison is somewhat unfair. Our algorithm is extended in the next section to make a fair comparison with SE_RI.
References
Agache M, Oommen BJ (2002) Generalized pursuit learning schemes: new families of continuous and discretized learning automata. IEEE Trans Syst Man Cybern B Cybern 32(6):738–749
Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press
Crofts AE (1982) On a property of the F distribution. Trabajos de Estadística y de Investigación Operativa 33(2):110–111
Fathy F, Salek N, Masoudi Y, Laleh E (2013) Distributing of patterns in cutter machines boards using learning automata. In: International Conference on Communication Systems and Network Technologies (CSNT), 2013, pp 774–777
Jiang W, Zhao CL, Li SH, Chen L (2014) A new learning automata based approach for online tracking of event patterns. Neurocomputing 137:205–211
Krishna PV, Misra S, Joshi D, Obaidat MS (2013) Learning automata based sentiment analysis for recommender system on cloud. In: International Conference on Computer, Information and Telecommunication Systems (CITS), 2013, IEEE, pp 1–5
Lanctôt JK, Oommen BJ (1992) Discretized estimator learning automata. IEEE Trans Syst Man Cybern 22(6):1473–1483
Leemis LM, Trivedi KS (1996) A comparison of approximate interval estimators for the Bernoulli parameter. The American Statistician 50(1):63–68
Lukacs E (1970) Characteristic functions, vol 4. Griffin, London
Martin R, Tilak O, et al. (2012) On 𝜖-optimality of the pursuit learning algorithm. J Appl Probab 49(3):795–805
Misra S, Krishna P, Saritha V, Obaidat M (2013) Learning automata as a utility for power management in smart grids. IEEE Commun Mag 51(1):98–104
Misra S, Krishna P, Kalaiselvan K, Saritha V, Obaidat M (2014) Learning automata-based QoS framework for cloud IaaS. IEEE Trans Netw Serv Manag 99:1–10
Moradabadi B, Beigy H (2013) A new real-coded Bayesian optimization algorithm based on a team of learning automata for continuous optimization. Genet Program Evolvable Mach:1–25
Narendra KS, Thathachar M (1974) Learning automata - a survey. IEEE Trans Syst Man Cybern 4:323–334
Narendra KS, Thathachar MA (2012) Learning automata: an introduction. Courier Dover Publications
Oommen BJ, Agache M (2001) Continuous and discretized pursuit learning schemes: various algorithms and their comparison. IEEE Trans Syst Man Cybern B Cybern 31(3):277–287
Oommen BJ, Lanctôt JK (1990) Discretized pursuit learning automata. IEEE Trans Syst Man Cybern 20(4):931–938
Papadimitriou GI (1994) A new approach to the design of reinforcement schemes for learning automata: stochastic estimator learning algorithms. IEEE Trans Knowl Data Eng 6(4):649–654
Papadimitriou GI, Sklira M, Pomportsis AS (2004) A new class of ε-optimal learning automata. IEEE Trans Syst Man Cybern B Cybern 34(1):246–254
Rasouli N, Meybodi M, Morshedlou H (2013) Virtual machine placement in cloud systems using learning automata. In: 13th Iranian conference on fuzzy systems (IFSC), 2013, pp 1–5
Rezvanian A, Rahmati M, Meybodi MR (2014) Sampling from complex networks using distributed learning automata. Physica A: Stat Mech Appl 396:224–234
Sastry P (1985) Systems of learning automata: estimator algorithms applications. PhD thesis, Dept of Electrical Engineering, Indian Institute of Science, Bangalore
Thathachar M, Oommen B (1979) Discretized reward-inaction learning automata. J Cybern Inf Sci 2(1):24–29
Thathachar M, Sastry P (1985) A new approach to the design of reinforcement schemes for learning automata. IEEE Trans Syst Man Cybern 1:168–175
Tsetlin M (1973) Automaton theory and modeling of biological systems. Academic Press
Tsetlin ML (1961) On the behavior of finite automata in random media. Avtomatika i Telemekhanika 22:1345–1354
Varshavskii V, Vorontsova I (1963) On the behavior of stochastic automata with variable structure. Autom Remote Control 24(3):327
Yazidi A, Granmo OC, Oommen B (2013) Learning-automaton-based online discovery and tracking of spatiotemporal event patterns. IEEE Trans Cybern 43(3):1118–1130
Zhang J, Lina N, Chen X, Shangce G, Zheng T (2012) Inertial estimator learning automata. IEICE Trans Fundam Electron Commun Comput Sci 95(6):1041–1048
Zhang X, Granmo OC, Oommen BJ (2013a) On incorporating the paradigms of discretization and Bayesian estimation to create a new family of pursuit learning automata. Appl Intell 39(4):782–792
Zhang X, Granmo OC, Oommen BJ, Jiao L (2013b) On using the theory of regular functions to prove the 𝜖-optimality of the continuous pursuit learning automaton. In: Recent Trends in Applied Artificial Intelligence. Springer, pp 262–271
Zhong W, Xu Y, Wang J, Li D, Tianfield H (2014) Adaptive mechanism design and game theoretic analysis of auction-driven dynamic spectrum access in cognitive radio networks. EURASIP J Wirel Commun Netw 2014(1):44
Acknowledgments
This work is funded by the National Science Foundation of China (61271316), the 973 Program of China (2010CB731403, 2010CB731406, 2013CB329603, 2013CB329605), the Key Laboratory for Shanghai Integrated Information Security Management Technology Research, and the Chinese National Engineering Laboratory for Information Content Analysis Technology.
Appendices
A Generation of the New Environments
Environments E 100−1 and E 100−2 have 100 actions each, and environment E 200 has 200 actions. The actions' reward probabilities are randomly generated under some restrictions.
In E 100−1, the reward probabilities of all actions are drawn from a uniform distribution on [0, 0.7], and the 14th action's reward probability is then set to 0.8. In the same way, the reward probabilities of E 100−2 are drawn from a uniform distribution on [0, 0.75], and the 14th action's reward probability is then set to 0.8. The reward probabilities of E 200 are drawn from a uniform distribution on [0, 0.6], and the 99th action's reward probability is then set to 0.8.
It is noted that a randomly generated environment may have two identical highest reward probabilities, or two highest reward probabilities that are very close to each other. In such cases, the environment is either indistinguishable or too complex. To prevent this from happening, we set a single action's reward probability higher than the randomly generated ones, as described in the previous paragraph. This ensures that the difference between the two highest reward probabilities is large enough, so that the optimal action is unique and more evident.
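The generation procedure above can be sketched in a few lines. This is a minimal sketch under the stated scheme; the function name, seed handling, and default arguments are ours, not the paper's.

```python
import random

def generate_environment(n_actions, cap, best_index, best_reward=0.8, seed=None):
    """Draw reward probabilities uniformly on [0, cap], then force a
    unique optimal action at position best_index (1-indexed, as in the
    text). Helper name and seed parameter are illustrative."""
    rng = random.Random(seed)
    probs = [rng.uniform(0.0, cap) for _ in range(n_actions)]
    probs[best_index - 1] = best_reward
    return probs

# E 100-1 style: 100 actions, uniform on [0, 0.7], action 14 set to 0.8
env = generate_environment(100, 0.7, best_index=14)
assert env[13] == 0.8 and max(env) == 0.8
```

Because the cap is strictly below 0.8, the forced action is guaranteed to be the unique optimum.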
B Complexity Validation of the Generated Environments
Next, we examine the generated environments to make sure they are sufficiently complex and confusing. The reward probabilities of the three generated environments are listed as follows:
E 100−1 = {0.6660, 0.6358, 0.6223, 0.6732, 0.3667, 0.2940, 0.3470, 0.5296, 0.5429, 0.2126, 0.2234, 0.4816, 0.5310, 0.8000, 0.6443, 0.1880, 0.6366, 0.1041, 0.5364, 0.2806, 0.1249, 0.2224, 0.1024, 0.2570, 0.3813, 0.0943, 0.3239, 0.6444, 0.0286, 0.2413, 0.4858, 0.0315, 0.1729, 0.4617, 0.2190, 0.4679, 0.4060, 0.6011, 0.3666, 0.5703, 0.4128, 0.1013, 0.1030, 0.1659, 0.1388, 0.0878, 0.4212, 0.0811, 0.4730, 0.2900, 0.6361, 0.5439, 0.5791, 0.2008, 0.0478, 0.4847, 0.1652, 0.1375, 0.4146, 0.1701, 0.5988, 0.3715, 0.0421, 0.1232, 0.5310, 0.5912, 0.3500, 0.1727, 0.1364, 0.3071, 0.3690, 0.5637, 0.4605, 0.2530, 0.2204, 0.5793, 0.5839, 0.3589, 0.3086, 0.2634, 0.4841, 0.4797, 0.4425, 0.4909, 0.4329, 0.0540, 0.4848, 0.5616, 0.6947, 0.4466, 0.3668, 0.1101, 0.4345, 0.5792, 0.3192, 0.5665, 0.0412, 0.2009, 0.3693, 0.0834}

E 100−2 = {0.0447, 0.5115, 0.0318, 0.0536, 0.3912, 0.0725, 0.6136, 0.6132, 0.5418, 0.1124, 0.4947, 0.3889, 0.7297, 0.8000, 0.6002, 0.3403, 0.3243, 0.6190, 0.0626, 0.0999, 0.1300, 0.2932, 0.6235, 0.6025, 0.0454, 0.2994, 0.3952, 0.3126, 0.4926, 0.4710, 0.2190, 0.3237, 0.0116, 0.7380, 0.1254, 0.0797, 0.2793, 0.1486, 0.3673, 0.2546, 0.7137, 0.6902, 0.0395, 0.5534, 0.2018, 0.3171, 0.4109, 0.7071, 0.3133, 0.7373, 0.2261, 0.5258, 0.4998, 0.4043, 0.5236, 0.4999, 0.1336, 0.0960, 0.7493, 0.1283, 0.0245, 0.4209, 0.6614, 0.5019, 0.1428, 0.2767, 0.3455, 0.7362, 0.1173, 0.6416, 0.4836, 0.2822, 0.1432, 0.3212, 0.3615, 0.0905, 0.4421, 0.1696, 0.2885, 0.4372, 0.1889, 0.2178, 0.4628, 0.1990, 0.6183, 0.7370, 0.5477, 0.2579, 0.4381, 0.0808, 0.6797, 0.6597, 0.6133, 0.1955, 0.4458, 0.0169, 0.3189, 0.2345, 0.1211, 0.1341}

E 200 = {0.2537, 0.0565, 0.3591, 0.2826, 0.4176, 0.4199, 0.3831, 0.0202, 0.0413, 0.1918, 0.3185, 0.3927, 0.2446, 0.4920, 0.4310, 0.5812, 0.3188, 0.1951, 0.0634, 0.3666, 0.4673, 0.2541, 0.0545, 0.1599, 0.0922, 0.1686, 0.2641, 0.3163, 0.2745, 0.5252, 0.3108, 0.5662, 0.3826, 0.5746, 0.1444, 0.4057, 0.1734, 0.4031, 0.4171, 0.0408, 0.1529, 0.1344, 0.4007, 0.5066, 0.2067, 0.4683, 
0.4052, 0.0040, 0.3613, 0.2321, 0.5496, 0.0007, 0.2775, 0.2546, 0.2765, 0.4621, 0.1935, 0.4708, 0.2828, 0.0215, 0.1055, 0.4331, 0.2841, 0.0916, 0.2047, 0.3644, 0.1150, 0.4431, 0.1457, 0.5505, 0.1614, 0.4593, 0.1132, 0.1725, 0.0547, 0.3457, 0.4100, 0.3280, 0.2554, 0.3867, 0.3886, 0.4074, 0.3815, 0.5671, 0.1254, 0.4256, 0.1417, 0.0716, 0.3644, 0.2701, 0.2752, 0.3972, 0.4622, 0.2101, 0.3972, 0.2497, 0.5052, 0.4998, 0.8000, 0.3681, 0.3493, 0.3244, 0.5220, 0.1589, 0.1908, 0.0715, 0.5639, 0.3873, 0.2877, 0.3836, 0.3268, 0.3884, 0.3263, 0.4326, 0.3135, 0.5962, 0.1312, 0.0635, 0.0658, 0.0382, 0.2427, 0.2690, 0.2195, 0.4581, 0.3767, 0.4632, 0.5597, 0.5836, 0.1152, 0.0833, 0.4178, 0.0563, 0.3152, 0.3182, 0.5167, 0.2909, 0.2361, 0.4029, 0.4448, 0.3120, 0.2086, 0.0900, 0.3517, 0.1573, 0.0267, 0.4530, 0.1457, 0.2654, 0.4127, 0.2155, 0.4418, 0.2368, 0.4100, 0.4224, 0.2654, 0.0117, 0.1985, 0.2546, 0.1622, 0.1182, 0.4930, 0.2580, 0.5327, 0.2347, 0.4615, 0.2381, 0.4851, 0.4530, 0.2264, 0.1296, 0.4742, 0.5696, 0.1965, 0.4028, 0.2632, 0.5001, 0.4613, 0.1004, 0.5172, 0.5939, 0.3087, 0.5306, 0.3528, 0.0929, 0.1199, 0.2442, 0.4492, 0.4954, 0.4740, 0.1911, 0.3204, 0.0540, 0.0670, 0.0818, 0.4072, 0.2971, 0.1138, 0.2970, 0.0886, 0.0330}
The differences between the two largest reward probabilities of each environment are 0.1053, 0.0507 and 0.2038, respectively.
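The gap between the two largest reward probabilities, used above as a measure of how confusing an environment is, can be checked with a small helper. The helper name is ours, and the short input list below is a truncated stand-in for the full E 100−1 vector.

```python
def optimality_gap(probs):
    """Difference between the two largest reward probabilities.
    A small gap means distracters sit close to the optimum, i.e. the
    environment is more confusing. Helper name is illustrative."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]

# e.g. the two largest values of E 100-1 are 0.8 and 0.6947
assert abs(optimality_gap([0.8, 0.6947, 0.6732, 0.6660]) - 0.1053) < 1e-9
```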
The three newly generated environments are considered "more complex and confusing" for the following reasons. Firstly, the number of available actions is far greater than in the five initial environments, i.e., the automaton has more options to choose from, which makes the environment more complex in nature. Secondly, there are many distracters (actions with probabilities close to the second highest probability, including the second highest itself) that make the environment more confusing. For example, in E 100−1 the optimal action is action 14, with a reward probability of 0.8, and the reward probability of the second best action (action 89) is 0.6947. In addition, two further actions, action 1 and action 4, have reward probabilities rather close to that of action 89, namely 0.6660 and 0.6732, respectively, as illustrated in Fig. 4. In this case, the automaton must sample all the actions sufficiently to distinguish the optimal action from the distracters.
As a result, these environments are described as "more complex and confusing" environments.
Cite this article
Ge, H., Jiang, W., Li, S. et al. A novel estimator based learning automata algorithm. Appl Intell 42, 262–275 (2015). https://doi.org/10.1007/s10489-014-0594-1