Abstract
Reinforcement learning is a branch of Artificial Intelligence, and learning automata are among the most powerful tools in this research area. In the evolution of learning automata, the rate of convergence has been the primary goal of algorithm design. In this paper, we propose a deterministic-estimator-based learning automaton (LA) in which the estimate of each action is the upper bound of a confidence interval, rather than the Maximum Likelihood Estimate (MLE) that is widely used in existing estimator LA schemes. The philosophy is to assign more confidence to actions that have been selected only a few times, so that the automaton is encouraged to explore uncertain actions. Once all actions have been fully explored, the automaton behaves just like the Generalized Pursuit Algorithm. A refined analysis is presented to establish the 𝜖-optimality of the proposed algorithm. Extensive simulations demonstrate that the presented LA converges faster than any deterministic-estimator learning automaton reported to date. Moreover, we extend our algorithm to stochastic estimator schemes. The extended LA achieves a significant performance improvement over the current state-of-the-art learning automata, especially in complex and confusing environments.
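The abstract's core idea, scoring each action by the upper bound of a confidence interval instead of its MLE, can be sketched as follows. This is a minimal illustration only: the exact confidence interval used in the paper is not reproduced here, and a UCB1-style bound (mean plus an exploration bonus) is substituted as an assumption; the function name `ucb_estimate` is ours.

```python
import math

def ucb_estimate(successes, pulls, total_pulls):
    """Upper confidence bound on an action's reward probability.

    Illustrative only: the paper's exact interval is not reproduced;
    a UCB1-style bonus term is used here as a stand-in.
    """
    if pulls == 0:
        return 1.0  # unexplored actions get maximal confidence
    mle = successes / pulls  # the MLE used by classical estimator LA
    bonus = math.sqrt(2.0 * math.log(total_pulls) / pulls)
    return min(1.0, mle + bonus)

# A rarely sampled action gets an estimate above its MLE,
# encouraging the automaton to explore it.
assert ucb_estimate(1, 2, 100) > 0.5
```

As the sample count of an action grows, the bonus term shrinks and the estimate approaches the MLE, so the automaton gradually reverts to pursuit-style behaviour.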
Notes
A larger value of γ indicates a more complex and confusing environment, and vice versa. However, as mentioned in Section 2.3, it is hard to obtain a proper γ in practical applications. It is equally unreasonable to assume this prior knowledge in the simulations, so the comparison is somewhat unfair. Our algorithm is extended in the next section to make a fair comparison with SE_RI.
References
Agache M, Oommen BJ (2002) Generalized pursuit learning schemes: new families of continuous and discretized learning automata. IEEE Trans Syst Man Cybern B Cybern 32(6):738–749
Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press
Crofts AE (1982) On a property of the F distribution. Trabajos de Estadística y de Investigación Operativa 33(2):110–111
Fathy F, Salek N, Masoudi Y, Laleh E (2013) Distributing of patterns in cutter machines boards using learning automata. In: International Conference on Communication Systems and Network Technologies (CSNT), 2013, pp 774–777
Jiang W, Zhao CL, Li SH, Chen L (2014) A new learning automata based approach for online tracking of event patterns. Neurocomputing 137:205–211
Krishna PV, Misra S, Joshi D, Obaidat MS (2013) Learning automata based sentiment analysis for recommender system on cloud. In: International Conference on Computer, Information and Telecommunication Systems (CITS), 2013, IEEE, pp 1–5
Lanctôt JK, Oommen BJ (1992) Discretized estimator learning automata. IEEE Trans Syst Man Cybern 22(6):1473–1483
Leemis LM, Trivedi KS (1996) A comparison of approximate interval estimators for the Bernoulli parameter. The American Statistician 50(1):63–68
Lukacs E (1970) Characteristic functions, vol 4. Griffin, London
Martin R, Tilak O, et al. (2012) On 𝜖-optimality of the pursuit learning algorithm. J Appl Probab 49(3):795–805
Misra S, Krishna P, Saritha V, Obaidat M (2013) Learning automata as a utility for power management in smart grids. IEEE Commun Mag 51(1):98–104
Misra S, Krishna P, Kalaiselvan K, Saritha V, Obaidat M (2014) Learning automata-based QoS framework for cloud IaaS. IEEE Trans Netw Serv Manag 99:1–10
Moradabadi B, Beigy H (2013) A new real-coded Bayesian optimization algorithm based on a team of learning automata for continuous optimization. Genet Program Evolvable Mach:1–25
Narendra KS, Thathachar M (1974) Learning automata - a survey. IEEE Trans Syst Man Cybern 4:323–334
Narendra KS, Thathachar MA (2012) Learning automata: an introduction. Courier Dover Publications
Oommen BJ, Agache M (2001) Continuous and discretized pursuit learning schemes: various algorithms and their comparison. IEEE Trans Syst Man Cybern B Cybern 31(3):277–287
Oommen BJ, Lanctôt JK (1990) Discretized pursuit learning automata. IEEE Trans Syst Man Cybern 20(4):931–938
Papadimitriou GI (1994) A new approach to the design of reinforcement schemes for learning automata: stochastic estimator learning algorithms. IEEE Trans Knowl Data Eng 6(4):649–654
Papadimitriou GI, Sklira M, Pomportsis AS (2004) A new class of ε-optimal learning automata. IEEE Trans Syst Man Cybern B Cybern 34(1):246–254
Rasouli N, Meybodi M, Morshedlou H (2013) Virtual machine placement in cloud systems using learning automata. In: 13th Iranian conference on fuzzy systems (IFSC), 2013, pp 1–5
Rezvanian A, Rahmati M, Meybodi MR (2014) Sampling from complex networks using distributed learning automata. Physica A: Stat Mech Appl 396:224–234
Sastry P (1985) Systems of learning automata: estimator algorithms applications. PhD thesis, Dept of Electrical Engineering, Indian Institute of Science, Bangalore
Thathachar M, Oommen B (1979) Discretized reward-inaction learning automata. J Cybern Inf Sci 2(1):24–29
Thathachar M, Sastry P (1985) A new approach to the design of reinforcement schemes for learning automata. IEEE Trans Syst Man Cybern 1:168–175
Tsetlin M (1973) Automaton theory and modeling of biological systems. Academic Press
Tsetlin ML (1961) On the behavior of finite automata in random media. Avtomatika i Telemekhanika 22:1345–1354
Varshavskii V, Vorontsova I (1963) On the behavior of stochastic automata with variable structure. Autom Remote Control 24(3):327
Yazidi A, Granmo OC, Oommen B (2013) Learning-automaton-based online discovery and tracking of spatiotemporal event patterns. IEEE Trans Cybern 43(3):1118–1130
Zhang J, Lina N, Chen X, Shangce G, Zheng T (2012) Inertial estimator learning automata. IEICE Trans Fundam Electron Commun Comput Sci 95(6):1041–1048
Zhang X, Granmo OC, Oommen BJ (2013a) On incorporating the paradigms of discretization and Bayesian estimation to create a new family of pursuit learning automata. Appl Intell 39(4):782–792
Zhang X, Granmo OC, Oommen BJ, Jiao L (2013b) On using the theory of regular functions to prove the 𝜖-optimality of the continuous pursuit learning automaton. In: Recent Trends in Applied Artificial Intelligence. Springer, pp 262–271
Zhong W, Xu Y, Wang J, Li D, Tianfield H (2014) Adaptive mechanism design and game theoretic analysis of auction-driven dynamic spectrum access in cognitive radio networks. EURASIP J Wirel Commun Netw 2014(1):44
Acknowledgments
This work is funded by the National Science Foundation of China (61271316), the 973 Program of China (2010CB731403, 2010CB731406, 2013CB329603, 2013CB329605), the Key Laboratory for Shanghai Integrated Information Security Management Technology Research, and the Chinese National Engineering Laboratory for Information Content Analysis Technology.
Appendices
A Generation of the New Environments
Environments E 100−1 and E 100−2 have 100 actions each, and environment E 200 has 200 actions. The actions' reward probabilities are randomly generated under some restrictions.
In E 100−1, the reward probabilities of all actions are drawn from a uniform distribution on [0, 0.7], and the 14th action's reward probability is then set to 0.8. In the same way, the reward probabilities of E 100−2 are drawn from a uniform distribution on [0, 0.75], and the 14th action's reward probability is then set to 0.8. The reward probabilities of E 200 are drawn from a uniform distribution on [0, 0.6], and the 99th action's reward probability is then set to 0.8.
It is noted that a randomly generated environment may have two identical highest reward probabilities, or two highest reward probabilities that are very close to each other. In such cases, the environment is either indistinguishable or too complex. To prevent this from happening, we set a single action's reward probability higher than the randomly generated ones, as described in the previous paragraph. This ensures that the difference between the two highest reward probabilities is large enough, so that the optimal action is unique and more evident.
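The generation procedure above can be sketched in a few lines. This is a minimal sketch under the stated scheme; the function name, seed handling, and default arguments are ours, not the paper's.

```python
import random

def generate_environment(n_actions, cap, best_index, best_reward=0.8, seed=None):
    """Draw reward probabilities uniformly on [0, cap], then force a
    unique optimal action at position best_index (1-indexed, as in the
    text). Helper name and seed parameter are illustrative."""
    rng = random.Random(seed)
    probs = [rng.uniform(0.0, cap) for _ in range(n_actions)]
    probs[best_index - 1] = best_reward
    return probs

# E 100-1 style: 100 actions, uniform on [0, 0.7], action 14 set to 0.8
env = generate_environment(100, 0.7, best_index=14)
assert env[13] == 0.8 and max(env) == 0.8
```

Because the cap is strictly below 0.8, the forced action is guaranteed to be the unique optimum.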
B Complexity Validation of the Generated Environments
Next, we examine the generated environments to make sure they are sufficiently complex and confusing. The reward probabilities of the three generated environments are listed as follows:
E 100−1 = {0.6660, 0.6358, 0.6223, 0.6732, 0.3667, 0.2940, 0.3470, 0.5296, 0.5429, 0.2126, 0.2234, 0.4816, 0.5310, 0.8000, 0.6443, 0.1880, 0.6366, 0.1041, 0.5364, 0.2806, 0.1249, 0.2224, 0.1024, 0.2570, 0.3813, 0.0943, 0.3239, 0.6444, 0.0286, 0.2413, 0.4858, 0.0315, 0.1729, 0.4617, 0.2190, 0.4679, 0.4060, 0.6011, 0.3666, 0.5703, 0.4128, 0.1013, 0.1030, 0.1659, 0.1388, 0.0878, 0.4212, 0.0811, 0.4730, 0.2900, 0.6361, 0.5439, 0.5791, 0.2008, 0.0478, 0.4847, 0.1652, 0.1375, 0.4146, 0.1701, 0.5988, 0.3715, 0.0421, 0.1232, 0.5310, 0.5912, 0.3500, 0.1727, 0.1364, 0.3071, 0.3690, 0.5637, 0.4605, 0.2530, 0.2204, 0.5793, 0.5839, 0.3589, 0.3086, 0.2634, 0.4841, 0.4797, 0.4425, 0.4909, 0.4329, 0.0540, 0.4848, 0.5616, 0.6947, 0.4466, 0.3668, 0.1101, 0.4345, 0.5792, 0.3192, 0.5665, 0.0412, 0.2009, 0.3693, 0.0834}

E 100−2 = {0.0447, 0.5115, 0.0318, 0.0536, 0.3912, 0.0725, 0.6136, 0.6132, 0.5418, 0.1124, 0.4947, 0.3889, 0.7297, 0.8000, 0.6002, 0.3403, 0.3243, 0.6190, 0.0626, 0.0999, 0.1300, 0.2932, 0.6235, 0.6025, 0.0454, 0.2994, 0.3952, 0.3126, 0.4926, 0.4710, 0.2190, 0.3237, 0.0116, 0.7380, 0.1254, 0.0797, 0.2793, 0.1486, 0.3673, 0.2546, 0.7137, 0.6902, 0.0395, 0.5534, 0.2018, 0.3171, 0.4109, 0.7071, 0.3133, 0.7373, 0.2261, 0.5258, 0.4998, 0.4043, 0.5236, 0.4999, 0.1336, 0.0960, 0.7493, 0.1283, 0.0245, 0.4209, 0.6614, 0.5019, 0.1428, 0.2767, 0.3455, 0.7362, 0.1173, 0.6416, 0.4836, 0.2822, 0.1432, 0.3212, 0.3615, 0.0905, 0.4421, 0.1696, 0.2885, 0.4372, 0.1889, 0.2178, 0.4628, 0.1990, 0.6183, 0.7370, 0.5477, 0.2579, 0.4381, 0.0808, 0.6797, 0.6597, 0.6133, 0.1955, 0.4458, 0.0169, 0.3189, 0.2345, 0.1211, 0.1341}

E 200 = {0.2537, 0.0565, 0.3591, 0.2826, 0.4176, 0.4199, 0.3831, 0.0202, 0.0413, 0.1918, 0.3185, 0.3927, 0.2446, 0.4920, 0.4310, 0.5812, 0.3188, 0.1951, 0.0634, 0.3666, 0.4673, 0.2541, 0.0545, 0.1599, 0.0922, 0.1686, 0.2641, 0.3163, 0.2745, 0.5252, 0.3108, 0.5662, 0.3826, 0.5746, 0.1444, 0.4057, 0.1734, 0.4031, 0.4171, 0.0408, 0.1529, 0.1344, 0.4007, 0.5066, 0.2067, 0.4683, 
0.4052, 0.0040, 0.3613, 0.2321, 0.5496, 0.0007, 0.2775, 0.2546, 0.2765, 0.4621, 0.1935, 0.4708, 0.2828, 0.0215, 0.1055, 0.4331, 0.2841, 0.0916, 0.2047, 0.3644, 0.1150, 0.4431, 0.1457, 0.5505, 0.1614, 0.4593, 0.1132, 0.1725, 0.0547, 0.3457, 0.4100, 0.3280, 0.2554, 0.3867, 0.3886, 0.4074, 0.3815, 0.5671, 0.1254, 0.4256, 0.1417, 0.0716, 0.3644, 0.2701, 0.2752, 0.3972, 0.4622, 0.2101, 0.3972, 0.2497, 0.5052, 0.4998, 0.8000, 0.3681, 0.3493, 0.3244, 0.5220, 0.1589, 0.1908, 0.0715, 0.5639, 0.3873, 0.2877, 0.3836, 0.3268, 0.3884, 0.3263, 0.4326, 0.3135, 0.5962, 0.1312, 0.0635, 0.0658, 0.0382, 0.2427, 0.2690, 0.2195, 0.4581, 0.3767, 0.4632, 0.5597, 0.5836, 0.1152, 0.0833, 0.4178, 0.0563, 0.3152, 0.3182, 0.5167, 0.2909, 0.2361, 0.4029, 0.4448, 0.3120, 0.2086, 0.0900, 0.3517, 0.1573, 0.0267, 0.4530, 0.1457, 0.2654, 0.4127, 0.2155, 0.4418, 0.2368, 0.4100, 0.4224, 0.2654, 0.0117, 0.1985, 0.2546, 0.1622, 0.1182, 0.4930, 0.2580, 0.5327, 0.2347, 0.4615, 0.2381, 0.4851, 0.4530, 0.2264, 0.1296, 0.4742, 0.5696, 0.1965, 0.4028, 0.2632, 0.5001, 0.4613, 0.1004, 0.5172, 0.5939, 0.3087, 0.5306, 0.3528, 0.0929, 0.1199, 0.2442, 0.4492, 0.4954, 0.4740, 0.1911, 0.3204, 0.0540, 0.0670, 0.0818, 0.4072, 0.2971, 0.1138, 0.2970, 0.0886, 0.0330}
The differences between the two largest reward probabilities of each environment are 0.1053, 0.0507 and 0.2038, respectively.
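The gap between the two largest reward probabilities, used above as a measure of how confusing an environment is, can be checked with a small helper. The helper name is ours, and the short input list below is a truncated stand-in for the full E 100−1 vector.

```python
def optimality_gap(probs):
    """Difference between the two largest reward probabilities.
    A small gap means distracters sit close to the optimum, i.e. the
    environment is more confusing. Helper name is illustrative."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]

# e.g. the two largest values of E 100-1 are 0.8 and 0.6947
assert abs(optimality_gap([0.8, 0.6947, 0.6732, 0.6660]) - 0.1053) < 1e-9
```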
The three newly generated environments are considered "more complex and confusing" for the following reasons. Firstly, the number of available actions is far greater than in the five initial environments, i.e., the automaton has more options to choose from, which makes the environment more complex in nature. Secondly, there are many distracters (actions with probabilities close to the second highest probability, including the second highest itself) that make the environment more confusing. For example, in E 100−1 the optimal action is action 14, with a reward probability of 0.8, and the reward probability of the second best action (action 89) is 0.6947. In addition, two further actions, action 1 and action 4, have reward probabilities rather close to that of action 89, namely 0.6660 and 0.6732, respectively, as illustrated in Fig. 4. In this case, the automaton must sample all the actions sufficiently to distinguish the optimal action from the distracters.
As a result, these environments are described as "more complex and confusing" environments.
Cite this article
Ge, H., Jiang, W., Li, S. et al. A novel estimator based learning automata algorithm. Appl Intell 42, 262–275 (2015). https://doi.org/10.1007/s10489-014-0594-1