A learning search algorithm with propagational reinforcement learning

Zhang, Wei

doi:10.1007/s10489-020-02117-0

A learning search algorithm with propagational reinforcement learning

Published: 22 March 2021

Volume 51, pages 7990–8009, (2021)
Cite this article

Applied Intelligence Aims and scope Submit manuscript

Wei Zhang ORCID: orcid.org/0000-0003-3442-7697¹

367 Accesses
1 Altmetric
Explore all metrics

Abstract

When reinforcement learning with a deep neural network is applied to heuristic search, the search becomes a learning search. In a learning search system, there are two key components: (1) a deep neural network with sufficient expression ability as a heuristic function approximator that estimates the distance from any state to a goal; (2) a strategy to guide the interaction of an agent with its environment to obtain more efficient simulated experience to update the Q-value or V-value function of reinforcement learning. To date, neither component has been sufficiently discussed. This study theoretically discusses the size of a deep neural network for approximating a product function of p piecewise multivariate linear functions. The existence of such a deep neural network with O(n + p) layers and O(dn + dnp + dp) neurons has been proven, where d is the number of variables of the multivariate function being approximated, 𝜖 is the approximation error, and n = O(p + log₂(pd/𝜖)). For the second component, this study proposes a general propagational reinforcement-learning-based learning search method that improves the estimate h(.) according to the newly observed distance information about the goals, propagates the improvement bidirectionally in the search tree, and consequently obtains a sequence of more accurate V-values for a sequence of states. Experiments on the maze problems show that our method increases the convergence rate of reinforcement learning by a factor of 2.06 and reduces the number of learning episodes to 1/4 that of other nonpropagating methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Target-Driven Autonomous Robot Exploration in Mappless Indoor Environments Through Deep Reinforcement Learning

Learning to explore by reinforcement over high-level options

Article 30 November 2023

Route searching based on neural networks and heuristic reinforcement learning

Article 09 February 2017

Notes

See the related works in Section 3.1.
A learning loop here refers to a series of steps of RL starting with “Select an action and execute it”.
The main cause is “the high time complexity of single problem solving”, i.e., their one-off search algorithms are inefficient.
𝜖 − A^∗ takes the meaning of (and is similar to) 𝜖 − Greedy.
That is the “strategy” component mentioned in the abstract.
Here a represents an action.
See the proof of Theorem 2.
The proof is quite lengthy, and due to space limitations, it was moved to Appendix B.
In this paper, O(.) means the upper bound, Ω(.) is the lower bound, and Θ (.) is a tight bound (it is both the upper and lower bound).
This formula means that h_DNN(.) approximates h^∗(.) to logarithmic precision.
Here, R₁ = [0, c₁] ×… × [0, c_d], R₂ = (c₁, 1] ×… × (c_d, 1], and 0 < c_j < 1.
In the caption of Fig. 3, $\tilde {x}^{(1)}$ is the binary expansion of x⁽¹⁾, $\tilde {x}^{(1)}\approx x^{(1)}$, ${\tilde {h}}_{1}(\mathbf {x})$ = $h_{1}{\tilde {(\mathbf {x})}}$ ≈ h₁(x), where h₁(x) is an approximation of h1R(x). The definitions are given in the following paragraph.
We define $\tilde {h}_{l}(x^{(1)}, \ldots ,x^{(d)})$ $\triangleq $h_l(x~⁽¹⁾,…, x~^(d)) and can prove $\tilde {h}_{l}(x^{(1)}, \ldots ,x^{(d)})$ ≈ hlR(x⁽¹⁾,…, x^(d)) such that $|{h_{1}^{R}}(x^{(1)},\ldots {},$ $x^{(d)}) -\tilde {h}(x^{(1)},\ldots , x^{(d)})| \leq \varepsilon $ , see Theorem 1.
It is a 2-variable linear function.
The functions H_j(x, y) and H^∗(⋅) are defined below.
That is a part of the whole shortest distance from (x,y) to the goal. Here $t_{i} \triangleq (x_{-} t_{i}, y_{-} t_{i})$.
The $\max \limits \{0,x\}$ is defined as follows: if x ≥ 0, then $\max \limits \{0,x\}=x$; otherwise, $\max \limits \{0,x\}=0$.
Note that each H_i group has four neuron layers (including the input layer (x, y, Constants)), as shown in Fig. 6.
There are 8 H_i groups, each of which has 4 II(⋅) neurons and 4 ReLU neurons.
$H^{*}_{2}(x\_t_{1},y\_t_{1}$) computes the length of the shortest path from (R₁’s) island t₁ to (R₂’s) island t₂ within R₂. The “Const.” inputted to $H^{*}_{2}$ is t₁ = (x_t₁, y_t₁). Since island t₀ does not exist, there is no $H^{*}_{1}$ group. So, there are 7 $H^{*}_{i}$ groups.
That is: 8 $\sum $ units in the 8 H_i groups (in the 3^rd layer) and 7 $\sum $ units in the 7 $H^{*}_{i}$ groups.
That is the 4^th layer of the DNN.
There are 8 “H_i+ $\sum H^{*}_{i}$” computing units in the 5^th layer of the DNN in Fig. 7.
The last layer is used to select the outputs of the previous neuron groups, only one of the “Sgn(H_i)”s is not 0, where Sgn(H₈)=sgn(9 − x) ⋅ sgn(9 − y) ⋅ sgn(x − 7) ⋅ sgn(y − 3) representing the boundary constraints of R₈.
For the maze problem, it is the H^⋆(x, y).
According to the algorithms’ naming scheme in Section 6, the algorithm is called Algorithm 4, ALGO 4 in short.
That is the one-off search component of the RL algorithm is efficient.
In the simple modified A^∗ algorithm, a global variable F is introduced to record the expanded nodes’ maximum f -values. In the OPEN list, for nodes with f -values less than F, those with small g-values are first selected for expansion. In this way, some repeated node expansions can be avoided [3].
V-Learning represents the V-value learning based RL algorithm.
Here, we have p-value < α = 0.05, which is the significance level of the Wilcoxon rank-sum test, α = 0.05 corresponds to a confidence level of 95%.
The search tree of the Eight-puzzle problem, whose shortest solution path length is 23, includes more than 6500 nodes, whereas the search tree of the maze problem contains 100 nodes.
When solving Constrained Optimization problem, C is usually the radius of an L1-norm ball, in which the parameters are constrained to lie [8].
That is the required episode number to learn h(.) of all nodes on a path (x, t).

References

Silver D, Huang A, Maddison C, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529:484–489
Article Google Scholar
Korf R (1990) Real-time heuristic search. Artificial Intelligence 42:189–211
Article Google Scholar
Martelli A (1977) On the complexity of admissible search algorithms. Artificial Intelligence 8:1–13
Article MathSciNet Google Scholar
Mero L (1984) A heuristic search algorithm with modifiable estimate. Artificial Intelligence 23:13–27
Article MathSciNet Google Scholar
Zhang W, Yu R, He Z (1990) The lower bound on the worst-case complexity of admissible search algorithms. Chinese Journal of Computer 13:450–455
MathSciNet Google Scholar
Barto A, Bradtke S, Singh S (1995) Learning to act using real-time dynamic programming. Artificial Intelligence 72:21–138
Article Google Scholar
Pearl J (1983) Knowledge versus search: a quantitative analysis using a^∗. Artificial Intelligence 20:1–13
Article MathSciNet Google Scholar
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge
MATH Google Scholar
Yarotsky D (2107) Error bounds for approximations with deep reLU networks. Neural Networks 94:103–114
Article Google Scholar
Mnih V, Kavukcuoglu K, Silver D, et al. (2015) Human-level control through deep reinforcement learning. Nature 518:529–533
Article Google Scholar
Schaul T, Horgan D, Gregor K (2015) Universal value function approximators. In: Proceedings of the 32nd international conference on machine learning. Lille, France, pp 1312–1320
Pan Y, Yao H, Farahmand A, et al. (2019) Hill climbing on value estimates for search-control in Dyna. In: Proceedings of the 28th international joint conference on artificial intelligence (IJCAI 2019). Macau, China, pp 3209–3215
Chen J, Sturtevant N (2019) Conditions for avoiding node re-expansions in bounded suboptimal search. In: Proceedings of the 28th international joint conference on artificial intelligence (IJCAI-2019). Macao, China, pp 1220–1226
Pan Y, Zaheer M, White A, et al. (2018) Organizing experience: a deeper look at replay mechanisms for samplebased planning in continuous state domains. In: Proceedings of the 27th international joint conference on artificial intelligence (IJCAI-2018). Stockholm, Sweden, pp 4794–4800
Thirugnanasambandam K, Prakash S, Subramanian V, et al. (2019) Reinforced cuckoo search algorithm-based multimodal optimization[J]. Applied Intelligence 49:2059–2083
Article Google Scholar
Xiong H, Huang L, Yu M, et al. (2020) On the number of linear regions of convolutional neural networks. In: Proceedings of the 37th international conference on machine learning (ICML-2020). Vienna, Austria. arXiv:2006.00978
Corneil D, Gerstner W, Brea J (2018) Efficient model-based deep reinforcement learning with variational state tabulation. In: Proceedings of the 35th international conference on machine learning (ICML-2018). Stockholm, Sweden, pp 1049–1058
Sutton R, Barto A (1998) Reinforcement learning: an introduction. IEEE Transactions on Neural Networks 9:1054–1054
Article Google Scholar
Nilsson N (1998) Artificial intelligence: a new synthesis. Morgan Kaufmann, San Francisco
MATH Google Scholar
Montufar G, Pascanu R, Cho K, et al. (2014) On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems 27:2924–2932
Google Scholar
Liang S, Srikant R (2017) Why deep neural networks for function approximation. In: Proceedings of the 5th international conference on learning representations. Toulon, France, pp 24–26
Zhang C, Bengio S, Hardt M, et al. (2017) Understanding deep learning requires rethinking generalization. In: Proceedings of the 5th international conference on learning representations (ICLR-2017). Toulon, France, pp 1–15
Yun C, Sra S, Jadbabaie A (2019) Small reLU networks are powerful memorizers: a tight analysis of memorization capacity (NIPS-2019). Vancouver, Canada, pp 15558–15569. arXiv:1810.07770
Raghu M, Poole B, Kleinberg J, et al. (2017) On the expressive power of deep neural networks. In: Proceedings of the 34th international conference on machine learning (ICML-2017). Sydney, Australia, pp 2847–2854
Andoni A, Panigrahy R, Valiant G, Zhang L (2014) Learning polynomials with neural networks. PMLR (Proceedings of Machine Learning Research) 32:1908–1916
Google Scholar
Lao Z, Zhou Z, Huang J (2017) Incremental extreme learning machine via fast random search Method[J]. Lecture Notes in Computer Science 10634:82–90
Article Google Scholar
Brookshear J (2002) Computer science: an overview. Addison-Wesley Longman Publishing Co., Boston
MATH Google Scholar
Nebel B (2000) On the compilability and expressive power of propositional planning formalisms. Journal of Artificial Intelligence Research 12:271–315
Article MathSciNet Google Scholar

Download references

Acknowledgements

This work was supported by Liaoning Provincial Natural Science Foundation of China (Grant No. 20170540311).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Northeastern University, Shenyang, China
Wei Zhang

Authors

Wei Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wei Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Theorem 1

Theorem 3

For the p-order multivariate piecewise function h(x⁽¹⁾,…, x^(d) ) defined by (4) and arbitrary error ε > 0, let (x⁽¹⁾,…, x^(d)) ∈ [0, 1]^d, w_i=(w_i1,…, w_id) be the coefficient vector of h_i(.) that constitutes h(x⁽¹⁾,…, x^(d) ) and satisfies ∥w_i∥₁ ≤ C, C is a constant. If h(x⁽¹⁾,…, x^(d) ) is differentiable, then there exists a DNN with O(n+p) layers and O(dn + dnp + dp ) neurons, implementing an $\tilde {h}(x^{(1)}, {\ldots } ,x^{(d)})$ such that

$$ |h(x^{(1)},\ldots{}, x^{(d)}) -\tilde{h}(x^{(1)},\ldots, x^{(d)})| \leq \varepsilon, $$

where $n=(p +1)\log C +\log [dp/\varepsilon ]$.

Proof

First, for each component x^(k) of the multivariate input vector (x⁽¹⁾,…, x^(d) ), we can obtain its binary expansion sequence {$x_{0}^{(k)}, {\ldots } ,x_{n}^{(k)}$} by means of an n-layer DNN with n BSP neurons. Denote the approximate value of x^(k) calculated by this sequence as ${\tilde {x}}^{(k)}$, that is,

$$ {\tilde{x}}^{(k)} =\mathrm{ x}_{0}^{(k)}/2^{0} +x_{1}^{(k)}/2^{1} +\ldots{}+x_{n}^{(k)}/2^{n} = {\sum}_{i = 0}^{n}{x_{i}^{(k)}/2^{i}}. $$

According to the definition of the n-bit binary expansion of a decimal fractional number, the error between ${\tilde {x}}^{(k)}$ and the original x^(k) satisfies |x^(k) -${\tilde {x}}^{(k)} |$ ≤ 1/2 ⁿ. Approximately implementing (i.e., calculating) one x^(k) requires n neurons, so implementing all x⁽¹⁾,…, x^(d) in parallel requires nd neurons. Therefore, the first n layers of the DNN in Fig. 3 have nd neurons, which are not displayed due to size limitations.

Second, the DNN recursively calculates $\tilde {h}(x^{(1)},{\ldots } ,$ x^(d)) through two channels to approximate h(x⁽¹⁾,…, x^(d)). If the input vector (x⁽¹⁾,…, x^(d)) ∈ [0, c₁] ×… × [0, c_d], the upper channel is switched on (π sgn(⋅)≠ 0), and the DNN outputs $\tilde {h}(x^{(1)},\ldots {}, x^{(d)}$ )=${\prod }_{i = 1}^{p}{(w_{i1}{\tilde {x}}^{(1)}} + {\ldots } + w_{\text {id}}{\tilde {x}}^{(d)}) $ (see (4)); otherwise, the lower channel is turned on, and the result is $\tilde {h}$(x⁽¹⁾,…, x^(d) ) =${\prod }_{i = 1}^{p}{(w_{i1}^{\prime }{\tilde {x}}^{(1)}} + {\ldots } + w_{\text {id}}^{\prime }{\tilde {x}}^{(d)})$.

Consider the case of (x⁽¹⁾,…, x^(d)) ∈ [0, c₁] ×… × [0, c_d]. According to the previous explanation of Fig. 3, the DNN recursively calculates the following terms from left to right in the figure in accordance with (11):

$$ \begin{aligned} & \{f_{1}(x^{(1)},\ldots{}, x^{(d)} ), \tilde{h}_{1}(x^{(1)},\ldots{}, x^{(d)} ) \}, \\ & \{f_{2}(x^{(1)},\ldots{}, x^{(d)} ), \tilde{h}_{2}(x^{(1)},\ldots{}, x^{(d)} ) \}, \\ & \ldots{}\ldots{}, \\ & \{f_{l+1} (x^{(1)},\ldots{}, x^{(d)} ), \tilde{h}_{l+1}(x^{(1)},\ldots{}, x^{(d)} ) \}, \\ & \ldots{}\ldots{} \\ & \{f_{p}(x^{(1)},\ldots{}, x^{(d)} ), \tilde{h}_{p}(x^{(1)},\ldots{}, x^{(d)} ) \}. \end{aligned} $$

where l = 0,…, p − 1, and ${\tilde {h}}_{l + 1}\left (\cdot \right )$ is an approximation of $h_{l + 1}^{R}(\cdot ) $ that is calculated according to (11) by replacing x with $\tilde {x}$. Define $h_{l + 1}^{R}\left (x^{\left (1 \right )}, {\ldots } ,x^{\left (d \right )} \right ) \approx \tilde {h}$_l+ 1(x⁽¹⁾,…, $ x^{(d)} ) \triangleq $ h_l_+ 1 ($\tilde {x}^{(1)}, \ldots {}, \tilde {x}^{(d)}$ ) and $\tilde {f}_{l+1} (x^{(1)},\ldots {}, x^{(d)} ) \triangleq f_{l+1}\tilde {x}^{(1)}, \ldots {}, \tilde {x}^{(d)}$ ):

$$ \begin{array}{@{}rcl@{}} \tilde{f}_{l+1} (x^{(1)},\ldots{}, x^{(d)} )\! & = &\! f_{l+1}(\tilde{x}^{(1)}, \ldots{}, \tilde{x}^{(d)} )\\ & = &\! w_{(l + 1)1} \cdot {\tilde{x}}^{(1)} \! + {\ldots} +\! w_{(l + 1)d} \cdot {\tilde{x}}^{(d)} \end{array} $$

$$ \begin{aligned} & = w_{(l + 1)1} \cdot {\sum}_{i = 0}^{n}\frac{x_{i}^{(1)}}{2^{i}} + {\ldots} + w_{(l + 1)d} \cdot {\sum}_{i = 0}^{n}\frac{x_{i}^{(d)}}{2^{i}} \\ & = w_{(l + 1)1} \cdot \max(0,{\sum}_{i = 0}^{n}\frac{x_{i}^{\left( 1 \right)}}{2^{i}}) + {\ldots} + w_{(l + 1)d} \\&\cdot \max(0,{\sum}_{i = 0}^{n}\frac{x_{i}^{(d)}}{2^{i}}) \end{aligned} $$

(A-1)

$$ \begin{aligned} &h_{l + 1}^{R}\left( x^{\left( 1 \right)}, {\ldots} ,x^{\left( d \right)} \right) \\& \approx \tilde{h}_{l+1}(x^{(1)},\ldots{}, x^{(d)}) = h_{l+1} (\tilde{x}^{(1)}, \ldots{}, \tilde{x}^{(d)} ) \\ & = \{ {\prod}_{j = 1}^{d}{\text{sgn}(c_{j} - {\tilde{x}}^{(j)})\rbrack \cdot f_{l + 1}({\tilde{x}}^{(1)}, {\ldots} ,{\tilde{x}}^{(d)})} \} \\&\cdot {\tilde{h}}_{l}({{x}}^{(1)}, {\ldots} ,{{x}}^{(d)}) \\ & = \{ {\prod}_{j = 1}^{d}\text{sgn}(c_{j} - {\tilde{x}}^{(j)})\rbrack \cdot {\sum}_{k = 1}^{d}w_{(l+1)k} \\&\cdot max(0,{\sum}_{i = 0}^{n}\frac{x_{i}^{(k)}}{2^{i}}) \} \cdot {\tilde{h}}_{l}({{x}}^{(1)}, {\ldots} ,{{x}}^{(d)}) \end{aligned} $$

(A-2)

In particular, we define ${\tilde {h}}_{0}\left (x^{(1)}, {\ldots } ,x^{(d)} \right ) = 1.$

It can be seen from (A-2) that $\tilde {h}_{l+1}(x^{(1)},\ldots {}, x^{(d)}$) is calculated according to (11), but ($x^{\left (1 \right )},\ldots ,x^{\left (d \right )})$ is replaced with (${\tilde {x}}^{\left (1 \right )},\ldots ,{\tilde {x}}^{\left (d \right )})$, so $\tilde {h}_{l+1}(x^{(1)},\ldots {}, x^{(d)}$) is an approximation of $h_{l + 1}^{R}\left (x^{\left (1 \right )}, {\ldots } ,x^{\left (d \right )} \right )$. That is $\tilde {h}$_l+ 1($x^{(1)},\ldots {}, x^{(d)} ) \approx h_{l + 1}^{R}\left (x^{\left (1 \right )},\ldots ,x^{\left (d \right )} \right ).$ Note that in the case of ($x_{0}^{(1)}, {\ldots } ,x_{n}^{(1)})$ falls in [0, c₁] ×… × [0, c_d]), there is π sgn(⋅) = 1 and [1- π sgn(⋅) ]= 0.

When $\left (l + 1 \right ) = p$, there is $\tilde {h}$_p($x^{(1)},\ldots {}, x^{(d)} ) \approx {h_{p}^{R}}\left (x^{\left (1 \right )}, {\ldots } ,x^{\left (d \right )} \right ) = h\left (x^{\left (1 \right )}, {\ldots } ,x^{\left (d \right )} \right )$, which is the desired. (see (10)).

In the first hidden layer of the DNN in Fig. 3, for the binary expansion sequence of each input variable (e.g., the ($x_{0}^{(1)}, {\ldots } ,x_{n}^{(1)})$ of x⁽¹⁾ ), there are n ReLU neurons directly transmitting the sequence to the next hidden layer, one ReLU neuron implementing the w₁₁ ⋅ max(0, ${\sum }_{i = 0}^{n}\frac {x_{i}^{(1)}}{2^{i}})$ term of (A-2), and one BSU neuron calculating the sgn($c^{(1)}-{\sum }_{i = 0}^{n}\frac {x_{i}^{(1)}}{2^{i}})$ term. For the d variables $x^{\left (1 \right )},\ldots ,x^{\left (d \right )}$, there are d such (n + 1 + 1) neuron groups in the first hidden layer of each channel, so there are 2d groups, and 2d ⋅ (n + 1 + 1) neurons in the two channels.

In addition, there are six neurons behind the first hidden layer that combine the previous sgn(.), w₁₁ ⋅max (.), $\tilde h_{0}(.)$ ($\tilde h_{0}$(.)= 1) according to (11) to form f₁(.) and $\tilde {h}$_l(.). (Note that for convenience, we also count these six neurons as the ones in the first hidden layer). These {2d ⋅(n + 2)+ 6} neurons implement $\tilde {h}$_l(.) and send (x⁽¹⁾,…, x^(d)) and $\tilde {h}$_l(.) to the second hidden layer that calculates f₂(.) and $\tilde {h}_{2}$(.) in the same way; … By analogy, f_p(.) and $\tilde {h}_{p}$(.) are finally obtained. The p hidden layers contain [{2d ⋅(n + 2)+ 6} ⋅ p] neurons. Adding the nd units representing the input variables, the whole DNN contains a total of (n + p) layers and [nd + {2d ⋅(n + 2)+ 6} ⋅ p ] neurons. Simply, it has O(n + p) layers and O(nd+dnp+dp ) neurons.

When the input vector (x⁽¹⁾,…, x^(d) )∉[0, c₁] ×… × [0, c_d], similarly, we can get the same conclusion.

Finally, we estimate the error of the DNN approximation:

$$ \begin{aligned} & \left| h(x^{(1)}, \ldots, x^{(d)}) - \tilde{h}(x^{(1)}, \ldots, x^{(d)}) \right| && ~\\ =& \left| h(x^{(1)}, \ldots, x^{(d)}) - h(\tilde{x}^{(1)}, \ldots, \tilde{x}^{(d)}) \right| && (\text{see (A-2)}) \\ =& \left| \frac{\partial h}{\partial x^{(1)}} {\varDelta} x^{(1)} + \frac{\partial h}{\partial x^{(2)}} {\varDelta} x^{(2)} + {\ldots} + \frac{\partial h}{\partial x^{(d)}} {\varDelta} x^{(d)} \right| && (\text{Lagrange's mean value theorem}\\ &&& \text{for multivariable functions}) \\ \\ =& \left| \left( \frac{\partial h}{\partial x^{(1)}}, \frac{\partial h}{\partial x^{(2)}} , {\ldots} , \frac{\partial h}{\partial x^{(d)}} \right) \cdot \left( {\varDelta} x^{(1)}, {\varDelta} x^{(2)}, \ldots, {\varDelta} x^{(d)} \right)\right| && (\text{Rewrite it into a dot product} \\ &&& \text{ of vectors.)} \end{aligned} $$

(A-3)

$$ \begin{array}{@{}rcl@{}} \text{where}~ {\varDelta} x^{(i)} & = & x^{(i)} - \tilde{x}^{(i)} > 0, i=1,\ldots, d,\\ &&\frac{\partial h}{\partial x^{(i)}} = \frac{\partial h(\tilde{x}^{(1)} + \theta {\varDelta} x^{(1)}, \ldots, \tilde{x}^{(d)} +\theta {\varDelta} x^{(d)})}{\partial x^{(i)}}, ~~0\!<\! \theta \!<\! 1 \end{array} $$

(A-4)

Calculate the partial derivative:

$$ \begin{aligned} \frac{\partial h}{\partial x^{(k)}} & = \frac{\partial h(\tilde{x}^{(1)} +\theta {\varDelta} x^{(1)}, \ldots, \tilde{x}^{(d)} +\theta {\varDelta} x^{(d)})}{\partial x^{(k)}} && \text{(change the name of the partial derivative} \\ &&& \text{variable to avoid variable conflict.)} \end{aligned} $$

$$ \begin{aligned} = & \frac{\partial \left\{ \prod\limits_{i=1}^p h_i(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(d)}) \right\}}{\partial x^{(k)}} && (\text{Substitute}~ h(.) ~\text{by (8), the bold}~ \mathbf{x}^{(1)} ~\text{in} \\ &&& \text{the numerator represents } \tilde{x}^{(1)}+ \theta{\varDelta} x^{(1)}, \text{that is,} \\ &&& \mathbf{x}^{(1)} = \tilde{x}^{(1)}+ \theta(x^{(1)} - \tilde{x}^{(1)} ). \text{ Note that in the light}\\ &&& \text{of the mean value theorem, there is } \tilde{x}^{(1)} < \mathbf{x}^{(1)}< x^{(1)} ) \end{aligned} $$

$$ \begin{aligned} = & \frac{\partial \left\{ \prod\limits_{i=1}^p \left[ \prod\limits_{j=1}^d \text{sgn}(c_j -\mathbf{x}^{(j)}) \right] \cdot f_i(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(d)}) \right\}}{\partial x^{(k)}} \\ &\qquad\qquad\qquad\qquad\qquad\qquad ~ (\text{Substitute}~ h_{i}(.) ~\text{by (6), and notice that} \\ &\qquad\qquad\qquad\qquad\qquad\qquad ~ (x^{(1)},\ldots{}, x^{(d)} ) ~\text{falls in}~ [0, c_{1}] \times{\ldots} \times [0, c_{d}], \\ & \qquad\qquad\qquad\qquad\qquad\qquad ~ \text{ so that}~ \lbrack 1 - {\prod}_{j = 1}^{d}{\text{sgn}(c_{j} - \mathbf{x}^{(j)})\rbrack = 0}.) \\ =& \frac{\partial\left\{ {\prod}_{i=1}^p f_i (\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(d)}) \right\}}{\partial x^{(k)}} \qquad\qquad (\text{Note that} {\prod}_{j = 1}^{d}{\text{sgn}(c_{j} - \mathbf{x}^{(j)}) = 1}) \end{aligned} $$

$$ \begin{aligned} &= \frac{\partial f_1(\cdot)}{\partial x^{(k)}} \cdot f_2(\cdot) \cdot {\ldots} \cdot f_p(\cdot) +f_1(\cdot) \frac{\partial f_2(\cdot)}{\partial x^{(k)}} \cdot f_3(\cdot) {\ldots} f_p(\cdot) + {\ldots} + f_1(\cdot) \cdot {\ldots} \cdot f_{p-1}(\cdot) \frac{\partial f_p(\cdot)}{\partial x^{(k)}} \\ & = w_{1k} \theta \cdot {\prod}_{i=2, i\neq 1}^p f_i(\cdot) + w_{2k} \theta \cdot {\prod}_{i=1, i\neq 2}^p f_i(\cdot) +\ldots + w_{pk} \theta \cdot {\prod}_{i=1, i\neq p}^p f_i(\cdot) \end{aligned} $$

$$ \begin{aligned} &\qquad\qquad\qquad\qquad\qquad\qquad\qquad\text{ (Substitute}~ f_{i}\left( \mathbf{x}^{\left( 1 \right)}, {\ldots} ,\mathbf{x}^{\left( d \right)} \right) ~\text{ with } w_{i1} \cdot \mathbf{x}^{(1)} + {\ldots} + w_{\text{id}} \cdot \mathbf{x}^{(d)} ~\text{and} \\ &\qquad\qquad\qquad\qquad\qquad\qquad\qquad\text{ calculate the partial derivative. The only non-zero term is} \\ &\qquad\qquad\qquad\qquad\qquad\qquad\qquad {w}_{jk}\textbf{x} {~}^{(k)} = w_{\text{jk}}(\tilde{x}^{(k)} + \theta(x^{(k)} - \tilde{x}^{(k)} )), ~\text{whose partial derivative is} {w}_{jk} \theta .) \end{aligned} $$

$$ \begin{aligned} & = \sum\limits_{j=1}^p w_{jk} \theta \cdot {\prod}_{i=1, i\neq j}^p f_i(\cdot) \\ & = \sum\limits_{j=1}^p w_{jk} \theta \cdot {\prod}_{i=1, i\neq j}^p \left( w_{i1} \mathbf{x}^{(1)} + w_{i2} \mathbf{x}^{(2)} +\ldots + w_{ik} \mathbf{x}^{(k)} \right)\\ &\qquad\qquad\qquad\qquad\qquad\text{(Substitute}~ f_{i}\left( \mathbf{x}^{\left( 1 \right)}, {\ldots} ,\mathbf{x}^{\left( d \right)} \right) \text{ with } w_{i1} \cdot \mathbf{x}^{(1)} + {\ldots} + w_{\text{id}} \cdot \mathbf{x}^{(d)}) \end{aligned} $$

Consider the absolute value of the partial derivative:

$$ \begin{aligned} & \left| \frac{\partial h(\tilde{x}^{(1)} +\theta {\varDelta} x^{(1)}, \ldots, \tilde{x}^{(k)} +\theta {\varDelta} x^{(k)})}{\partial x^{(k)}} \right| = \left| \sum\limits_{j=1}^p w_{jk} \theta \cdot {\prod}_{i=1, i\neq j}^p \left( w_{i1} \mathbf{x}^{(1)} + w_{i2} \mathbf{x}^{(2)} +\ldots + w_{ik} \mathbf{x}^{(k)} \right) \right| \\ \leq & \sum\limits_{j=1}^p \left| w_{jk} \right| \cdot \theta \cdot {\prod}_{i=1, i\neq j}^p \left( \left|w_{i1} \mathbf{x}^{(1)} \right| +\ldots + \left| w_{ik} \mathbf{x}^{(k)} \right| \right) \\ = & \sum\limits_{j=1}^p \left| w_{jk} \right| \cdot \theta \cdot {\prod}_{i=1, i\neq j}^p \left( \left|w_{i1} \right| +\ldots + \left| w_{ik} \right| \right) \qquad \text{(Note that $(x^{(1)},\ldots{}, x^{(d)}) \in [0,c_{1}] \times \ldots$ } \\ & \text{ $\times [0,c_{d}]$ and $0<\mathbf{x}^{(j)}<x^{(j)}<c_{j}<1)$ } \\ \leq & \sum\limits_{j=1}^p \left| w_{jk} \right| \cdot \theta \cdot C^p \qquad \text{(According to the conditions of Theorem 1, there is} \|\mathbf{w}_{i}\|_{1}\leq C. \end{aligned} $$

The meaning of C is typically the radius of the norm ball^{Footnote 32} in which the parameters w_i₁,…,w_id are constrained to lie.)

$$ \begin{aligned} \leq & pC \cdot \theta\cdot C^{p} \qquad\qquad~ (\|\mathbf{w}_{i}\|_{1}\leq C) \\ = & p \cdot{} \theta\cdot{} C^{p+1} \\ \leq & p\cdot{} C^{p+1} \qquad\qquad\quad \text{(In the light of the mean value theorem, there is} 0 < \theta < 1) \qquad\qquad \end{aligned} $$

Therefore, following (A-3), the error estimation can be obtained as follows.

$$ \begin{aligned} & \left| h(x^{(1)}, \ldots, x^{(d)}) - \tilde{h}(x^{(1)}, \ldots, x^{(d)}) \right| = \left| \left( \frac{\partial h}{\partial x^{(1)}}, \frac{\partial h}{\partial x^{(2)}} , {\ldots} , \frac{\partial h}{\partial x^{(d)}} \right) \cdot \left( {\varDelta} x^{(1)}, {\varDelta} x^{(2)}, \ldots, {\varDelta} x^{(d)} \right)\right| \\ \leq & \left\| \left( \frac{\partial h}{\partial x^{(1)}}, \frac{\partial h}{\partial x^{(2)}} , {\ldots} , \frac{\partial h}{\partial x^{(d)}} \right) \right\|_{2} \cdot \left\| \left( {\varDelta} x^{(1)}, {\varDelta} x^{(2)}, \ldots, {\varDelta} x^{(d)} \right)\right\|_{2} \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad (Cauchy Schwarz inequality) } \end{aligned} $$

$$ \begin{aligned} =& \left\{ \left( \frac{\partial f}{\partial x_{1}} \right)^{2} + \ldots + \left( \frac{\partial f}{\partial x_{d}} \right)^{2} \right\}^{1/2} \cdot \left\| \left( {\varDelta} x^{(1)}, \ldots, {\varDelta} x^{(d)} \right) \right\|_{2} \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad (According to the definition of} L^{2} \text{norm}, \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad calculate the first} L^{2} \text{norm. Combined with the} \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad previous discussion:} \frac{\partial f}{\partial x_{k}}\leq pC^{p+1}, \text{we have)} \end{aligned} $$

$$ \begin{aligned} \leq & \{ p^{2} C^{2(p+1)} +\ldots{}+p^{2} C^{2(p+1)} \}^{1/2} \cdot \left\| \left( {\varDelta} x^{(1)}, \ldots, {\varDelta} x^{(d)} \right) \right\|_{2} \\ & \quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad~~\text{ (The total number of} (\mathrm{p}^{2} \mathrm{C}^{2(p+1)} ) \text{terms is} {d}.) \\ = & pC^{p+1} \cdot \sqrt{d} \cdot \left\| \left( {\varDelta} x^{(1)}, \ldots, {\varDelta} x^{(d)} \right) \right\|_{2} \end{aligned} $$

(A-5)

Calculate the second L² norm $\left \| \left ({\varDelta } x^{(1)}, \ldots , {\varDelta } x^{(d)} \right ) \right \|_{2}$:

$$ \begin{aligned} & \left\| \left( {\varDelta} x^{(1)}, \ldots, {\varDelta} x^{(d)} \right) \right\|_{2} = \sqrt{ \left( {\varDelta} x^{(1)} \right)^{2} + {\ldots} +\left( {\varDelta} x^{(d)}\right)^{2} } , \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad (By the definition of the \textit{n}-bit binary} \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad expansion of fractional number, there is)} \end{aligned} $$

${\varDelta } x^{(i)} = x^{(i)} -\tilde {x}^{(i)} \leq 1/2^{n}$, for i = 1,…, d, So we have

$$ \begin{aligned} & \left\| \left( {\varDelta} x^{(1)}, \ldots, {\varDelta} x^{(d)} \right) \right\|_{2} = & \sqrt{\left( \frac{1}{2^{n}} \right)^{2} +\ldots + \left( \frac{1}{2^{n}} \right)^{2} } = & ~~\sqrt{d} \cdot \left( \frac{1}{2^{n}} \right) \end{aligned} $$

(A-6)

Combining this formula (about the second L ² norm) with (A-5), we get:

$$ \left| h(x^{(1)}, \ldots, x^{(d)}) - \tilde{h}(x^{(1)}, \ldots, x^{(d)}) \right| \!\leq\! pC^{p+1}\cdot{} d \cdot{} \left( \frac{1}{2^{n}} \right). $$

Let the right side be ε. So, $\varepsilon = pC^{p+1}\cdot {} d \cdot {} \left (\frac {1}{2^{n}} \right )$, and $n = (p+1) \log _{2} C + \log _{2} \left (\frac {pd}{\varepsilon } \right )$. Therefore, we can choose $n =(p + 1)\log _{2} C + \log _{2} \left (\frac {pd}{\varepsilon } \right )$ to make $\tilde {h}(x^{(1)}, \ldots , x^{(d)})$ approach h(x⁽¹⁾,…, x^(d)), satisfying

$$ \left| h(x^{(1)}, \ldots, x^{(d)}) - \tilde{h}(x^{(1)}, \ldots, x^{(d)}) \right| \leq \varepsilon. $$

Similarly, when the input vector (x⁽¹⁾,…, x^(d))∉[0, c₁] ×… × [0, c_d], the lower channel is turned on, and the DNN implements $\tilde {h}$(x⁽¹⁾,…, x^(d) ) =${\prod }_{i = 1}^{p}{(w_{i1}^{\prime }{\tilde {x}}^{(1)}} + {\ldots } + w_{\text {id}}^{\prime }{\tilde {x}}^{(d)})$. Under the same value of $n, \tilde {h}$(x⁽¹⁾,…, x^(d) ) still achieves an ε-approximation to h(x⁽¹⁾,…, x^(d)).

In summary, for the p-order multivariate piecewise function h(x⁽¹⁾,…, x^(d) ) described in this theorem, there exists a DNN with O(n + p) layers and O(dn + dnp + dp) neurons, which implements an $\tilde {h}(x^{(1)}, {\ldots } ,x^{(d)})$, satisfying:

$$ |h(x^{(1)},\ldots{}, x^{(d)}) - \tilde{h}(x^{(1)}, {\ldots} ,x^{(d)})| \leq \varepsilon, $$

where $n=(p +1) \log _{2} C + log_{2} [dp/\varepsilon ]$. □

Appendix B: Proof of Theorem 2

Proof

. The proof of this theorem comprises two parts.

(1) In each problem-solving episode employing ALGO 4, no nodes are re-expanded.

The formal proof of this part is rather long, and, therefore, only the main idea (i.e., proof by contradiction) has been included here for brevity. Let us assume that the node n₀ is expanded repeatedly when the propagational 𝜖 − A^∗ algorithm is applied. Thus, there must exist at least two paths − path₁ and path₂ − from the initial node s to n₀. Let us assume length(path₁) > length(path₂). Next, n₀ is first expanded once through path₁ (and the child node of n₀ − defined as n₁ − is placed onto the OPEN list). Subsequently, nodes are expanded through path₂ from s to n₀’s parent node x_k lying on the same path. Thus, a shorter path (path₂) is discovered to connect s and n₀, and n₀ is placed back onto the OPEN list. The backward pointer of n₀ points toward x_k for n₀ to be expanded once again. Because the algorithm first explores path₁, a node x_i must exist on path₂, and the value of its heuristic function f(x_i) exceeds those of f(.) of all nodes (including n₀) on path₁. That is, f(x_i) > f(n₀). Because the proposed search algorithm enables bidirectional propagation of heuristic information (to ensure all nodes in the search tree satisfy: h(parent) ≤ h(child) + c(parent, child)), the next time it encounters n₀ through path₂, it would propagate the larger f(x_i) value to x_k, n₀ (and its child node n₁) through path₂ such that f(x_i) = f(x_k) = f(n₀) = f(n₁). At this moment, both n₀ and n₁ remain on the OPEN list and satisfy f(n₀) = f(n₁), and g(n₀) < g(n₁) (g(n₀) denotes the known distance from s to n₀). Under this condition, the proposed algorithm would first expand n₁ but not n₀. This contradicts the assumption, and it follows that n₀ would not be expanded once again. Therefore, in the worst-case scenario of a problem-solving episode employing the propagation-enabling 𝜖 − A^∗ algorithm, all nodes with f-values smaller than the length of the optimal solution path f^∗(s, t) would only be expanded once. Thus, the maximum time complexity equals O(M). In contrast, the maximum time complexity of the ordinary A* algorithm equals $\mathrm {O}\left (2^{\mathrm {M}}\right )$ [3], whereas that of the simple modified A* algorithm (ALGO B) equals $\mathrm {O}\left (\mathrm {M}^{2}\right )$.

(2) The total time complexities of ALGO 4 and ALGO 3 equal O(M²) and $O(M^{3} \cdot \sqrt {M})$, respectively.

The time complexity of an RL algorithm is determined by two factors: (1) the time required for each problem-solving episode and (2) total number of episodes required to achieve convergence. The time complexity of each problem-solving episode when using the different algorithms is discussed above. Thus, this passage discusses the estimation of the number of episodes required to achieve convergence using the above-mentioned algorithms.

For any RL algorithm, the number of episodes required to achieve convergence (i.e., episode count) depends on factors such as the structure of the problem space and accuracy of the initial heuristic function. According to the method proposed by Korf to study the performance of the LRTA* algorithm when solving the maze problem with no obstacles [2], it is assumed that the state space of the maze problem comprises a matrix with N × N unit squares (denoted as M = N × N). The initial state s is placed in the upper-left corner of the matrix (square), while the target state t is placed in the lower-right corner. Additionally, it is also assumed that the initial heuristic function h(.) = 0. This follows that the ordinary V-Learning requires $\mathrm {O}\left (\mathrm {N}^{3}\right )(\mathrm {i} . \mathrm {e} ., \mathrm {O}(\mathrm {M} \cdot \sqrt {\mathrm {M}}))$ episodes to achieve convergence. This total episode count can be deduced as follows.

In a given state space, each state has a shortest path to the target. Moreover, there exist no obstacles in the matrix. Thus, the length of the solution path from the state s in the upper-left corner to the target state t equals 2N − 2. Additionally, compared to other solution paths in the state space, the state s is the furthest away from the target state t. In contrast, the length of the solution path from a state in the lower-right corner to the target state is zero, which is the shortest solution path. The average lengths of the optimum-solution paths for all states is N, while there exist N² states within the state space. It is assumed that the simple modified A* algorithm is applied for each problem-solving episode. Additionally, the V-Learning-based RL is adopted to learn the heuristic function h(.). When the maze problem is solved using RL the first time for a state x with the shortest solution path measuring N, the heuristic function h(.) is partially learned and set $h\left (x_{N-1}\right )=h^{*}\left (x_{N-1}\right )=1$ for the state x_N− 1 on the solution $path(x,t)=\left \{\mathrm {x}=\mathrm {x}_{0}, \mathrm {x}_{1}, \ldots , \mathrm {x}_{\mathrm {N}}=\mathrm {t}\right \}$. Similarly, when the problem is solved for the second time, we have $h\left (x_{N-2}\right )=h^{*}\left (x_{N-2}\right )=2$ for x_N− 2 on path(x, t). Eventually, when the problem is solved the N^th time, we have $h\left (x_{0}\right )=h^{*}\left (x_{0}\right )=N$. That is, to learn the heuristic functions h(.) = h^∗(.) of all nodes on a given path, the required number of episodes equals the length of the path. (In contrast, the propagation-enabling 𝜖 − A^∗ algorithm only needs to solve the problem once to learn the h(.) of a path). Therefore, when the simple modified A* algorithm is used in each problem-solving episode (ALGO 3), the number episodes required for convergence is given by

$$ \begin{array}{@{}rcl@{}} &&\text {Total episode count for ALGO 3}\\ &=& \text {average length of solution paths} \times \text{number of states} \\ &=&\mathrm{N} \times \mathrm{N}^{2}=\mathrm{O}\left( \mathrm{N}^{3}\right) \\ &=&\mathrm{O}(\mathrm{M} \cdot \sqrt{\mathrm{M}}) \end{array} $$

For the maze problem examined in this study, the state space S comprises a matrix with N × N (= M) unit squares with barriers (Fig. 4). S is divided into eight regions $\left (\mathrm {R}_{1}, \ldots , \mathrm {R}_{8}\right )$ by the barriers; thus, S = R₁ ∪R₂ ∪… ∪R₈. For any state x ∈ R₇ ∪ R₈, when x is located in the upper-left corner, its solution path is the longest (equal to N + N/2). In contrast, when x is located in the lower-right corner, its solution path becomes the shortest (zero). Therefore, the average solution-path length is L = 3N/4. The number of states in R₇ ∪ R₈ approximately is N × N/2 = N²/2. Because the proposed algorithm (ALGO 4) adopts the propagation-enabling 𝜖 − A^∗ algorithm during each problem-solving episode, it will cause the h(x_i) of all nodes x_i on the solution $path(x,t)=\left \{\mathrm {x}=\mathrm {x}_{0}, \mathrm {x}_{1}, \ldots , \mathrm {x}_{\mathrm {L}}=\mathrm {t}\right \}$ to equal h^∗(x_i) through a single problem-solving episode. For a detailed description of the propagation rules, readers are advised to please refer Step 7 of ALGO 4 and Section 2.2.2. Accordingly, in sub-spaces R₇ ∪R₈, the total number of problem-solving episodes required to achieve convergence approximately equals: (episode number required per solution path)^{Footnote 33}× (number of states) = $ 1 \times \left (\mathrm {N}^{2} / 2\right )=\mathrm {N}^{2} / 2$.

A similar analysis has been performed for sub-spaces R₁ ∪… ∪R₆, and it has been observed that, when x ∈R₁ ∪… ∪R₆, the average length of the solution path(x, t) is ((4N − 3) + (N/2 + N − 3))/2 = 11N/4 − 3. Thus, the propagation-enabling 𝜖 − A^∗ algorithm still yields $\mathrm {h}\left (\mathrm {x}_{\mathrm {i}}\right )=\mathrm {h}^{*}\left (\mathrm {x}_{\mathrm {i}}\right )$ for all nodes x_i on path(x, t) = $\left \{\mathrm {x}=\mathrm {x}_{0}, \mathrm {x}_{1}, \ldots , \mathrm {x}_{\mathrm {L}}=\mathrm {t}\right \}$ through a single problem-solving episode. In sub-spaces R₁ ∪… ∪R₆, the number of states approximately equals N²/2 (neglecting the states occupied by barriers). Thus, the number of episodes required for the propagation-enabling RL in R₁ ∪… ∪R₆ equals: (episode number required per solution path) × (number of states) = 1 × N²/2 = N²/2. Hence, for the state space S in the maze problem with barriers, ALGO 4 yields the following total number of episodes to achieve h(.) = h^∗(.) for all states.

$$ \begin{array}{@{}rcl@{}} &&\text {Total episode count for ALGO 4}\\ &=& \text {(number of problem-solving in } {R}_{7} \cup {R}_{8}) + \text{(number } \\ &\text{of}& \text {problem-solving in } {R}_{1} \cup {\ldots} \cup {R}_{6} ) \\ &=&{N}^{2} / 2+{N}^{2} / 2 = {N}^{2} = {O}({N}^{2}) = {O}({M}) \end{array} $$

In the state space S of the maze problem with barriers, a similar analysis has been performed on ALGO 3. Accordingly, it is observed that the total episodes count required equals $3 \mathrm {N}^{3}-3 \mathrm {N}^{2}=\mathrm {O}\left (\mathrm {N}^{3}\right )=\mathrm {O}(\mathrm {M} \cdot \sqrt {\mathrm {M}})$. The conclusion from this theorem can be drawn by multiplying the number of episodes required to achieve convergence when using ALGO 3 and ALGO 4 by their corresponding time complexities for a single problem-solving episode. □

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhang, W. A learning search algorithm with propagational reinforcement learning. Appl Intell 51, 7990–8009 (2021). https://doi.org/10.1007/s10489-020-02117-0

Download citation

Accepted: 02 December 2020
Published: 22 March 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s10489-020-02117-0

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A learning search algorithm with propagational reinforcement learning

Abstract

Access this article

Similar content being viewed by others

Target-Driven Autonomous Robot Exploration in Mappless Indoor Environments Through Deep Reinforcement Learning

Learning to explore by reinforcement over high-level options

Route searching based on neural networks and heuristic reinforcement learning

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Appendices

Appendix A: Proof of Theorem 1

Theorem 3

Proof

Appendix B: Proof of Theorem 2

Proof

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A learning search algorithm with propagational reinforcement learning

Abstract

Access this article

Similar content being viewed by others

Target-Driven Autonomous Robot Exploration in Mappless Indoor Environments Through Deep Reinforcement Learning

Learning to explore by reinforcement over high-level options

Route searching based on neural networks and heuristic reinforcement learning

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Appendices

Appendix A: Proof of Theorem 1

Theorem 3

Proof

Appendix B: Proof of Theorem 2

Proof

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation