
A learning search algorithm with propagational reinforcement learning


Abstract

When reinforcement learning with a deep neural network is applied to heuristic search, the search becomes a learning search. A learning search system has two key components: (1) a deep neural network with sufficient expressive ability serving as a heuristic function approximator that estimates the distance from any state to a goal; (2) a strategy that guides the agent's interaction with its environment so as to obtain more efficient simulated experience for updating the Q-value or V-value function of reinforcement learning. To date, neither component has been sufficiently discussed. This study theoretically analyzes the size of a deep neural network for approximating a product function of p piecewise multivariate linear functions. The existence of such a deep neural network with O(n + p) layers and O(dn + dnp + dp) neurons is proven, where d is the number of variables of the multivariate function being approximated, 𝜖 is the approximation error, and n = O(p + log2(pd/𝜖)). For the second component, this study proposes a general propagational reinforcement-learning-based learning search method that improves the estimate h(.) according to newly observed distance information about the goals, propagates the improvement bidirectionally in the search tree, and consequently obtains a sequence of more accurate V-values for a sequence of states. Experiments on maze problems show that our method increases the convergence rate of reinforcement learning by a factor of 2.06 and reduces the number of learning episodes to 1/4 that of other, nonpropagating methods.


Notes

  1. See the related works in Section 3.1.

  2. A learning loop here refers to a series of steps of RL starting with “Select an action and execute it”.

  3. The main cause is “the high time complexity of single problem solving”, i.e., their one-off search algorithms are inefficient.

  4. 𝜖A takes its meaning from (and is similar to) 𝜖-greedy.

  5. That is the “strategy” component mentioned in the abstract.

  6. Here a represents an action.

  7. See the proof of Theorem 2.

  8. The proof is quite lengthy, and due to space limitations, it was moved to Appendix B.

  9. In this paper, O(.) denotes an upper bound, Ω(.) a lower bound, and Θ(.) a tight bound (both an upper and a lower bound).

  10. This formula means that hDNN(.) approximates h(.) to logarithmic precision.

  11. Here, R1 = [0, c1] ×… × [0, cd], R2 = (c1, 1] ×… × (cd, 1], and 0 < cj < 1.

  12. In the caption of Fig. 3, \(\tilde{x}^{(1)}\) is the binary expansion of x(1), so \(\tilde{x}^{(1)} \approx x^{(1)}\), and \(\tilde{h}_{1}(\mathbf{x}) = h_{1}(\tilde{\mathbf{x}}) \approx h_{1}(\mathbf{x})\), where h1(x) is an approximation of \(h_{1}^{R}(\mathbf{x})\). The definitions are given in the following paragraph.

  13. We define \(\tilde{h}_{l}(x^{(1)}, \ldots, x^{(d)}) \triangleq h_{l}(\tilde{x}^{(1)}, \ldots, \tilde{x}^{(d)})\) and can prove \(\tilde{h}_{l}(x^{(1)}, \ldots, x^{(d)}) \approx h_{l}^{R}(x^{(1)}, \ldots, x^{(d)})\), i.e., \(|h_{l}^{R}(x^{(1)}, \ldots, x^{(d)}) - \tilde{h}_{l}(x^{(1)}, \ldots, x^{(d)})| \leq \varepsilon\); see Theorem 1.

  14. It is a 2-variable linear function.

  15. The functions Hj(x, y) and H(⋅) are defined below.

  16. That is a part of the whole shortest distance from (x, y) to the goal. Here \(t_{i} \triangleq (x\_t_{i}, y\_t_{i})\).

  17. \(\max\limits\{0,x\}\) is defined as follows: if x ≥ 0, then \(\max\limits\{0,x\}=x\); otherwise, \(\max\limits\{0,x\}=0\).

  18. Note that each Hi group has four neuron layers (including the input layer (x, y, Constants)), as shown in Fig. 6.

  19. There are 8 Hi groups, each of which has 4 Π(⋅) neurons and 4 ReLU neurons.

  20. \(H^{*}_{2}(x\_t_{1},y\_t_{1}\)) computes the length of the shortest path from (R1’s) island t1 to (R2’s) island t2 within R2. The “Const.” inputted to \(H^{*}_{2}\) is t1 = (x_t1, y_t1). Since island t0 does not exist, there is no \(H^{*}_{1}\) group. So, there are 7 \(H^{*}_{i}\) groups.

  21. That is: 8 \(\sum \) units in the 8 Hi groups (in the 3rd layer) and 7 \(\sum \) units in the 7 \(H^{*}_{i}\) groups.

  22. That is the 4th layer of the DNN.

  23. There are 8 “Hi+ \(\sum H^{*}_{i}\)” computing units in the 5th layer of the DNN in Fig. 7.

  24. The last layer selects among the outputs of the previous neuron groups; only one of the “Sgn(Hi)”s is nonzero, where Sgn(H8) = sgn(9 − x) ⋅ sgn(9 − y) ⋅ sgn(x − 7) ⋅ sgn(y − 3) represents the boundary constraints of R8.

  25. For the maze problem, it is the H(x, y).

  26. According to the algorithms’ naming scheme in Section 6, the algorithm is called Algorithm 4, ALGO 4 in short.

  27. That is, the one-off search component of the RL algorithm is efficient.

  28. In the simple modified A* algorithm, a global variable F is introduced to record the maximum f-value among the expanded nodes. In the OPEN list, among the nodes with f-values less than F, those with small g-values are selected first for expansion. In this way, some repeated node expansions can be avoided [3] (see the sketch following these notes).

  29. V-Learning denotes the V-value-learning-based RL algorithm.

  30. Here, p-value < α = 0.05, the significance level of the Wilcoxon rank-sum test; α = 0.05 corresponds to a confidence level of 95%.

  31. The search tree of the Eight-puzzle problem, whose shortest solution path length is 23, includes more than 6500 nodes, whereas the search tree of the maze problem contains 100 nodes.

  32. When solving a constrained optimization problem, C is usually the radius of an L1-norm ball in which the parameters are constrained to lie [8].

  33. That is, the number of episodes required to learn the h(.) of all nodes on a path (x, t).
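The selection rule described in note 28 can be sketched as follows. This is a minimal illustration under assumed node fields f and g, not the paper's implementation:

```python
def select_for_expansion(open_list, F):
    """Selection rule of the simple modified A* (note 28), sketched.

    F records the maximum f-value among the nodes expanded so far. Among
    OPEN nodes whose f-value is below F, the one with the smallest g-value
    is expanded first, which avoids some repeated expansions; otherwise
    the usual smallest-f rule applies.
    """
    below_F = [node for node in open_list if node.f < F]
    if below_F:
        return min(below_F, key=lambda node: node.g)
    return min(open_list, key=lambda node: node.f)
```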

References

  1. Silver D, Huang A, Maddison C, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529:484–489


  2. Korf R (1990) Real-time heuristic search. Artificial Intelligence 42:189–211


  3. Martelli A (1977) On the complexity of admissible search algorithms. Artificial Intelligence 8:1–13


  4. Mero L (1984) A heuristic search algorithm with modifiable estimate. Artificial Intelligence 23:13–27


  5. Zhang W, Yu R, He Z (1990) The lower bound on the worst-case complexity of admissible search algorithms. Chinese Journal of Computers 13:450–455


  6. Barto A, Bradtke S, Singh S (1995) Learning to act using real-time dynamic programming. Artificial Intelligence 72:81–138


  7. Pearl J (1983) Knowledge versus search: a quantitative analysis using A*. Artificial Intelligence 20:1–13


  8. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge


  9. Yarotsky D (2017) Error bounds for approximations with deep ReLU networks. Neural Networks 94:103–114


  10. Mnih V, Kavukcuoglu K, Silver D, et al. (2015) Human-level control through deep reinforcement learning. Nature 518:529–533


  11. Schaul T, Horgan D, Gregor K (2015) Universal value function approximators. In: Proceedings of the 32nd international conference on machine learning. Lille, France, pp 1312–1320

  12. Pan Y, Yao H, Farahmand A, et al. (2019) Hill climbing on value estimates for search-control in Dyna. In: Proceedings of the 28th international joint conference on artificial intelligence (IJCAI 2019). Macau, China, pp 3209–3215

  13. Chen J, Sturtevant N (2019) Conditions for avoiding node re-expansions in bounded suboptimal search. In: Proceedings of the 28th international joint conference on artificial intelligence (IJCAI-2019). Macau, China, pp 1220–1226

  14. Pan Y, Zaheer M, White A, et al. (2018) Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains. In: Proceedings of the 27th international joint conference on artificial intelligence (IJCAI-2018). Stockholm, Sweden, pp 4794–4800

  15. Thirugnanasambandam K, Prakash S, Subramanian V, et al. (2019) Reinforced cuckoo search algorithm-based multimodal optimization. Applied Intelligence 49:2059–2083


  16. Xiong H, Huang L, Yu M, et al. (2020) On the number of linear regions of convolutional neural networks. In: Proceedings of the 37th international conference on machine learning (ICML-2020). Vienna, Austria. arXiv:2006.00978

  17. Corneil D, Gerstner W, Brea J (2018) Efficient model-based deep reinforcement learning with variational state tabulation. In: Proceedings of the 35th international conference on machine learning (ICML-2018). Stockholm, Sweden, pp 1049–1058

  18. Sutton R, Barto A (1998) Reinforcement learning: an introduction. MIT Press, Cambridge


  19. Nilsson N (1998) Artificial intelligence: a new synthesis. Morgan Kaufmann, San Francisco


  20. Montufar G, Pascanu R, Cho K, et al. (2014) On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems 27:2924–2932


  21. Liang S, Srikant R (2017) Why deep neural networks for function approximation? In: Proceedings of the 5th international conference on learning representations (ICLR-2017). Toulon, France

  22. Zhang C, Bengio S, Hardt M, et al. (2017) Understanding deep learning requires rethinking generalization. In: Proceedings of the 5th international conference on learning representations (ICLR-2017). Toulon, France, pp 1–15

  23. Yun C, Sra S, Jadbabaie A (2019) Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity. In: Advances in neural information processing systems 32 (NeurIPS-2019). Vancouver, Canada, pp 15558–15569. arXiv:1810.07770

  24. Raghu M, Poole B, Kleinberg J, et al. (2017) On the expressive power of deep neural networks. In: Proceedings of the 34th international conference on machine learning (ICML-2017). Sydney, Australia, pp 2847–2854

  25. Andoni A, Panigrahy R, Valiant G, Zhang L (2014) Learning polynomials with neural networks. PMLR (Proceedings of Machine Learning Research) 32:1908–1916


  26. Lao Z, Zhou Z, Huang J (2017) Incremental extreme learning machine via fast random search method. Lecture Notes in Computer Science 10634:82–90


  27. Brookshear J (2002) Computer science: an overview. Addison-Wesley Longman Publishing Co., Boston


  28. Nebel B (2000) On the compilability and expressive power of propositional planning formalisms. Journal of Artificial Intelligence Research 12:271–315



Acknowledgements

This work was supported by Liaoning Provincial Natural Science Foundation of China (Grant No. 20170540311).

Author information

Correspondence to Wei Zhang.


Appendices

Appendix A: Proof of Theorem 1

Theorem 1 (restated)

For the p-order multivariate piecewise function h(x(1),…, x(d)) defined by (4) and an arbitrary error ε > 0, let (x(1),…, x(d)) ∈ [0, 1]d, and let wi = (wi1,…, wid) be the coefficient vector of hi(.) that constitutes h(x(1),…, x(d)) and satisfies \(\|\mathbf{w}_{i}\|_{1} \leq C\), where C is a constant. If h(x(1),…, x(d)) is differentiable, then there exists a DNN with O(n + p) layers and O(dn + dnp + dp) neurons, implementing an \(\tilde{h}(x^{(1)}, \ldots, x^{(d)})\) such that

$$ |h(x^{(1)},\ldots{}, x^{(d)}) -\tilde{h}(x^{(1)},\ldots, x^{(d)})| \leq \varepsilon, $$

where \(n = (p+1)\log_{2} C + \log_{2}(dp/\varepsilon)\).

Proof

First, for each component x(k) of the multivariate input vector (x(1),…, x(d)), we can obtain its binary expansion sequence \(\{x_{0}^{(k)}, \ldots, x_{n}^{(k)}\}\) by means of an n-layer DNN with n BSU neurons. Denote the approximate value of x(k) calculated by this sequence as \({\tilde{x}}^{(k)}\); that is,

$$ {\tilde{x}}^{(k)} = x_{0}^{(k)}/2^{0} + x_{1}^{(k)}/2^{1} + \ldots + x_{n}^{(k)}/2^{n} = {\sum}_{i = 0}^{n} x_{i}^{(k)}/2^{i}. $$

According to the definition of the n-bit binary expansion of a decimal fraction, the error between \({\tilde{x}}^{(k)}\) and the original x(k) satisfies \(|x^{(k)} - {\tilde{x}}^{(k)}| \leq 1/2^{n}\). Approximately implementing (i.e., calculating) one x(k) requires n neurons, so implementing all of x(1),…, x(d) in parallel requires nd neurons. Therefore, the first n layers of the DNN in Fig. 3 have nd neurons, which are not displayed due to space limitations.
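As a numerical sanity check of this step, here is a minimal sketch in plain Python; the BSU layer is emulated arithmetically rather than as actual neurons, and the example value of x is invented:

```python
def binary_expansion(x, n):
    """Greedy n-bit binary expansion of x in [0, 1): bits x_0, ..., x_n."""
    bits, r = [], x
    for i in range(n + 1):
        b = 1 if r >= 1.0 / 2 ** i else 0
        bits.append(b)
        r -= b / 2 ** i
    return bits

def reconstruct(bits):
    """x_tilde = sum_i x_i / 2^i, so that |x - x_tilde| <= 1/2^n."""
    return sum(b / 2 ** i for i, b in enumerate(bits))

x, n = 0.7281, 10
assert abs(x - reconstruct(binary_expansion(x, n))) <= 1 / 2 ** n
```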

Second, the DNN recursively calculates \(\tilde{h}(x^{(1)}, \ldots, x^{(d)})\) through two channels to approximate h(x(1),…, x(d)). If the input vector (x(1),…, x(d)) ∈ [0, c1] ×… × [0, cd], the upper channel is switched on (\(\prod \operatorname{sgn}(\cdot) \neq 0\)), and the DNN outputs \(\tilde{h}(x^{(1)}, \ldots, x^{(d)}) = {\prod}_{i = 1}^{p}(w_{i1}{\tilde{x}}^{(1)} + \ldots + w_{id}{\tilde{x}}^{(d)})\) (see (4)); otherwise, the lower channel is turned on, and the result is \(\tilde{h}(x^{(1)}, \ldots, x^{(d)}) = {\prod}_{i = 1}^{p}(w_{i1}^{\prime}{\tilde{x}}^{(1)} + \ldots + w_{id}^{\prime}{\tilde{x}}^{(d)})\).

Consider the case of (x(1),…, x(d)) ∈ [0, c1] ×… × [0, cd]. According to the previous explanation of Fig. 3, the DNN recursively calculates the following terms from left to right in the figure in accordance with (11):

$$ \begin{aligned} & \{f_{1}(x^{(1)},\ldots{}, x^{(d)} ), \tilde{h}_{1}(x^{(1)},\ldots{}, x^{(d)} ) \}, \\ & \{f_{2}(x^{(1)},\ldots{}, x^{(d)} ), \tilde{h}_{2}(x^{(1)},\ldots{}, x^{(d)} ) \}, \\ & \ldots{}\ldots{}, \\ & \{f_{l+1} (x^{(1)},\ldots{}, x^{(d)} ), \tilde{h}_{l+1}(x^{(1)},\ldots{}, x^{(d)} ) \}, \\ & \ldots{}\ldots{} \\ & \{f_{p}(x^{(1)},\ldots{}, x^{(d)} ), \tilde{h}_{p}(x^{(1)},\ldots{}, x^{(d)} ) \}. \end{aligned} $$

where l = 0,…, p − 1, and \({\tilde{h}}_{l+1}(\cdot)\) is an approximation of \(h_{l+1}^{R}(\cdot)\) that is calculated according to (11) by replacing x with \(\tilde{x}\). Define \(h_{l+1}^{R}(x^{(1)}, \ldots, x^{(d)}) \approx \tilde{h}_{l+1}(x^{(1)}, \ldots, x^{(d)}) \triangleq h_{l+1}(\tilde{x}^{(1)}, \ldots, \tilde{x}^{(d)})\) and \(\tilde{f}_{l+1}(x^{(1)}, \ldots, x^{(d)}) \triangleq f_{l+1}(\tilde{x}^{(1)}, \ldots, \tilde{x}^{(d)})\):

$$ \begin{aligned} \tilde{f}_{l+1}(x^{(1)}, \ldots, x^{(d)}) & = f_{l+1}(\tilde{x}^{(1)}, \ldots, \tilde{x}^{(d)}) \\ & = w_{(l+1)1}\,{\tilde{x}}^{(1)} + \ldots + w_{(l+1)d}\,{\tilde{x}}^{(d)} \\ & = w_{(l+1)1}{\sum}_{i = 0}^{n}\frac{x_{i}^{(1)}}{2^{i}} + \ldots + w_{(l+1)d}{\sum}_{i = 0}^{n}\frac{x_{i}^{(d)}}{2^{i}} \\ & = w_{(l+1)1}\max\Big(0, {\sum}_{i = 0}^{n}\frac{x_{i}^{(1)}}{2^{i}}\Big) + \ldots + w_{(l+1)d}\max\Big(0, {\sum}_{i = 0}^{n}\frac{x_{i}^{(d)}}{2^{i}}\Big) \end{aligned} $$
(A-1)
$$ \begin{aligned} h_{l+1}^{R}(x^{(1)}, \ldots, x^{(d)}) & \approx \tilde{h}_{l+1}(x^{(1)}, \ldots, x^{(d)}) = h_{l+1}(\tilde{x}^{(1)}, \ldots, \tilde{x}^{(d)}) \\ & = \Big[\prod_{j = 1}^{d}\operatorname{sgn}(c_{j} - {\tilde{x}}^{(j)})\Big] \cdot f_{l+1}({\tilde{x}}^{(1)}, \ldots, {\tilde{x}}^{(d)}) \cdot {\tilde{h}}_{l}(x^{(1)}, \ldots, x^{(d)}) \\ & = \Big[\prod_{j = 1}^{d}\operatorname{sgn}(c_{j} - {\tilde{x}}^{(j)})\Big] \cdot \Big[{\sum}_{k = 1}^{d}w_{(l+1)k}\max\Big(0, {\sum}_{i = 0}^{n}\frac{x_{i}^{(k)}}{2^{i}}\Big)\Big] \cdot {\tilde{h}}_{l}(x^{(1)}, \ldots, x^{(d)}) \end{aligned} $$
(A-2)

In particular, we define \({\tilde {h}}_{0}\left (x^{(1)}, {\ldots } ,x^{(d)} \right ) = 1.\)

It can be seen from (A-2) that \(\tilde{h}_{l+1}(x^{(1)}, \ldots, x^{(d)})\) is calculated according to (11), but \((x^{(1)}, \ldots, x^{(d)})\) is replaced with \((\tilde{x}^{(1)}, \ldots, \tilde{x}^{(d)})\), so \(\tilde{h}_{l+1}(x^{(1)}, \ldots, x^{(d)})\) is an approximation of \(h_{l+1}^{R}(x^{(1)}, \ldots, x^{(d)})\); that is, \(\tilde{h}_{l+1}(x^{(1)}, \ldots, x^{(d)}) \approx h_{l+1}^{R}(x^{(1)}, \ldots, x^{(d)})\). Note that when \((x^{(1)}, \ldots, x^{(d)})\) falls in [0, c1] ×… × [0, cd], we have \(\prod \operatorname{sgn}(\cdot) = 1\) and \([1 - \prod \operatorname{sgn}(\cdot)] = 0\).

When l + 1 = p, we have \(\tilde{h}_{p}(x^{(1)}, \ldots, x^{(d)}) \approx {h_{p}^{R}}(x^{(1)}, \ldots, x^{(d)}) = h(x^{(1)}, \ldots, x^{(d)})\), which is the desired result (see (10)).
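To make the recursion (A-2) concrete, the following minimal sketch evaluates it in plain Python; the weights and region bounds are invented for illustration, and ordinary arithmetic stands in for the ReLU and BSU neurons:

```python
import math

def bsu(z):
    """Binary step unit standing in for sgn(.): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def h_tilde(x, W, c):
    """Evaluate (A-2) recursively: h_{l+1} = [prod_j sgn(c_j - x_j)] * f_{l+1} * h_l,
    starting from h_0 = 1. Inside R1 = [0,c_1] x ... x [0,c_d] the gate is 1,
    so the result is prod_l (w_l . x); outside, the upper channel outputs 0."""
    gate = 1
    for cj, xj in zip(c, x):
        gate *= bsu(cj - xj)
    h = 1.0                                   # h_0 = 1 by definition
    for w in W:
        f = sum(wk * xk for wk, xk in zip(w, x))
        h = gate * f * h                      # one recursion step of (A-2)
    return h

# d = 2, p = 2; a point inside R1 with bounds c = (0.5, 0.5)
W, c, x = [(0.3, 0.2), (0.1, 0.4)], (0.5, 0.5), (0.25, 0.125)
direct = math.prod(sum(wk * xk for wk, xk in zip(w, x)) for w in W)
assert abs(h_tilde(x, W, c) - direct) < 1e-12
```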

In the first hidden layer of the DNN in Fig. 3, for the binary expansion sequence of each input variable (e.g., the \((x_{0}^{(1)}, \ldots, x_{n}^{(1)})\) of x(1)), there are n ReLU neurons directly transmitting the sequence to the next hidden layer, one ReLU neuron implementing the \(w_{11}\max(0, {\sum}_{i = 0}^{n}\frac{x_{i}^{(1)}}{2^{i}})\) term of (A-2), and one BSU neuron calculating the \(\operatorname{sgn}(c_{1} - {\sum}_{i = 0}^{n}\frac{x_{i}^{(1)}}{2^{i}})\) term. For the d variables \(x^{(1)}, \ldots, x^{(d)}\), there are d such (n + 1 + 1)-neuron groups in the first hidden layer of each channel, so there are 2d groups, and 2d ⋅ (n + 1 + 1) neurons in the two channels.

In addition, there are six neurons behind the first hidden layer that combine the previous sgn(.), \(w_{11}\max(.)\), and \(\tilde{h}_{0}(.)\) (with \(\tilde{h}_{0}(.) = 1\)) according to (11) to form f1(.) and \(\tilde{h}_{1}(.)\). (Note that for convenience, we also count these six neurons as part of the first hidden layer.) These 2d ⋅ (n + 2) + 6 neurons implement \(\tilde{h}_{1}(.)\) and send (x(1),…, x(d)) and \(\tilde{h}_{1}(.)\) to the second hidden layer, which calculates f2(.) and \(\tilde{h}_{2}(.)\) in the same way. By analogy, fp(.) and \(\tilde{h}_{p}(.)\) are finally obtained. The p hidden layers contain [2d ⋅ (n + 2) + 6] ⋅ p neurons. Adding the nd units representing the input variables, the whole DNN contains a total of (n + p) layers and nd + [2d ⋅ (n + 2) + 6] ⋅ p neurons; in short, it has O(n + p) layers and O(nd + dnp + dp) neurons.
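The bookkeeping above can be reproduced mechanically; the sketch below only tabulates the arithmetic, with invented example values of d, p, C, and ε:

```python
import math

def dnn_size(d, p, C, eps):
    """Layer and neuron counts of the construction in Theorem 1."""
    n = math.ceil((p + 1) * math.log2(C) + math.log2(d * p / eps))
    layers = n + p                                 # O(n + p)
    neurons = n * d + (2 * d * (n + 2) + 6) * p    # nd + [2d(n+2)+6]p
    return n, layers, neurons

# e.g., d = 2 variables, p = 3 pieces, C = 2, eps = 0.01
print(dnn_size(2, 3, 2, 0.01))                     # (14, 17, 238)
```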

When the input vector (x(1),…, x(d)) ∉ [0, c1] ×… × [0, cd], a similar argument yields the same conclusion.

Finally, we estimate the error of the DNN approximation:

$$ \begin{aligned} & \left| h(x^{(1)}, \ldots, x^{(d)}) - \tilde{h}(x^{(1)}, \ldots, x^{(d)}) \right| && ~\\ =& \left| h(x^{(1)}, \ldots, x^{(d)}) - h(\tilde{x}^{(1)}, \ldots, \tilde{x}^{(d)}) \right| && (\text{see (A-2)}) \\ =& \left| \frac{\partial h}{\partial x^{(1)}} {\varDelta} x^{(1)} + \frac{\partial h}{\partial x^{(2)}} {\varDelta} x^{(2)} + {\ldots} + \frac{\partial h}{\partial x^{(d)}} {\varDelta} x^{(d)} \right| && (\text{Lagrange's mean value theorem}\\ &&& \text{for multivariable functions}) \\ \\ =& \left| \left( \frac{\partial h}{\partial x^{(1)}}, \frac{\partial h}{\partial x^{(2)}} , {\ldots} , \frac{\partial h}{\partial x^{(d)}} \right) \cdot \left( {\varDelta} x^{(1)}, {\varDelta} x^{(2)}, \ldots, {\varDelta} x^{(d)} \right)\right| && (\text{Rewrite it into a dot product} \\ &&& \text{ of vectors.)} \end{aligned} $$
(A-3)
$$ \begin{array}{@{}rcl@{}} \text{where}~ {\varDelta} x^{(i)} & = & x^{(i)} - \tilde{x}^{(i)} > 0, i=1,\ldots, d,\\ &&\frac{\partial h}{\partial x^{(i)}} = \frac{\partial h(\tilde{x}^{(1)} + \theta {\varDelta} x^{(1)}, \ldots, \tilde{x}^{(d)} +\theta {\varDelta} x^{(d)})}{\partial x^{(i)}}, ~~0\!<\! \theta \!<\! 1 \end{array} $$
(A-4)

Calculate the partial derivative:

$$ \begin{aligned} \frac{\partial h}{\partial x^{(k)}} & = \frac{\partial h(\tilde{x}^{(1)} +\theta {\varDelta} x^{(1)}, \ldots, \tilde{x}^{(d)} +\theta {\varDelta} x^{(d)})}{\partial x^{(k)}} && \text{(change the name of the partial derivative} \\ &&& \text{variable to avoid variable conflict.)} \end{aligned} $$
$$ \begin{aligned} = & \frac{\partial \left\{ \prod\limits_{i=1}^p h_i(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(d)}) \right\}}{\partial x^{(k)}} && (\text{Substitute}~ h(.) ~\text{by (8), the bold}~ \mathbf{x}^{(1)} ~\text{in} \\ &&& \text{the numerator represents } \tilde{x}^{(1)}+ \theta{\varDelta} x^{(1)}, \text{that is,} \\ &&& \mathbf{x}^{(1)} = \tilde{x}^{(1)}+ \theta(x^{(1)} - \tilde{x}^{(1)} ). \text{ Note that in the light}\\ &&& \text{of the mean value theorem, there is } \tilde{x}^{(1)} < \mathbf{x}^{(1)}< x^{(1)} ) \end{aligned} $$
$$ \begin{aligned} = & \frac{\partial \left\{ \prod\limits_{i=1}^p \left[ \prod\limits_{j=1}^d \text{sgn}(c_j -\mathbf{x}^{(j)}) \right] \cdot f_i(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(d)}) \right\}}{\partial x^{(k)}} \\ &\qquad\qquad\qquad\qquad\qquad\qquad ~ (\text{Substitute}~ h_{i}(.) ~\text{by (6), and notice that} \\ &\qquad\qquad\qquad\qquad\qquad\qquad ~ (x^{(1)},\ldots{}, x^{(d)} ) ~\text{falls in}~ [0, c_{1}] \times{\ldots} \times [0, c_{d}], \\ & \qquad\qquad\qquad\qquad\qquad\qquad ~ \text{ so that}~ \lbrack 1 - {\prod}_{j = 1}^{d}{\text{sgn}(c_{j} - \mathbf{x}^{(j)})\rbrack = 0}.) \\ =& \frac{\partial\left\{ {\prod}_{i=1}^p f_i (\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(d)}) \right\}}{\partial x^{(k)}} \qquad\qquad (\text{Note that} {\prod}_{j = 1}^{d}{\text{sgn}(c_{j} - \mathbf{x}^{(j)}) = 1}) \end{aligned} $$
$$ \begin{aligned} &= \frac{\partial f_1(\cdot)}{\partial x^{(k)}} \cdot f_2(\cdot) \cdot {\ldots} \cdot f_p(\cdot) +f_1(\cdot) \frac{\partial f_2(\cdot)}{\partial x^{(k)}} \cdot f_3(\cdot) {\ldots} f_p(\cdot) + {\ldots} + f_1(\cdot) \cdot {\ldots} \cdot f_{p-1}(\cdot) \frac{\partial f_p(\cdot)}{\partial x^{(k)}} \\ & = w_{1k} \theta \cdot {\prod}_{i=2, i\neq 1}^p f_i(\cdot) + w_{2k} \theta \cdot {\prod}_{i=1, i\neq 2}^p f_i(\cdot) +\ldots + w_{pk} \theta \cdot {\prod}_{i=1, i\neq p}^p f_i(\cdot) \end{aligned} $$
$$ \begin{aligned} &\qquad\qquad\qquad\qquad\qquad\qquad\qquad\text{ (Substitute}~ f_{i}\left( \mathbf{x}^{\left( 1 \right)}, {\ldots} ,\mathbf{x}^{\left( d \right)} \right) ~\text{ with } w_{i1} \cdot \mathbf{x}^{(1)} + {\ldots} + w_{\text{id}} \cdot \mathbf{x}^{(d)} ~\text{and} \\ &\qquad\qquad\qquad\qquad\qquad\qquad\qquad\text{ calculate the partial derivative. The only non-zero term is} \\ &\qquad\qquad\qquad\qquad\qquad\qquad\qquad {w}_{jk}\textbf{x} {~}^{(k)} = w_{\text{jk}}(\tilde{x}^{(k)} + \theta(x^{(k)} - \tilde{x}^{(k)} )), ~\text{whose partial derivative is} {w}_{jk} \theta .) \end{aligned} $$
$$ \begin{aligned} & = \sum\limits_{j=1}^p w_{jk} \theta \cdot {\prod}_{i=1, i\neq j}^p f_i(\cdot) \\ & = \sum\limits_{j=1}^p w_{jk} \theta \cdot {\prod}_{i=1, i\neq j}^p \left( w_{i1} \mathbf{x}^{(1)} + w_{i2} \mathbf{x}^{(2)} +\ldots + w_{id} \mathbf{x}^{(d)} \right)\\ &\qquad\qquad\qquad\qquad\qquad\text{(Substitute}~ f_{i}\left( \mathbf{x}^{\left( 1 \right)}, {\ldots} ,\mathbf{x}^{\left( d \right)} \right) \text{ with } w_{i1} \cdot \mathbf{x}^{(1)} + {\ldots} + w_{id} \cdot \mathbf{x}^{(d)}) \end{aligned} $$

Consider the absolute value of the partial derivative:

$$ \begin{aligned} & \left| \frac{\partial h(\tilde{x}^{(1)} +\theta {\varDelta} x^{(1)}, \ldots, \tilde{x}^{(d)} +\theta {\varDelta} x^{(d)})}{\partial x^{(k)}} \right| = \left| \sum\limits_{j=1}^p w_{jk} \theta \cdot {\prod}_{i=1, i\neq j}^p \left( w_{i1} \mathbf{x}^{(1)} + w_{i2} \mathbf{x}^{(2)} +\ldots + w_{id} \mathbf{x}^{(d)} \right) \right| \\ \leq & \sum\limits_{j=1}^p \left| w_{jk} \right| \cdot \theta \cdot {\prod}_{i=1, i\neq j}^p \left( \left|w_{i1} \mathbf{x}^{(1)} \right| +\ldots + \left| w_{id} \mathbf{x}^{(d)} \right| \right) \\ \leq & \sum\limits_{j=1}^p \left| w_{jk} \right| \cdot \theta \cdot {\prod}_{i=1, i\neq j}^p \left( \left|w_{i1} \right| +\ldots + \left| w_{id} \right| \right) \qquad \text{(Note that $(x^{(1)},\ldots{}, x^{(d)}) \in [0,c_{1}] \times \ldots \times [0,c_{d}]$ and $0<\mathbf{x}^{(j)}<x^{(j)}<c_{j}<1$.)} \\ \leq & \sum\limits_{j=1}^p \left| w_{jk} \right| \cdot \theta \cdot C^p \qquad \text{(According to the conditions of Theorem 1, $\|\mathbf{w}_{i}\|_{1}\leq C$.} \end{aligned} $$

The meaning of C is typically the radius of the norm ball (see note 32) in which the parameters wi1,…, wid are constrained to lie.)

$$ \begin{aligned} \leq & ~ pC \cdot \theta\cdot C^{p} \qquad\qquad~ (\|\mathbf{w}_{i}\|_{1}\leq C) \\ = & ~ p \cdot{} \theta\cdot{} C^{p+1} \\ \leq & ~ p\cdot{} C^{p+1} \qquad\qquad\quad \text{(In the light of the mean value theorem, $0 < \theta < 1$.)} \qquad\qquad \end{aligned} $$
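This bound can be probed numerically; the rough sketch below uses finite differences in place of the exact partial derivative and takes C ≥ 1, so that \(pC^{p+1}\) dominates the derivative of the product form:

```python
import random

def h(x, W):
    """h(x) = prod_i (w_i . x), the product form on the active region."""
    out = 1.0
    for w in W:
        out *= sum(wk * xk for wk, xk in zip(w, x))
    return out

def check_gradient_bound(d=3, p=4, C=1.5, trials=200, step=1e-6):
    bound = p * C ** (p + 1)
    for _ in range(trials):
        # random coefficient vectors with ||w_i||_1 <= C, inputs in (0, 1)
        W = []
        for _ in range(p):
            raw = [random.random() for _ in range(d)]
            scale = C * random.random() / sum(raw)
            W.append([r * scale for r in raw])
        x = [random.random() for _ in range(d)]
        for k in range(d):
            xp = list(x)
            xp[k] += step
            deriv = (h(xp, W) - h(x, W)) / step
            assert abs(deriv) <= bound + 1e-6

check_gradient_bound()
```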

Therefore, following (A-3), the error estimation can be obtained as follows.

$$ \begin{aligned} & \left| h(x^{(1)}, \ldots, x^{(d)}) - \tilde{h}(x^{(1)}, \ldots, x^{(d)}) \right| = \left| \left( \frac{\partial h}{\partial x^{(1)}}, \frac{\partial h}{\partial x^{(2)}} , {\ldots} , \frac{\partial h}{\partial x^{(d)}} \right) \cdot \left( {\varDelta} x^{(1)}, {\varDelta} x^{(2)}, \ldots, {\varDelta} x^{(d)} \right)\right| \\ \leq & \left\| \left( \frac{\partial h}{\partial x^{(1)}}, \frac{\partial h}{\partial x^{(2)}} , {\ldots} , \frac{\partial h}{\partial x^{(d)}} \right) \right\|_{2} \cdot \left\| \left( {\varDelta} x^{(1)}, {\varDelta} x^{(2)}, \ldots, {\varDelta} x^{(d)} \right)\right\|_{2} \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad (Cauchy Schwarz inequality) } \end{aligned} $$
$$ \begin{aligned} =& \left\{ \left( \frac{\partial f}{\partial x_{1}} \right)^{2} + \ldots + \left( \frac{\partial f}{\partial x_{d}} \right)^{2} \right\}^{1/2} \cdot \left\| \left( {\varDelta} x^{(1)}, \ldots, {\varDelta} x^{(d)} \right) \right\|_{2} \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad (According to the definition of} L^{2} \text{norm}, \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad calculate the first} L^{2} \text{norm. Combined with the} \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad previous discussion:} \frac{\partial f}{\partial x_{k}}\leq pC^{p+1}, \text{we have)} \end{aligned} $$
$$ \begin{aligned} \leq & \{ p^{2} C^{2(p+1)} +\ldots{}+p^{2} C^{2(p+1)} \}^{1/2} \cdot \left\| \left( {\varDelta} x^{(1)}, \ldots, {\varDelta} x^{(d)} \right) \right\|_{2} \\ & \quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad~~\text{ (The total number of} (\mathrm{p}^{2} \mathrm{C}^{2(p+1)} ) \text{terms is} {d}.) \\ = & pC^{p+1} \cdot \sqrt{d} \cdot \left\| \left( {\varDelta} x^{(1)}, \ldots, {\varDelta} x^{(d)} \right) \right\|_{2} \end{aligned} $$
(A-5)

Calculate the second L2 norm \(\left \| \left ({\varDelta } x^{(1)}, \ldots , {\varDelta } x^{(d)} \right ) \right \|_{2}\):

$$ \begin{aligned} & \left\| \left( {\varDelta} x^{(1)}, \ldots, {\varDelta} x^{(d)} \right) \right\|_{2} = \sqrt{ \left( {\varDelta} x^{(1)} \right)^{2} + {\ldots} +\left( {\varDelta} x^{(d)}\right)^{2} } , \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad (By the definition of the \textit{n}-bit binary} \\ & \text{\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad expansion of fractional number, there is)} \end{aligned} $$

\({\varDelta} x^{(i)} = x^{(i)} -\tilde{x}^{(i)} \leq 1/2^{n}\) for i = 1,…, d, so we have

$$ \begin{aligned} & \left\| \left( {\varDelta} x^{(1)}, \ldots, {\varDelta} x^{(d)} \right) \right\|_{2} = & \sqrt{\left( \frac{1}{2^{n}} \right)^{2} +\ldots + \left( \frac{1}{2^{n}} \right)^{2} } = & ~~\sqrt{d} \cdot \left( \frac{1}{2^{n}} \right) \end{aligned} $$
(A-6)

Combining this formula (for the second \(L^{2}\) norm) with (A-5), we get:

$$ \left| h(x^{(1)}, \ldots, x^{(d)}) - \tilde{h}(x^{(1)}, \ldots, x^{(d)}) \right| \!\leq\! pC^{p+1}\cdot{} d \cdot{} \left( \frac{1}{2^{n}} \right). $$

Setting the right-hand side equal to ε, i.e., \(\varepsilon = pC^{p+1}\cdot{} d \cdot{} \left(\frac{1}{2^{n}} \right)\), gives \(n = (p+1) \log_{2} C + \log_{2} \left(\frac{pd}{\varepsilon} \right)\). Therefore, we can choose this n to make \(\tilde{h}(x^{(1)}, \ldots, x^{(d)})\) approach h(x(1),…, x(d)), satisfying

$$ \left| h(x^{(1)}, \ldots, x^{(d)}) - \tilde{h}(x^{(1)}, \ldots, x^{(d)}) \right| \leq \varepsilon. $$

Similarly, when the input vector (x(1),…, x(d))∉[0, c1] ×… × [0, cd], the lower channel is turned on, and the DNN implements \(\tilde {h}\)(x(1),…, x(d) ) =\({\prod }_{i = 1}^{p}{(w_{i1}^{\prime }{\tilde {x}}^{(1)}} + {\ldots } + w_{\text {id}}^{\prime }{\tilde {x}}^{(d)})\). Under the same value of \(n, \tilde {h}\)(x(1),…, x(d) ) still achieves an ε-approximation to h(x(1),…, x(d)).

In summary, for the p-order multivariate piecewise function h(x(1),…, x(d) ) described in this theorem, there exists a DNN with O(n + p) layers and O(dn + dnp + dp) neurons, which implements an \(\tilde {h}(x^{(1)}, {\ldots } ,x^{(d)})\), satisfying:

$$ |h(x^{(1)},\ldots{}, x^{(d)}) - \tilde{h}(x^{(1)}, {\ldots} ,x^{(d)})| \leq \varepsilon, $$

where \(n=(p+1) \log_{2} C + \log_{2}(dp/\varepsilon)\). □

Appendix B: Proof of Theorem 2

Proof

The proof of this theorem comprises two parts.

(1) In each problem-solving episode employing ALGO 4, no nodes are re-expanded.

The formal proof of this part is rather long; therefore, only the main idea (proof by contradiction) is included here for brevity. Assume that a node n0 is expanded repeatedly when the propagational 𝜖A algorithm is applied. Then there must exist at least two paths, path1 and path2, from the initial node s to n0; assume length(path1) > length(path2). Suppose n0 is first expanded through path1 (and the child node of n0, denoted n1, is placed on the OPEN list). Subsequently, nodes are expanded through path2 from s to n0's parent node xk on that path. Thus, a shorter path (path2) connecting s and n0 is discovered, n0 is placed back on the OPEN list, and the backward pointer of n0 is redirected to xk so that n0 can be expanded once again. Because the algorithm explores path1 first, there must exist a node xi on path2 whose f-value exceeds the f-values of all nodes (including n0) on path1; that is, f(xi) > f(n0). Because the proposed search algorithm propagates heuristic information bidirectionally (ensuring that all nodes in the search tree satisfy h(parent) ≤ h(child) + c(parent, child)), the next time it reaches n0 through path2, it propagates the larger f(xi) value to xk and n0 (and its child node n1) along path2 such that f(xi) = f(xk) = f(n0) = f(n1). At this moment, both n0 and n1 remain on the OPEN list with f(n0) = f(n1) and g(n0) < g(n1), where g(n0) denotes the known distance from s to n0. Under this condition, the proposed algorithm expands n1 first, not n0. This contradicts the assumption, so n0 is not expanded again. Therefore, in the worst case of a problem-solving episode employing the propagation-enabling 𝜖A algorithm, all nodes with f-values smaller than the length f(s, t) of the optimal solution path are expanded only once, and the maximum time complexity equals O(M). In contrast, the maximum time complexity of the ordinary A* algorithm equals \(\mathrm{O}\left(2^{\mathrm{M}}\right)\) [3], whereas that of the simple modified A* algorithm (ALGO B) equals \(\mathrm{O}\left(\mathrm{M}^{2}\right)\).
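Two ingredients of this argument admit a direct sketch: the tie-breaking rule (equal f-values broken in favor of the larger g) and the bidirectional propagation maintaining h(parent) ≤ h(child) + c(parent, child). The node fields below (f, g, h, parent, children, cost) are assumed for illustration; Step 7 of ALGO 4 remains the authoritative rule:

```python
def pop_best(open_list):
    """Expansion order used in part (1): smallest f first; among equal
    f-values, prefer the larger g (the deeper node), so an ancestor such
    as n0 is not re-expanded before its descendant n1."""
    best = min(open_list, key=lambda node: (node.f, -node.g))
    open_list.remove(best)
    return best

def propagate(node):
    """Bidirectional h-propagation sketch mirroring the consistency
    condition h(parent) <= h(child) + c(parent, child); node.cost is the
    (assumed) edge cost from a node's parent to the node."""
    parent = node.parent
    while parent is not None:                 # upward: raise ancestors' estimates
        parent.h = max(parent.h,
                       min(ch.cost + ch.h for ch in parent.children))
        parent = parent.parent
    for ch in node.children:                  # downward: repair violated children
        ch.h = max(ch.h, node.h - ch.cost)
```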

(2) The total time complexities of ALGO 4 and ALGO 3 equal O(M2) and \(O(M^{3} \cdot \sqrt {M})\), respectively.

The time complexity of an RL algorithm is determined by two factors: (1) the time required for each problem-solving episode and (2) the total number of episodes required to achieve convergence. The per-episode time complexity of the different algorithms is discussed above; this part therefore estimates the number of episodes each algorithm requires to converge.

For any RL algorithm, the number of episodes required to achieve convergence (i.e., the episode count) depends on factors such as the structure of the problem space and the accuracy of the initial heuristic function. Following the method Korf used to study the performance of the LRTA* algorithm on the maze problem with no obstacles [2], assume that the state space of the maze problem comprises a matrix of N × N unit squares (denoted M = N × N), that the initial state s is placed in the upper-left corner of the matrix and the target state t in the lower-right corner, and that the initial heuristic function is h(.) = 0. It follows that ordinary V-Learning requires \(\mathrm{O}\left(\mathrm{N}^{3}\right)\) (i.e., \(\mathrm{O}(\mathrm{M} \cdot \sqrt{\mathrm{M}})\)) episodes to achieve convergence. This total episode count can be deduced as follows.

In the given state space, each state has a shortest path to the target, and there are no obstacles in the matrix. Thus, the solution path from the state s in the upper-left corner to the target state t has length 2N − 2; among all states, s is the furthest from t. In contrast, the solution path from the state in the lower-right corner to the target has length zero, the shortest possible. The average length of the optimal solution paths over all states is N, and there are N² states in the state space. Assume that the simple modified A* algorithm is applied in each problem-solving episode and that V-Learning-based RL is adopted to learn the heuristic function h(.). When the maze problem is solved using RL the first time for a state x whose shortest solution path has length N, the heuristic function h(.) is partially learned: \(h\left(x_{N-1}\right)=h^{*}\left(x_{N-1}\right)=1\) for the state xN−1 on the solution \(path(x,t)=\left\{\mathrm{x}=\mathrm{x}_{0}, \mathrm{x}_{1}, \ldots, \mathrm{x}_{\mathrm{N}}=\mathrm{t}\right\}\). Similarly, when the problem is solved the second time, \(h\left(x_{N-2}\right)=h^{*}\left(x_{N-2}\right)=2\) for xN−2 on path(x, t). Eventually, when the problem is solved the Nth time, \(h\left(x_{0}\right)=h^{*}\left(x_{0}\right)=N\). That is, to learn the heuristic values h(.) = h*(.) of all nodes on a given path, the required number of episodes equals the length of the path. (In contrast, the propagation-enabling 𝜖A algorithm needs to solve the problem only once to learn the h(.) of a whole path.) Therefore, when the simple modified A* algorithm is used in each problem-solving episode (ALGO 3), the number of episodes required for convergence is given by

$$ \begin{aligned} \text{Total episode count for ALGO 3} & = \text{average length of solution paths} \times \text{number of states} \\ & = N \times N^{2} = O\left(N^{3}\right) = O(M \cdot \sqrt{M}) \end{aligned} $$
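The counting argument itself can be mimicked with a toy tally; this sketch (with an invented N) reproduces only the arithmetic of the argument, not the search algorithms:

```python
def episodes_per_path(path_len, propagating):
    """Non-propagating V-Learning fixes one more h-value per episode
    (h(x_{N-1}) after episode 1, ..., h(x_0) after episode N), so a path
    of length N costs N episodes; propagation learns it in one episode."""
    return 1 if propagating else path_len

N = 10                                  # obstacle-free N x N maze
states, avg_len = N * N, N              # N^2 states, average path length N
algo3_total = states * episodes_per_path(avg_len, propagating=False)  # N^3 = 1000
algo4_total = states * episodes_per_path(avg_len, propagating=True)   # N^2 = 100
print(algo3_total, algo4_total)
```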

For the maze problem examined in this study, the state space S comprises a matrix of N × N (= M) unit squares with barriers (Fig. 4). S is divided into eight regions \(\left(\mathrm{R}_{1}, \ldots, \mathrm{R}_{8}\right)\) by the barriers; thus, S = R1 ∪ R2 ∪… ∪ R8. For any state x ∈ R7 ∪ R8, the solution path is longest (equal to N + N/2) when x is located in the upper-left corner and shortest (zero) when x is located in the lower-right corner. Therefore, the average solution-path length is L = 3N/4. The number of states in R7 ∪ R8 is approximately N × N/2 = N²/2. Because the proposed algorithm (ALGO 4) adopts the propagation-enabling 𝜖A algorithm in each problem-solving episode, a single episode makes h(xi) equal h*(xi) for all nodes xi on the solution \(path(x,t)=\left\{\mathrm{x}=\mathrm{x}_{0}, \mathrm{x}_{1}, \ldots, \mathrm{x}_{\mathrm{L}}=\mathrm{t}\right\}\). For a detailed description of the propagation rules, refer to Step 7 of ALGO 4 and Section 2.2.2. Accordingly, in the sub-space R7 ∪ R8, the total number of problem-solving episodes required to achieve convergence is approximately (episodes required per solution path; see note 33) × (number of states) = \(1 \times \left(\mathrm{N}^{2} / 2\right)=\mathrm{N}^{2} / 2\).

A similar analysis applies to the sub-space R1 ∪… ∪ R6: when x ∈ R1 ∪… ∪ R6, the average length of the solution path(x, t) is ((4N − 3) + (N/2 + N − 3))/2 = 11N/4 − 3. The propagation-enabling 𝜖A algorithm still yields \(\mathrm{h}\left(\mathrm{x}_{\mathrm{i}}\right)=\mathrm{h}^{*}\left(\mathrm{x}_{\mathrm{i}}\right)\) for all nodes xi on path(x, t) = \(\left\{\mathrm{x}=\mathrm{x}_{0}, \mathrm{x}_{1}, \ldots, \mathrm{x}_{\mathrm{L}}=\mathrm{t}\right\}\) through a single problem-solving episode. In R1 ∪… ∪ R6, the number of states is approximately N²/2 (neglecting the states occupied by barriers). Thus, the number of episodes required by the propagation-enabling RL in R1 ∪… ∪ R6 equals (episodes required per solution path) × (number of states) = 1 × N²/2 = N²/2. Hence, for the state space S of the maze problem with barriers, ALGO 4 requires the following total number of episodes to achieve h(.) = h*(.) for all states.

$$ \begin{aligned} \text{Total episode count for ALGO 4} & = (\text{episode count in } R_{7} \cup R_{8}) + (\text{episode count in } R_{1} \cup {\ldots} \cup R_{6}) \\ & = N^{2}/2 + N^{2}/2 = N^{2} = O\left(N^{2}\right) = O(M) \end{aligned} $$

A similar analysis of ALGO 3 in the state space S of the maze problem with barriers shows that the total episode count required equals \(3 \mathrm{N}^{3}-3 \mathrm{N}^{2}=\mathrm{O}\left(\mathrm{N}^{3}\right)=\mathrm{O}(\mathrm{M} \cdot \sqrt{\mathrm{M}})\). The conclusion of this theorem follows by multiplying the number of episodes required for convergence under ALGO 3 and ALGO 4 by their corresponding time complexities for a single problem-solving episode. □

Cite this article

Zhang, W. A learning search algorithm with propagational reinforcement learning. Appl Intell 51, 7990–8009 (2021). https://doi.org/10.1007/s10489-020-02117-0