Dynamic path learning in decision trees using contextual bandits

We present a novel online decision-making framework in which the optimal path of a given decision tree is found dynamically through contextual bandit analysis. At each round, the learner finds a path in the decision tree by making a sequence of decisions following the tree structure and receives an outcome when a terminal node is reached. At each decision node, the environment information is observed as a hint on which child node to visit to obtain a better outcome. The objective is to learn the context-specific optimal decision at each decision node so as to maximize the accumulated outcome. In this paper, we propose the Dynamic Path Identifier (DPI), a learning algorithm in which a contextual bandit is applied to every decision node and the observed outcome is used as the reward of the preceding decisions of the same round. The main technical difficulty of DPI is the high exploration cost caused by the width (i.e., the number of paths) of the tree as well as the large context space. We mathematically prove that DPI's regret per round approaches zero as the number of rounds approaches infinity, and that the regret is not a function of the number of paths in the tree. Numerical evaluations are provided to complement the theoretical analysis.


Introduction
Multi-staged strategic decision-making in heterogeneous scenarios (e.g., management, business investment, e-commerce, and strategic games) requires agents to utilize the current environment to estimate the long-term outcome at each step of the decision-making process. Oftentimes, the agent can utilize a pre-constructed decision tree [1] (e.g., Figure 1) to clarify this process, where the relationships among the decisions are characterized by the topological structure of the tree. At each decision node (the square nodes in Figure 1), an action is chosen (i.e., a child node is selected) based on the observed environment information, which is referred to as the context. Each terminal node (the triangle nodes) marks the end of a decision-making process, at which an outcome is observed. One decision-making process (i.e., from the root to a terminal node) is referred to as a round. An interesting observation is that in heterogeneous scenarios, the context at each decision node provides valuable information about which action will most likely lead to a larger outcome at the terminal node. Therefore, learning the relationship between the context of each decision node and the final outcome of the terminal node can significantly improve the decision-making process.
Inspired by strategic decision-making, we present and formulate a novel online decision-making problem where we aim to dynamically find the context-specific optimal path of a given tree to maximize its outcome. This is achieved by learning the optimal context-specific action for each decision node over a series of rounds while minimizing the regret. At each round, a learner makes a sequence of decisions following the structure of the decision tree: starting from the root, the learner observes a context whenever a decision node is reached. The value of the context can assist the learner in choosing a decision that may result in a better outcome. The learner repeats this process following the structure of the tree until a terminal node is reached.
Although the decision tree approach has been extensively studied over the years [2][3][4], none of the existing work can directly solve the aforementioned problem. Most studies focus on learning a decision tree model offline from a given dataset with several variables, which is fundamentally a classification problem. This branch of study, referred to as decision tree learning, is different from ours. Another line of study focuses on the scenario where the topological structure of the tree is known while the optimal path is not deterministic due to randomness [5,6]. Some studies identify the upper confidence bound of each action at each decision node [7]. This approach is not optimal as it does not consider the environment information. In other studies, chance nodes are introduced to indicate the likelihood of the environment information. However, the likelihood is oftentimes unknown, and the environment information cannot always be represented by chance nodes (e.g., continuous environment information).
Fig. 1 A business decision tree example where the agent decides the selling strategy. At each decision node, the agent observes the current business status to modify the price. The final profit is obtained at the terminal nodes. Note that we allow the decision nodes to have different degrees
Therefore, in our study, we observe the environment information as the context and learn its likelihood so that the path can be found dynamically at each decision node; to the best of our knowledge, this is the first such study in the literature. Finding the optimal path of a decision tree using contextual bandits leads to the technical difficulty of a high exploration cost. Many decision trees have complex structures with an exponential number of paths to explore, potentially increasing the regret during exploration. Moreover, at each decision node there exists a large number of possible contexts, caused by the heterogeneity of the application.
To this end, we propose the Dynamic Path Identifier (DPI), an online learning algorithm based on the contextual bandit approach to dynamically find the context-specific optimal path of a given decision tree. Whenever a decision node is reached, DPI promptly decides which child node to go to based on the observed context. As the number of rounds increases, DPI learns the optimal decision at each decision node while addressing the aforementioned challenge. We mathematically prove that the regret of DPI per round approaches zero as the number of rounds approaches infinity: for F rounds, the regret of DPI is no more than $O\big((\log F)^{1/3} F^{2/3} DC\big)$, where D and C are the depth of the tree and the maximum degree respectively. This indicates that even if a tree has a large number of paths, the regret only depends on D and C. We conduct a series of simulations to demonstrate the robustness and effectiveness of DPI compared to typical MAB schemes under a variety of environments. Furthermore, the theoretical performance bound is corroborated by our numerical evaluation.

Multi-armed bandits (MAB) and contextual bandit
Multi-armed bandits (MAB) is an efficient decision-making model where a learner simultaneously explores and exploits a set of arms with the aim of maximizing its accumulated reward over a series of trials [8][9][10]. Because of its lightweight implementation and provable performance, MAB has been used to solve many practical problems, such as advertising [11], news article recommendation [12], and peer assessment [13].
MAB algorithms are very versatile and can be adapted to a variety of scenarios. For example, some studies assume a linear relationship between the actions and the reward [14][15][16], which is not applicable in our study; some studies investigate rotting bandits, where the expected reward of each arm decays as a function of the number of times it has been selected [17,18]; sleeping bandits [19,20] investigate the scenario where the actions may not be available at all times; continuous-armed bandits [21][22][23] allow the application to have an infinite number of actions; cascading bandits [24,25] allow the user to select the first attractive item out of a list of items.
Among all the variations, the contextual bandit is the most relevant to our study. One category of contextual bandit problems assumes a linear relationship among the contexts, actions, and rewards [26][27][28][29][30], which does not match the assumption in our study. To implement contextual bandits in more general scenarios, many studies adopt policy classes [31][32][33] to map the context space to the action space. However, the computational cost of each oracle call would be too large if we applied the same approach in our work. Another branch of study assumes a Lipschitz condition on the expected rewards, which is the approach adopted by our study. The standard Lipschitz contextual bandit was first introduced in [34]; [35,36] further reduce the regret by adopting a zooming algorithm that adaptively partitions the context space. To the best of our knowledge, DPI is the first to find the optimal path in a decision tree using contextual bandits.

Decision making in complex networks
Another line of study similar to our work investigates optimality in complex networks. In particular, network representation learning [37][38][39] aims to learn a vector representation for each node in a complex network to solve network analysis problems, such as link prediction and node classification. Cai et al. [37] focus on learning a robust node representation with adaptive Laplacian smoothing by developing an autoencoder. Li et al. [38] propose a novel deep attributed network that emphasizes capturing the coupling and interaction information in complex networks. Yang et al. [39] address the exploration limitation of existing Heterogeneous Information Network (HIN) methods by designing a novel Heterogeneous Graph Convolutional Network to learn the representations. Cai et al. [40] consider a real-life scenario where they aim to optimize targeted advertisements in a spatial social network by considering influence maximization.
However, our study differs from this line of work because decision making in a tree is sequential and has temporal dependency among decisions. Specifically, the decisions made at the decision nodes closer to the root of the tree affect the outcome observed at the leaf nodes. The aforementioned studies on complex networks do not have this dependency among decisions; therefore, their solutions cannot be directly applied to our problem. To the best of our knowledge, our study is the first to dynamically find the optimal decision at each decision node based on the context observed at that node.

Problem formulation
We now formally state the problem formulation of our study. Following a standard approach, we assume there are F rounds in total, which is fixed and known in advance. This assumption is supported by the well-known doubling trick [41], which converts a bandit algorithm with a fixed time horizon into one with an infinite horizon. A table of notations is provided in Table 1.
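To illustrate the doubling trick, the following is a minimal Python sketch under our own naming (run_fixed_horizon and max_epochs are illustrative, not from the paper): a fixed-horizon learner is simply restarted on epochs of doubling length, which yields an anytime algorithm at the cost of a constant factor in the regret.

    def run_with_doubling(run_fixed_horizon, max_epochs=20):
        # Restart the fixed-horizon learner on epochs of length 1, 2, 4, ...
        # Each restart discards the previous epoch's statistics.
        for epoch in range(max_epochs):
            F = 2 ** epoch              # horizon assumed by the current epoch
            run_fixed_horizon(F)        # run the fixed-horizon bandit algorithm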

Decision tree
Let G = (N, E) represent the known structure of the tree. $N = \{N_{d,k} \mid d = 1, \dots, D,\; k = 1, \dots, K_d\}$ is the node set, where D is the depth of the tree and $K_d$ is the total number of nodes at height $d = 1, \dots, D$. We use the label $k = 1, \dots, K_d$ to uniquely identify each node at height d. E is the edge set. Let $\mathcal{N}^d$ and $\mathcal{N}^t$ contain all the decision nodes and terminal nodes respectively; we have $\mathcal{N}^d \cup \mathcal{N}^t = N$. For each decision node $N_{d,k} \in \mathcal{N}^d$, let the set $C_{d,k}$ contain all the child nodes of $N_{d,k}$. Let $A_{d,k}$ be the set that contains all the possible actions of $N_{d,k}$, indicating which child node to visit; we have $A_{d,k} = \{1, \dots, |C_{d,k}|\}$. Let the function $\phi(N_{d,k}, a)$ return the label of the child node of $N_{d,k}$ after taking action a. Let C be the maximum degree among all decision nodes (i.e., the maximum number of children); we have $C = \max_{N_{d,k} \in \mathcal{N}^d} |C_{d,k}|$.

Table 1 Notations
$\mathcal{N}^d$, $\mathcal{N}^t$, N: the decision node set, the terminal node set, and the node set
$N_{d,k}$: a node with depth d and label k
$C_{d,k}$: the child node set of $N_{d,k}$
$A_{d,k}$: a set that contains all the possible actions of $N_{d,k}$
$\phi(N_{d,k}, a)$: finds the label of the child node of $N_{d,k}$ after taking a
$x_{i,d,k}$: the context observed at node $N_{d,k}$ at round i
$\mathbf{x}_i$: the context vector drawn at round i
$\boldsymbol{\pi}_i$: the outcome vector drawn at round i
$\pi_{d,k}$: the outcome at any terminal node $N_{d,k} \in \mathcal{N}^t$
$a_{i,d,k}$: the chosen action at node $N_{d,k}$ at round i
$P_i$: the path selected at round i
$\hat{k}_{i,d}$: the label of the chosen node at height d at round i
$\hat{\pi}_i$: the outcome observed at the terminal node of $P_i$
$\mu_{d,k}(x, a)$: the expected context-specific reward of taking a at $N_{d,k}$
$a^*_{d,k}$: the context-specific optimal action at node $N_{d,k}$
$\mu^*_{d,k}$: the optimal expected reward at node $N_{d,k}$
$\Delta_{d,k}(x_{i,d,k}, a_{i,d,k})$: the deviation at node $N_{d,k}$ at round i
R(F): the regret up to F rounds
B.node: the node that segment B belongs to
B.a: the action that segment B contains
$B.\rho = (y, a)$: the center (i.e., the representative point) of segment B
B.r: the radius of segment B
$\mu(B)$: the expected reward of B
$B_{i,d}$: the segment selected at height d at round i
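As a concrete illustration of this notation, the following Python sketch represents the tree structure; the class and method names are ours, not the paper's:

    class Node:
        """A tree node N_{d,k}, identified by its depth d and label k."""
        def __init__(self, d, k, children=None):
            self.d, self.k = d, k
            self.children = children or []   # labels of child nodes at height d + 1

        @property
        def is_terminal(self):
            return not self.children         # terminal nodes have no children

        def phi(self, a):
            # phi(N_{d,k}, a): label of the child reached by action a in A_{d,k}
            return self.children[a - 1]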

Context, outcome and reward
At any round $i = 1, \dots, F$, let $x_{i,d,k}$ be the context observed at node $N_{d,k}$. We assume $x_{i,d,k}$ is continuous and $x_{i,d,k} \in [0, 1]$. The observed context at each node follows an unknown distribution, which may depend on the previously observed contexts. Without loss of generality, we assume that before each round, the contexts for all decision nodes are drawn together from a fixed but unknown distribution. We let $\mathbf{x}_i$ denote the vector of contexts to be revealed at the decision nodes $N_{d,k} \in \mathcal{N}^d$. The vector $\mathbf{x}_i$ is independent and identically distributed (i.i.d.) with respect to i, while the elements within the vector may be dependent.
At round i, whenever the learner reaches a decision node $N_{d,k} \in \mathcal{N}^d$, $x_{i,d,k}$ is revealed to assist the learner in choosing a child node. Let $a_{i,d,k}$ be the action that the learner chooses at node $N_{d,k}$. Once a terminal node is reached, a path $P_i$ is determined. Let $d_i$ be the height of path $P_i$, and let $\hat{k}_{i,d}$ be the label of the chosen node at height d. Once a path is completed, the contexts and outcomes of the remaining nodes are not revealed. Let the observed outcome of round i be $\hat{\pi}_i$, which is the outcome observed at the terminal node of $P_i$. We use the outcome vector $\boldsymbol{\pi}_i \sim \Pi(\mathbf{x}_i)$ to evaluate the performance of all the actions in the tree by introducing the rewards of the actions. We define $\mu_{d,k}(x_{i,d,k}, a)$ as the expected context-specific reward of taking action a at decision node $N_{d,k} \in \mathcal{N}^d$; its value indicates the maximum expected outcome the learner can obtain by taking action a at $N_{d,k}$. In particular, if after taking action a, node $N_{d,k}$ transitions to a terminal node, then $\mu_{d,k}(x_{i,d,k}, a)$ is the expected outcome at that terminal node. The context-specific optimal action is $a^*_{d,k}(x_{i,d,k}) = \arg\max_{a \in A_{d,k}} \mu_{d,k}(x_{i,d,k}, a)$, and the optimal expected reward is defined as $\mu^*_{d,k}(x_{i,d,k}) = \max_{a \in A_{d,k}} \mu_{d,k}(x_{i,d,k}, a)$. Ideally, we aim to choose the context-specific optimal action at every decision node. However, the expected context-specific rewards are unknown and have to be learned gradually from observations. For any node $N_{d,k} \in P_i$, the realized reward of taking action $a_{i,d,k}$ is $\pi_{i,d,k}(x_{i,d,k}, a_{i,d,k}) = \hat{\pi}_i$. Note that the realized rewards of the decision nodes outside the selected path remain unknown.

The objective
The problem of finding the context-specific optimal path can be formulated as a multi-staged contextual bandit problem, where we find the context-specific optimal action for every decision node given its context. Whenever the learner arrives at a decision node, there exists a context-specific optimal action ($a^*_{d,k}(x_{i,d,k})$) and an action chosen by the learner ($a_{i,d,k}$). If the chosen action is non-optimal, a deviation is incurred. We define the deviation at node $N_{d,k}$ at round i as $\Delta_{d,k}(x_{i,d,k}, a_{i,d,k}) = \mu^*_{d,k}(x_{i,d,k}) - \mu_{d,k}(x_{i,d,k}, a_{i,d,k})$. We define the regret up to F rounds as the accumulation of the deviations of non-optimal actions.

Definition 2
The regret up to F rounds is defined as the accumulation of the deviations of non-optimal actions:
$$R(F) = \mathbb{E}\left[ \sum_{i=1}^{F} \sum_{N_{d,k} \in P_i} \Delta_{d,k}(x_{i,d,k}, a_{i,d,k}) \cdot \mathbb{1}\{a_{i,d,k} \neq a^*_{d,k}(x_{i,d,k})\} \right], \tag{3}$$
where $\mathbb{1}\{\cdot\}$ is an indicator function.
Specifically, the regret represents the overall deviation of all the non-optimal actions taken in F rounds, which is the sum of the deviations observed at each chosen decision node. With DPI, we prove that the regret per round (i.e., R(F)/F) asymptotically approaches zero as F → ∞.

DPI scheme
In this section, we present the Dynamic Path Identifier (DPI) scheme. DPI consists of three modules: a toolkit, an oracle, and a tree assessment module, whose relationships are shown in Figure 2. The toolkit stores the current estimates of the rewards of every action for all decision nodes; the oracle utilizes the observed context whenever a decision node is reached to decide which child node to go to after consulting the toolkit; and when a terminal node is reached, the tree assessment updates and trains the toolkit. We introduce the toolkit, the oracle, and the tree assessment in Sections 4.1, 4.2, and 4.3 respectively. A detailed diagram illustrating the decision process of DPI is provided in Section 4.4.

Segment and toolkit
DPI adopts a toolkit to assist the oracle in making a decision whenever a decision node is reached. The toolkit stores a series of segments that are trained by the tree assessment, and their values are converted into the context-specific optimal action by the oracle. The definition of segments is introduced shortly in this section. For every decision node $N_{d,k} \in \mathcal{N}^d$, the toolkit stores a context-action space $P_{d,k}$ that associates the observed context $x_{i,d,k}$ with the possible actions $a \in A_{d,k}$ at this decision node. It has two dimensions: the x-axis is the context $x_{i,d,k} \in [0, 1]$ and the y-axis indicates the action $a \in A_{d,k}$; hence each context-action space can be visualized as $|A_{d,k}|$ parallel line segments of length 1. Any point $(x, a) \in P_{d,k}$ represents a context-action pair, indicating that action $a \in A_{d,k}$ is chosen when context x is observed at $N_{d,k}$. Let $\mathcal{D}((x, a), (x', a'))$ be the Euclidean distance between two points (x, a) and (x', a'). Each point $(x, a) \in P_{d,k}$ has an expected reward $\mu_{d,k}(x, a)$. We assume that the Lipschitz condition holds in the context-action space: for any two points $(x, a), (x', a') \in P_{d,k}$, $|\mu_{d,k}(x, a) - \mu_{d,k}(x', a')| \le \mathcal{D}((x, a), (x', a'))$. We introduce "segments" to group neighboring points together, and we use the performance of one point (the center) to represent every other point in the segment, where the approximation error is bounded by the distance to the center. The definition of a segment is given in Structure 1. Each segment has a number of static variables, whose values do not change over time. For a segment B, its node (B.node) indicates the node this segment belongs to; its action (B.a) is the action that B contains; its center ($B.\rho$) is its representative point; its radius (B.r) is the maximum distance between any point in the segment and the center; and its range (B.range) is the interval of points that this segment includes.
Each segment B also has a number of non-static variables, which encode the inference history and are trained over time: its exploitation index (B.v) and exploration index (B.c) indicate the worthiness of exploiting the current best action and exploring new possibilities at the time of inference; its assessment index (B.U) is a combination of v and c, representing the overall worthiness of selecting this segment. In Section 4.3, we will prove that B.U is a loose upper confidence bound on the expected reward of the center of B. Its accumulated reward (B.W) and selection count (B.N) record the inference history of B and are used to calculate B.v, B.c, and B.U, as detailed in Section 4.3. Every time a new segment is created, we initialize its static and non-static variables as shown in Line 3. From B.U, we can further obtain an oracle index I(B), which is a tighter upper confidence bound for every point in the segment; this is detailed in Section 4.2.
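The following is a minimal Python sketch of Structure 1 under stated assumptions: v is the empirical mean W/N, c is a confidence width of the usual sqrt(log F / N) form (the exact constant used by the paper is not reproduced here), and U = v + c + r as described in Section 4.3.

    import math
    from dataclasses import dataclass

    @dataclass
    class Segment:
        node: tuple        # static: (d, k), the decision node B belongs to
        a: int             # static: the action this segment contains
        center: float      # static: context coordinate of the center B.rho
        r: float           # static: radius, max distance from the center
        W: float = 0.0     # non-static: accumulated realized reward
        N: int = 0         # non-static: number of times selected

        @property
        def range(self):   # interval of contexts this segment covers
            return (self.center - self.r, self.center + self.r)

        def v(self):       # exploitation index: average realized reward
            return self.W / self.N if self.N > 0 else 0.0

        def c(self, F):    # exploration index (assumed sqrt-form confidence width)
            return math.sqrt(2.0 * math.log(F) / self.N) if self.N > 0 else float("inf")

        def U(self, F):    # assessment index: v + c plus the radius as uncertainty
            return self.v() + self.c(F) + self.r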
We initialize the toolkit in Algorithm 2 given the structure of the tree. For any $N_{d,k} \in \mathcal{N}^d$, we initialize $|A_{d,k}|$ segments to cover all the points in $P_{d,k}$. New segments with smaller radii are created by the tree assessment as new rounds are processed. Let $Y_{d,k}$ be the segment set that contains all the segments created at $N_{d,k}$ (Line 4). The toolkit T includes all the segment sets (Line 5).
The main function of DPI is given in Algorithm 3. We first initialize the toolkit (Line 1). For each round, starting from the root node (Line 3), DPI observes the current context and the oracle selects an action (i.e., chooses a child node). The system records the selected segment and the observed context and then moves to the chosen child node (Line 5). If the child node is a terminal node, the tree assessment observes the outcome of round i and trains the toolkit (Line 6).
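The following Python sketch mirrors this main loop; init_toolkit, observe_context, observe_outcome, and the tree-traversal helpers are assumed names standing in for Algorithm 2 and the environment, while oracle and assessment follow Sections 4.2 and 4.3.

    def dpi(tree, F):
        toolkit = init_toolkit(tree)             # Algorithm 2: one segment per action
        for i in range(1, F + 1):
            node, selected = tree.root, []
            while not node.is_terminal:
                x = observe_context(node, i)     # context revealed at the decision node
                B = oracle(toolkit, node, x, F)  # pick a segment, hence an action
                selected.append((B, x))          # record for the tree assessment
                node = tree.child(node, B.a)     # move to the chosen child node
            outcome = observe_outcome(node, i)   # realized outcome at the terminal node
            assessment(toolkit, selected, outcome, F)  # train the toolkit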

Oracle (Function oracle(⋅))
When a decision node $N_{d,k}$ is reached at round i, the oracle observes the context and selects one segment $B \in Y_{d,k}$, whose action becomes the selected action $a_{i,d,k}$. The pseudo-code of the oracle making a decision for $N_{d,k}$ at round i is given in Algorithm 4. The oracle first identifies the domain of every segment in $Y_{d,k}$ (Line 3). We define the domain of B as the range of B excluding the ranges of all the other segments in $Y_{d,k}$ with strictly smaller radii. The domain of B serves as a refined grouping of the neighboring contexts compared to B.range.
The oracle then identifies the relevant segment set $R_{d,k}$ given $x_{i,d,k}$ (Line 4). $R_{d,k}$ is defined as the set of segments that contain a point $(x_{i,d,k}, a)$, $a \in A_{d,k}$, in their domains. We refer to these segments as relevant segments. The next step is to calculate the oracle index for all segments in $R_{d,k}$, as described in Line 5. Note that $\mathcal{D}(B, B')$ is the Euclidean distance between $B.\rho$ and $B'.\rho$.
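The sketch below follows this description; the oracle index is written here in the zooming-style form I(B) = min over B' of (B'.U + D(B, B')), which matches the description of tightening B.U, but the exact formula in Algorithm 4 should be taken as an assumption.

    def oracle(toolkit, node, x, F):
        Y = toolkit[node.d, node.k]              # all segments created at N_{d,k}
        def in_domain(B):
            lo, hi = B.range
            if not (lo <= x <= hi):
                return False
            # exclude contexts covered by strictly smaller segments (assumed:
            # on the same action line, since each action has its own line)
            return not any(Bp.a == B.a and Bp.r < B.r
                           and Bp.range[0] <= x <= Bp.range[1] for Bp in Y)
        relevant = [B for B in Y if in_domain(B)]           # R_{d,k}
        def dist(B, Bp):   # Euclidean distance between the centers B.rho, B'.rho
            return ((B.center - Bp.center) ** 2 + (B.a - Bp.a) ** 2) ** 0.5
        def index(B):      # oracle index I(B), a tightened upper confidence bound
            return min(Bp.U(F) + dist(B, Bp) for Bp in relevant)
        return max(relevant, key=index)          # segment with the largest I(B)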

Tree assessment (Function assessment(⋅))
When the learner reaches a terminal node, the tree assessment uses the realized outcome $\hat{\pi}_i$ observed at the terminal node to train T. The pseudo-code of the tree assessment training the toolkit at round i is given in Algorithm 5. For each segment selected in round i ($B_{i,d} \in \mathbf{B}_i$), we find the label of the decision node that $B_{i,d}$ belongs to (Line 3); for ease of presentation, we denote the label by (d, k). The realized reward of $B_{i,d}$ is the realized reward of the point $(x_{i,d,k}, a_{i,d,k})$, which, according to (2), equals the observed outcome $\hat{\pi}_i$ (Line 3). Let $\mu(B_{i,d})$ be the expected reward of $B_{i,d}$, defined as the expected reward of the center of $B_{i,d}$, i.e., $\mu(B_{i,d}) = \mu_{d,k}(B_{i,d}.\rho)$. We then update the non-static variables of $B_{i,d}$ (Line 3). We first update $(B_{i,d}).W$ and $(B_{i,d}).N$. We then update the exploitation index, which is defined as the average realized reward of $B_{i,d}$ before the (i+1)-th round. A larger $(B_{i,d}).v$ means $B_{i,d}$ is more likely to have a larger expected reward $\mu(B_{i,d})$ based on previous observations. The exploration index is then updated; its value bounds the difference between $(B_{i,d}).v$ and $\mu(B_{i,d})$. A larger $(B_{i,d}).c$ means segment $B_{i,d}$ has not yet been assessed much, which in turn favors exploring the child nodes that have rarely been explored in past rounds. The next step is to update $(B_{i,d}).U$: the assessment index of segment $B_{i,d}$ is the sum of $(B_{i,d}).v$ and $(B_{i,d}).c$ plus an uncertainty term (the radius of the segment). In Section 5, we will prove that $(B_{i,d}).U \ge \mu(B_{i,d})$. The value of $(B_{i,d}).U$ is used by the oracle to obtain a tighter performance bound $I(B_{i,d})$.
After updating the non-static variables, if $(B_{i,d}).c \le (B_{i,d}).r$, the tree assessment creates a new segment to obtain a tighter bound for future estimations. The center of the new segment is $(x_{i,d,k}, (B_{i,d}).a)$, and its radius is $\frac{1}{2}(B_{i,d}).r$. The process is shown in Lines 4-5.
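A minimal Python sketch of this update, using the Segment fields introduced in Section 4.1 (v, c, and U are recomputed on demand from W and N, so only W and N are updated explicitly):

    def assessment(toolkit, selected, outcome, F):
        for B, x in selected:                    # every segment chosen on path P_i
            B.W += outcome                       # realized reward equals the outcome
            B.N += 1
            if B.c(F) <= B.r:                    # activation rule (Lines 4-5)
                child = Segment(node=B.node, a=B.a, center=x, r=B.r / 2.0)
                toolkit[B.node].append(child)    # tighter bound for future estimates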

An example of finding a path
A detailed example of DPI dynamically finding a path is shown in Figure 3. Figures 3(a)-(c) demonstrate how the oracle selects the child node at round i, and Figure 3(d) shows the tree assessment. To uniquely label the segments in this example, we use the form $B_{d,k,n}$, where d and k identify the decision node that $B_{d,k,n}$ belongs to and n is the order in which the segment was added to $Y_{d,k}$.

DPI performance analysis
In this section, we present the theoretical performance analysis of DPI by providing its regret bound as well as analyzing the complexity of the two core functions.

Theorem 1 If a segment B is selected at round i, then with probability at least $1 - 2F^{-3}$ we have $B.U \ge \mu(B)$.
We prove this by constructing a sequence of variables from $\mu_{d,k}(x_{i,d,k}, a_{i,d,k})$ and all the previously observed $\pi_{i,d,k}(x_{i,d,k}, a_{i,d,k})$. We show in the Appendix that this sequence is a super-martingale, and we then prove the theorem by applying the Azuma-Hoeffding inequality to this sequence. The detailed proof of Theorem 1 can be found in Appendix 1. Theorem 1 shows that, with probability at least $1 - 2F^{-3}$, B.U is an upper confidence bound on $\mu(B)$.

Theorem 2 If a segment B is selected at decision node $N_{d,k}$ at round i, then with high probability $\Delta_{d,k}(x_{i,d,k}, a_{i,d,k}) \le O(B.r)$.

We prove Theorem 2 by bounding $\Delta_{d,k}(x_{i,d,k}, a_{i,d,k})$ using the condition of the activation rule, Theorem 1, and the definition of the oracle index. The detailed proof of Theorem 2 can be found in Appendix 2. From Theorems 1 and 2, we can derive the regret bound as follows.
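For reference, the Azuma-Hoeffding inequality used here takes the following standard form for a super-martingale with bounded increments (the specialization of $\varepsilon$ and the increment bounds to our setting is done in the Appendix):
$$\Pr\left(Z_n - Z_0 \ge \varepsilon\right) \le \exp\!\left(-\frac{\varepsilon^2}{2\sum_{j=1}^{n} b_j^2}\right), \quad \text{where } |Z_j - Z_{j-1}| \le b_j.$$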

Theorem 3 The accumulated regret defined in (3) up to round F is $R(F) \le O\big((\log F)^{1/3} F^{2/3} DC\big)$.
The proof of Theorem 3 can be found in Appendix 3. We partition the F rounds by the radius of the selected segments. The number of rounds in each partition can be bounded via B.N, and we sum the regret over the partitions using the infinite geometric series formula. Theorem 3 demonstrates that R(F) is not a function of the number of paths. Given D and C, if F is sufficiently large, the accumulated regret per round R(F)/F approaches zero, as shown below. This shows that DPI achieves asymptotically optimal performance.
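Concretely, dividing the bound of Theorem 3 by F gives
$$\frac{R(F)}{F} \le O\!\left(\frac{(\log F)^{1/3} F^{2/3} DC}{F}\right) = O\!\left((\log F)^{1/3} F^{-1/3} DC\right) \longrightarrow 0 \quad \text{as } F \to \infty.$$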

Complexity analysis
In this section, we show the complexity analysis of the oracle and the tree assessment. We show that the complexity of both the oracle and the tree assessment are not dependent on the number of the decision nodes of the tree.

Oracle
Because DPI is a real-time scheme, we discuss the computational complexity of the oracle (Algorithm 4). According to Section 4.3, for any round $i = 1, \dots, F$, the tree assessment adds at most one segment to the segment set $Y_{d,k}$ of any decision node $N_{d,k}$. This implies that after processing i rounds, there are at most i segments in $Y_{d,k}$. Therefore, the complexity of Lines 4-6 of Algorithm 4 is at most O(i), and the overall complexity of the oracle at round i is at most O(i).

Tree assessment
We then investigate the complexity of training the toolkit (Algorithm 5). For each selected segment, the complexity of Lines 3-5 is O(1). There are at most D selected segments per round; therefore the overall complexity of the tree assessment is at most O(D).

Overall complexity analysis
Because the depth of the tree is D, there are at most D decisions made in each round. Hence, the overall complexity of round i is at most O(Di + D). When the round index i is large enough, the depth of the tree D is much smaller than i; therefore the overall complexity of round i is O(Di) for large i.

Evaluation
In this section, we present the evaluation of DPI, where we conduct comprehensive simulations that compare its performance against benchmark schemes and the ground truth under a variety of scenarios.

Settings
We consider two decision trees for simulation. The first and second decision trees (referred to as Tree 1 and Tree 2) are shown in Figures 4 and 5 respectively. For each round, we assign a context to each decision node and a reward to each terminal node, so that we can use the largest reward of each round as the ground truth. We also investigate the variance of DPI under different scenarios to test its robustness. We consider two environments: (1) Stable environment: before the processing of each round i, a context vector is drawn as $\mathbf{x}_i \sim \mathcal{D}_1$ and an outcome vector is drawn as $\boldsymbol{\pi}_i \sim \Pi_1(\mathbf{x}_i)$, where $\mathcal{D}_1$ and $\Pi_1$ are defined by us and must be learned by the schemes. (2) Sudden change environment: this environment investigates how the schemes adapt in a scenario where $\mathcal{D}_1$ and $\Pi_1$ change drastically (e.g., the distributions are affected by some unknown external factors). In the first half of the process, the context and outcome vectors are drawn from $\mathcal{D}_1$ and $\Pi_1$; in the remaining rounds, they follow $\mathcal{D}_2$ and $\Pi_2$. $\mathcal{D}_1$, $\Pi_1$, $\mathcal{D}_2$, and $\Pi_2$ are detailed in Appendix E.1.
We first investigate the accumulated outcome in relation to the number of rounds in the two environments. To the best of our knowledge, no existing contextual bandit scheme can be directly used as a benchmark for the decision tree problem in our study. Therefore, we compare DPI with the following four benchmarks, which are classic MAB schemes treating each path of the tree as an arm: (1) Random exit scheme (Rand): the system randomly chooses a path before the processing of each round. (2) Explore-First (EF): the system chooses each path K times and then processes the rest of the rounds using the path that has the largest average reward in the exploration phase; we set K = 10 (a sketch is given after this list). (3) UCB: the system selects the path with the largest upper confidence bound. (4) ε-Greedy: with probability ε the system explores a random path; otherwise it selects the empirically best path.
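For concreteness, a minimal Python sketch of the EF benchmark follows; paths and play (which plays a path and returns its realized outcome) are illustrative names for the simulation interface.

    def explore_first(paths, play, F, K=10):
        totals = {p: 0.0 for p in paths}
        for p in paths:                          # exploration: each path K times
            for _ in range(K):
                totals[p] += play(p)
        best = max(paths, key=lambda p: totals[p] / K)
        for _ in range(F - K * len(paths)):      # exploitation: commit to the best
            play(best)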

Accumulated outcome comparing to benchmarks
We show the accumulated outcomes in the two environments using Tree 1 and Tree 2. For Tree 1, we process 10000 rounds for each scheme and repeat the simulation 50 times to show the variance. As shown in Figures 6(a) and (b), DPI outperforms all of the benchmarks, demonstrating a clear advantage of utilizing the observed contexts at each decision node. In the stable environment, both UCB and ε-Greedy have performance similar to Rand because they cannot recognize the relationship between the contexts and the rewards. In the sudden change environment, DPI and UCB both adapt to the new environment faster because they explore and exploit the environment simultaneously. DPI further improves the recovery speed compared to UCB, which is shown by its steeper slope. Overall, DPI increases the reward by up to 1.3x compared to the best benchmark.
The results for Tree 2 are shown in Figures 6(c) and (d); we process 100000 rounds for each simulation. As with Tree 1, DPI outperforms all of the classic MAB algorithms in both environments. In particular, DPI adapts better to the sudden change of environment than the benchmarks. As shown in Figure 6, it takes more rounds for DPI to train the toolkit for Tree 2 than for Tree 1: in the stable environment, the worst accumulated outcome of DPI exceeds the best of the benchmarks only after about 40000 rounds, while it takes only 6000 rounds in Figure 6(a). This is caused by the complexity of the topological structure of Tree 2: it has 13 decision nodes, each with 3 actions, so there are 39 actions to consider. Nevertheless, the mean accumulated outcome of DPI is consistently larger than that of all the benchmarks.

The realized outcome
We calculate the ratio of the realized outcome to the optimal outcome (the largest outcome in the round) at every round in the two environments using the two trees. We smooth the results by averaging the ratio over the nearby 500 rounds for every point (a binning method). Figures 7(a) and (b) show the results for Tree 1, and Figures 7(c) and (d) show the results for Tree 2. For the stable environment, the ratio gradually increases and then remains stable. In both Figures 7(b) and (d), the sudden change is visible, after which the ratios soon recover and stabilize close to 0.9. This shows that DPI successfully finds the optimal decision the vast majority of the time, indicating its robustness.
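The smoothing step can be sketched as follows; ratios is assumed to be the per-round list of realized-to-optimal outcome ratios.

    import numpy as np

    def smooth(ratios, window=500):
        # moving average over a window of nearby rounds (binning method)
        kernel = np.ones(window) / window
        return np.convolve(np.asarray(ratios, dtype=float), kernel, mode="valid")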

Regret analysis
We proved in the previous section that the accumulated regret of DPI grows as $F^{2/3}$ (up to logarithmic factors). Here we demonstrate this numerically, since the ground truth is known. In both environments, DPI processes $10 \times 10^5$ consecutive rounds; we calculate the accumulated regret according to (3) and then divide it by $F^{2/3}$. The results are shown in Figure 8. $R(F)/F^{2/3}$ approaches a constant value in both environments for both trees as F grows, which demonstrates that the regret is asymptotically proportional to $F^{2/3}$. This also shows that when the number of rounds F is large enough, the regret per round (i.e., R(F)/F) approaches 0, confirming the effectiveness of DPI in optimizing the decision making.

Conclusion
In this paper, we investigate the problem of dynamically finding the optimal path of a given decision tree via contextual bandits. We design DPI, an online learning scheme with three modules: the toolkit stores the processing history; the oracle makes prompt decisions at each decision node; and the tree assessment obtains the rewards of the previous actions and trains the toolkit. We mathematically prove that DPI achieves an accumulated regret of $O\big((\log F)^{1/3} F^{2/3} DC\big)$ over F rounds, and this theoretical regret analysis is supported by our numerical evaluation. We show that the complexities of the oracle and the tree assessment at round i are O(i) and O(D) respectively, which means they do not depend on the number of decision nodes of the tree. We evaluate DPI via a series of simulations, and the results demonstrate the effectiveness and robustness of DPI under a variety of environments.
In the future, we will focus on another challenge DPI faces: the worthiness of choosing an internal node can be affected by the performance of its child nodes. For instance, suppose a decision node (node A) has two terminal child nodes, one with a high expected outcome (node B) and the other with an extremely low expected outcome (node C). Node A might be deemed not worth choosing as a result of the poor performance of node C in previous rounds, reducing the possibility of selecting node B in the future.

A.1: Proof of Theorem 1
Proof Recall that at node $N_{d,k}$, segment B has center $B.\rho$, and we have $\mu(B) = \mu_{d,k}(B.\rho)$. Let $S_{i,B}$ be the set of labels of rounds up to round i (i.e., before round i+1) in which segment B is selected at decision node $N_{d,k}$. Hence, we can define the sequence $Z_{i,B,j} = \sum_{s} \big[\pi_{s,d,k}(x_{s,d,k}, a_{s,d,k}) - \mu_{d,k}(x_{s,d,k}, a_{s,d,k})\big]$, where the sum runs over the first j elements of $S_{i,B}$.

Lemma 4 $\{Z_{i,B,j}\}_{j=0,\dots,B.N}$ is a super-martingale with bounded increments.
Proof In Theorem 1, the j-th increment of the sequence is defined as $\pi_{s,d,k}(x_{s,d,k}, a_{s,d,k}) - \mu_{d,k}(x_{s,d,k}, a_{s,d,k})$, where s is the j-th element of $S_{i,B}$.
Recall that $P_j$ is the path selected at round j, and the realized reward of taking action $a_{j,d,k}$ given $x_{j,d,k}$ at decision node $N_{d,k} \in P_j$ at round j is $\pi_{j,d,k}(x_{j,d,k}, a_{j,d,k}) = \hat{\pi}_j$, where $\hat{\pi}_j$ is the outcome observed at the terminal node of $P_j$. The expected context-specific reward is $\mu_{d,k}(x_{j,d,k}, a_{j,d,k})$. We use a bottom-up approach to prove that for every decision node $N_{d,k}$, $\mathbb{E}[\pi_{j,d,k}(x_{j,d,k}, a)] \le \mu_{d,k}(x_{j,d,k}, a)$. If $N_{d,k}$ is proven to satisfy this condition, we call it a clear node. We start from the bottom, where after taking action a, the child node $N_{d+1,k'}$ is selected (i.e., $k' = \phi(N_{d,k}, a_{j,d,k})$). (1) If $N_{d+1,k'}$ is a terminal node, the realized reward is the outcome at that terminal node, whose expectation is at most $\mu_{d,k}(x_{j,d,k}, a)$ by definition; therefore, if a terminal node is reached after taking action a at $N_{d,k}$, $N_{d,k}$ is a clear node. The case where $N_{d+1,k'}$ is a decision node uses the condition
$$\max\big(\mathcal{D}(B_{i,d}, B_p), (B_p).c\big) \le (B_p).r. \tag{A16}$$

A.3: Proof of Theorem 3
Recall that event C is the event that Theorem 2 holds for all segments at all decision nodes. By the law of total expectation, the overall regret decomposes according to whether C holds. We first consider the conditional regret $\mathbb{E}[R(F) \mid C]$ where event C holds. According to the activation rule, for any segment $B \in Y_{d,k}$ of any decision node $N_{d,k}$, there exists $j \in \mathbb{N}$ that satisfies $B.r = 2^{-j}$. The set $F_{d,k,r,F}$ contains all segments in $Y_{d,k}$ with radius r at decision node $N_{d,k}$ up to round F. For each segment $B \in F_{d,k,r,F}$, let the set $Z_{i,d,k,n}$ consist of all the labels of rounds up to round i in which B is selected but has not yet become a parent segment, together with the round in which B is created; we have $Z_{i,d,k,n} = \{j \mid j = 1, \dots, i,\; B = B_{j,d},\; B.c > B.r\}$. Note that B can be selected after it becomes a parent, but in such rounds the activation condition is satisfied and new segments are created, so $Z_{i,d,k,n}$ does not need to include these rounds. We can bound the size of the set $Z_{i,d,k,n}$ by the definition of the exploration radius.
Based on the activation rule, the centers of the segments in $F_{d,k,r,F}$ are at least r apart from each other; hence we have $|F_{d,k,r,F}| \le |C_{d,k}|/r \le C/r$ for any $N_{d,k} \in \mathcal{N}^d$, where C is the maximum degree among all the decision nodes.
Fix some $r_0 \in (0, 1)$. When a segment with radius $B.r \le r_0$ is selected at decision node $N_{d,k} \in \mathcal{N}^d$ at round i, we have $\Delta_{d,k}(x_{i,d,k}, a_{i,d,k}) \le O(B.r) \le O(r_0)$ (Theorem 2); hence the total regret of all such actions in F rounds is at most $O(r_0 F)$. The regret under event C can then be bounded accordingly.