Quantifying the effect of feedback frequency in interactive reinforcement learning for robotic tasks

Reinforcement learning (RL) has become widely adopted in robot control. Despite many successes, one major persisting problem is its very low data efficiency. One solution is interactive feedback, which has been shown to speed up RL considerably. As a result, there is an abundance of different strategies, which are, however, primarily tested on discrete grid-world and small-scale optimal control scenarios. In the literature, there is no consensus about which feedback frequency is optimal or at which time the feedback is most beneficial. To resolve these discrepancies, we isolate and quantify the effect of feedback frequency in robotic tasks with continuous state and action spaces. The experiments encompass inverse kinematics learning for robotic manipulator arms of different complexity. We show that seemingly contradictory reported phenomena occur at different complexity levels. Furthermore, our results suggest that no single ideal feedback frequency exists. Rather, the feedback frequency should be changed as the agent's proficiency in the task increases.


Introduction
Reinforcement Learning (RL) has become widely used in modern robotic technologies. One reason is the compelling simplicity and generality of the framework. In short, the behaviour an agent is expected to learn is encoded by a reward function. Through interaction with the environment, the agent learns to maximize the reward by performing actions that have proven to be beneficial. However, this seeming simplicity hides many pitfalls and subtleties. One common shortcoming of most algorithms is their very low data efficiency: complex problems might require millions of agent-environment interactions to be solved [e.g., 1].
One strategy to accelerate this procedure is interactive reinforcement learning (IRL) [2]. Interaction augments the information available to the learning agent with feedback from a teacher, which can be a human or another type of agent [3]. In the latter case, it is also known as the agents-teaching-agents subfield of transfer learning [4]. There are numerous alternatives to implement IRL, as described by Arzate Cruz and Igarashi [2]. A graphical overview is provided in Figure 1. Firstly, the teacher feedback can be classified into critique (binary), scalar values, action advice, and guidance. Further, this feedback can be used to modify different aspects of the learning model, i.e., the reward function (reward shaping [e.g., 5]), the policy (policy shaping [e.g., 6]), the exploration process (guided exploration [e.g., 7]), and the value function (augmented value function). The literature on interactive reinforcement learning is extensive, such that many of the combinations shown in Figure 1 have already been explored. One consistent result is that agents using interaction in any form can outperform vanilla RL agents. However, it is still unclear how much the different aspects contribute to overall performance gains or how different feedback strategies can be combined, and in which proportions, to optimize users' experience, agents' task performance, or both.

IRL algorithms are mainly developed with human teachers in mind. Empirical evidence indicates that people are strongly biased to use evaluative feedback communicatively rather than as reinforcement [8,9]. In other words, humans use evaluative feedback to communicate, which corresponds to policy shaping rather than reward shaping. Arguably, this type of feedback favours IRL strategies based on policy shaping and guided exploration, because it would make the interaction both more engaging for the teacher and more effective for the learning agent [e.g., 6,9,10]. Results from Ho et al. [8], showing that people consistently use feedback communicatively even when interacting with reward-maximising agents, further support this claim. Moreover, using feedback signals as rewards and punishments when interacting with reward-maximising RL agents can lead to reward hacking [11]. Reward hacking is a consequence of misspecified reward functions, which lead to undesired behaviours, such as when action sequences leading to the reward from the human are repeated at the expense of learning to complete the task more generally.
Despite efforts to improve the study of interaction on RL agents with human teachers, we believe there is still much to be learned using simulated teachers and oracles, as suggested by Bignold et al. [12]. Moreover, human feedback varies in accuracy, availability, concept drift, reward bias, cognitive bias, knowledge level, latency, etc. [12], which makes it very challenging to isolate the effects of different interaction strategies in reinforcement learning agents. Fortunately, pre-trained agents or hard-coded heuristics can be used as feedback sources without requiring modifications to the learning algorithms. These types of 'teachers' are primarily used in theoretical research since they allow for better controllable and more easily implementable experiments.
Different strategies have been compared regarding policy shaping, such as early advising, alternating or stochastic advising, importance advising, and mistake correction. Mistake correction consistently outperformed the other methods, both in discrete [13,14] and continuous state and action spaces [13]. Taylor et al. [13] also noted that mistake correction is more robust to changes in feedback quality than alternate advising. In addition, mistake correction and predictive advising are most robust to changes in the state representation between teacher and agent. However, as noted by Cruz et al. [14], mistake correction in policy shaping would be the most difficult strategy to implement in real-world scenarios with human teachers since it requires the teacher to detect the mistake, revert it, and suggest a better alternative action.
A more straightforward way is to use mistake correction for guided exploration. Here, the teacher must detect and revert the error but not necessarily suggest a better action. In addition, despite their popularity, we believe that policy-shaping strategies might hinder the learning of robust policies by reducing exploration, which leads to good performance primarily in the neighbourhood of the behaviours demonstrated by the teacher [15]. Limiting the exploration in this manner can result in poor performance in other areas of the state space that are rarely or never encountered during training [15].
In the literature on feedback-guided exploration, Stahlhut et al. [7,16] found that mistake correction does not help to increase the learning speed in simple tasks but only starts to have a measurable effect as the task complexity increases. It was also observed that higher feedback frequencies lead to more robust agents, i.e., that the average agents' performance for the same hyperparameters has a smaller standard deviation across different random seeds. Moreover, feedback can offset the detrimental effect of poorly tuned hyperparameters as a byproduct of this increased robustness. This effect becomes stronger as feedback frequency increases, regardless of the complexity of the problem. Stahlhut et al. also speculate that interaction has a more significant impact during early learning. In addition, they hypothesise that feedback may be indispensable to achieve sufficient performance in very complex tasks, in agreement with Suay and Chernova's hypothesis [15].
Millán-Arias et al. [17] further investigated the effect of feedback frequency used to guide the exploration process. They observed that too much feedback might lead to delayed onset of learning. Despite that, the performance of highly interactive agents converges at the same time as the performance of less interactive ones. In addition, they speculate that too much binary advice, even if 100% correct, can be counterproductive and slow down learning, particularly in noisy environments. They conclude that intermediate interaction frequencies are optimal.
Summarizing previous findings, there are strikingly contradictory accounts of the optimal choice of feedback frequencies. Some authors suggest that more feedback is better [7,16], while others indicate that the cost of high feedback frequencies does not justify the gains [12]. Still others advocate for intermediate feedback frequencies and even report detrimental effects of frequent feedback [17]. It was also suggested that the feedback frequency should not be stationary but adjusted as the agents' proficiency increases during training [7,16,18].
We assume that the reason for the disagreement is a lack of knowledge about the differential effects of varying feedback frequencies at different levels of agent proficiency and task complexity. Consolidating these previous findings without further experimentation is complicated since most results only show cumulative reward or sequence length. Moreover, effect sizes, average performance, and statistical analyses are not reported in most cases. Also, the complexity of the setup might make it impossible to isolate the effects of the different parameters used [18]. In addition, the most common testbeds in IRL research are grid worlds and other low-dimensional discrete state and action spaces. Whereas the small size of these testbeds allows for short experiments, we believe results obtained in those testbeds might not generalize well to more complex problems with real-world implications [2].
Thus, in this paper, we aim to isolate and quantify the effect of feedback frequency on learning performance for different task complexities and agent proficiencies, to shed some light on seemingly contradictory results. As testbeds, we use robotic tasks of varying complexity and continuous action and state spaces. We focus on feedback as mistake correction to guide the exploration process since it does not demand expert knowledge of the task. In our experiments, we reproduce various seemingly contradictory findings about the optimal choice of interaction frequencies and relate them to a differential effect of the teacher interaction on task complexity. We also show that optimal feedback frequencies typically exhibit temporal drifts, making it difficult to recommend a single range of feedback frequencies for any task. We instead conclude, in line with previous suggestions [7,16,18], that an adaptive interaction regime, which changes with the agents' proficiency, is likely optimal. Finally, we discuss a simple heuristic for choosing a close-to-optimal temporal trajectory for the interaction rate.

Methods
This section details all experimental and analysis methods used in the paper.

Environment
Inspired by Stahlhut et al. [7,16], we study the effect of feedback frequency as exploration guidance in an inverse kinematics learning task. The environment dynamics were implemented by the forward kinematics models of the NAO and KUKA (LBR iiwa 14 R820) robots.
For the NAO robot, the forward kinematics model described by Kofinas et al. [19] was used. Two conditions for the NAO robot were defined: a 2 degrees of freedom (DoFs) condition and a 4 DoFs condition. For the 2 DoFs configuration, the elbow and shoulder roll were actuated. In the 4 DoFs condition, all four joints are used, i.e., shoulder pitch, shoulder roll, elbow yaw, and elbow roll.
The KUKA LBR iiwa 14 R820 kinematics were simulated with the model described by Busson et al. [20]. For the KUKA arm, three conditions were studied, i.e., 2 DoFs, 4 DoFs, and 7 DoFs conditions. For the 2 DoFs configuration, joints 2 and 4 were actuated while keeping the other joints in their respective zero-position. For the 4 DoFs configuration, the first four joints were actuated while keeping the other joints in their respective zero-position, and all 7 joints were actuated in the 7 DoFs condition.
The 2 DoFs models of the NAO and KUKA are used to study the role of feedback frequency in 2-dimensional task spaces. In addition, the 4 DoFs and 7 DoFs conditions of NAO and KUKA are examples of more complex 3-dimensional task spaces.

Task and Reward
All experiments aim to generate controllers that can reach arbitrary goal zones in task space while controlling the robot arms in joint space. A sparse reward function is used, i.e., reaching the goal zone leads to a reward of 1. All other actions result in a reward of 0. Such a reward function allows us to isolate the effect of feedback and analyse the learning dynamics more easily. Moreover, adding other reward signals, such as punishment signals, might have a detrimental effect on learning speed [e.g., 21,22], which makes both analysis and design of the reward function difficult.
The Goal Zone Radius (GZR) is 17.5 mm for both NAO configurations and 150 mm for the KUKA arm conditions.
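The sparse reward described above can be sketched in a few lines. This is only an illustration: the function name `sparse_reward`, the dictionary `GZR_MM`, and the choice to count points exactly on the boundary as inside the goal zone are assumptions, not details given in the text.

```python
import math

# Goal zone radii reported in the text, in millimetres.
GZR_MM = {"NAO": 17.5, "KUKA": 150.0}

def sparse_reward(end_effector, goal, goal_zone_radius):
    """Return 1.0 if the end effector is inside the goal zone, otherwise 0.0.

    Treating the boundary as inside (<=) is an assumption of this sketch.
    """
    distance = math.dist(end_effector, goal)
    return 1.0 if distance <= goal_zone_radius else 0.0

# Hypothetical positions in millimetres, chosen only for illustration.
inside = sparse_reward((100.0, 0.0, 0.0), (110.0, 0.0, 0.0), GZR_MM["NAO"])
outside = sparse_reward((100.0, 0.0, 0.0), (130.0, 0.0, 0.0), GZR_MM["NAO"])
print(inside, outside)
```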

Interactive RL Setup
We use the Continuous Actor-Critic Learning Automaton (CACLA) [23] as the underlying reinforcement learning framework. The agent has an Ask Likelihood (L) parameter, representing the likelihood of the agent asking for and receiving guidance from the teacher. The teacher judges the agent's last action based on the Euclidean distance between the end-effector and the goal. However, the feedback to the agent is binary, and it is only used to guide the exploration process and not as an additional reward or to shape the policy directly. In particular, if the last action increases the distance to the goal, it is considered a mistake. When the teacher reports a mistake, the agent undoes the action and explores a new one, after which the cycle is repeated. This process is illustrated in Figure 2. The stochastic feedback strategy used here is a good analogy for teachers taking breaks when providing feedback. Although this type of feedback is easy to automate, it might not always be correct. For instance, in the presence of obstacles or complex task spaces, it might be necessary to first move away from the goal position in order to reach it. Thus, to prevent the agent from potentially getting stuck, L must remain below 1; we keep the maximum value of 0.99 used by Stahlhut et al. [7,16].
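The interaction cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names `teacher_flags_mistake` and `guided_step`, the callables `propose_action` and `apply_action`, the per-attempt draw of the Ask Likelihood, and the `max_retries` safeguard are all hypothetical.

```python
import math
import random

def teacher_flags_mistake(prev_pos, new_pos, goal):
    """Binary critique: True if the last action increased the distance to the goal."""
    return math.dist(new_pos, goal) > math.dist(prev_pos, goal)

def guided_step(pos, goal, propose_action, apply_action, ask_likelihood, rng,
                max_retries=100):
    """Take one step; with probability L the teacher is asked and may veto it.

    A vetoed action is undone (the agent stays at `pos`) and a new action is
    explored. `max_retries` is a hypothetical safeguard of this sketch.
    """
    for _ in range(max_retries):
        action = propose_action(rng)
        new_pos = apply_action(pos, action)
        asked = rng.random() < ask_likelihood
        if asked and teacher_flags_mistake(pos, new_pos, goal):
            continue  # undo: discard new_pos and explore another action
        return new_pos, action
    return pos, None  # no accepted action within the retry budget

# Toy 2-D point "robot", used only to exercise the loop.
def propose(rng):
    return (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))

def move(pos, action):
    return (pos[0] + action[0], pos[1] + action[1])

rng = random.Random(0)
new_pos, action = guided_step((0.0, 0.0), (5.0, 5.0), propose, move, 0.99, rng)
print(new_pos, action)
```

Note that the feedback only filters exploratory actions; it does not enter the reward or the policy update, in line with the description above.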

State and Action Spaces
The state space S for all conditions consists of the corresponding joint positions (proprioception) and the Cartesian coordinates of the target position (exteroception). For all conditions, the action space is limited to the maximal allowed joint displacement per time step of $\pi/10$, i.e., $A = [-\pi/10, \pi/10]^{\mathrm{DoF}}$.
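As a small illustration of these definitions, the sketch below assembles a state vector and keeps an action inside the allowed box. The helper names `build_state` and `clip_action` are hypothetical, and enforcing the limit by clipping is an assumption; the text only states that the action space is bounded.

```python
import math

MAX_DELTA = math.pi / 10  # maximal joint displacement per time step (from the text)

def build_state(joint_positions, target_xyz):
    """Concatenate joint positions (proprioception) and target coordinates (exteroception)."""
    return list(joint_positions) + list(target_xyz)

def clip_action(action):
    """Keep every joint displacement inside [-pi/10, pi/10] (clipping is assumed)."""
    return [max(-MAX_DELTA, min(MAX_DELTA, a)) for a in action]

state = build_state([0.0, 0.1, -0.2, 0.3], [0.4, 0.0, 0.2])
print(len(state), clip_action([1.0, -1.0, 0.1]))
```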

The Controller
The Actor and Critic are implemented with two separate multilayer perceptrons (MLPs). The networks share the same input vector. However, the networks are tuned separately using hyperparameter optimization as described in Section 2.9 (Hyperparameter Optimization). Thus, the learning rate, the number of hidden layers and units, etc. may differ between the Actor and the Critic. All input and output values are scaled to the range [−1, 1]. The activation function for the output units is linear. The networks were implemented in PyTorch 1.10.0 and trained using Adam [24].
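For readers unfamiliar with CACLA [23], its defining feature is that the Actor is moved toward the explored action only when the temporal-difference error is positive, while the Critic is always updated. The sketch below illustrates this rule with linear approximators and a scalar action instead of the MLPs described above; the function names, the linear models, and the values of `alpha`, `beta`, and `gamma` are assumptions made for illustration only.

```python
def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def cacla_update(w, theta, state, action, reward, next_state, done,
                 alpha=0.1, beta=0.1, gamma=0.9):
    """One CACLA step with a linear critic V(s) = w.s and linear actor Ac(s) = theta.s.

    Returns the updated critic weights, actor weights, and the TD error.
    """
    value = dot(w, state)
    next_value = 0.0 if done else dot(w, next_state)
    td_error = reward + gamma * next_value - value

    # The critic is always updated with the TD error.
    new_w = [wi + alpha * td_error * si for wi, si in zip(w, state)]

    # The actor is updated only if the explored action turned out better than expected.
    new_theta = list(theta)
    if td_error > 0:
        action_error = action - dot(theta, state)
        new_theta = [ti + beta * action_error * si for ti, si in zip(theta, state)]
    return new_w, new_theta, td_error

w, theta, delta = cacla_update([0.0], [0.0], [1.0], 0.5, 1.0, [1.0], done=True)
print(w, theta, delta)
```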

Episodes
Based on both the maximum range of motion of the joints and the maximal action size, the smallest number of steps needed to traverse the entire task space, $Steps_{min}$, was computed. $Steps_{min}$ was then used to define the episode length $Steps_{max}$ as 3 × $Steps_{min}$ rounded to the next tenth.
The minimum number of goal zones $G_{min}$ needed to cover the entire workspace was used to determine the number of episodes $N_{train}$ per epoch. $G_{min}$ was calculated using optimal disc (2D task space) or ball (3D task space) packing in the task space volume. $N_{train}$ results from 10 × $G_{min}$ rounded to the second leading digit. Table 1 shows a summary of the episode parameters for all conditions.
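The two rounding rules above can be read in more than one way, so the following sketch should be taken as one possible interpretation only. It assumes that the episode length is rounded up to the next multiple of ten and that the number of episodes is rounded to two significant digits; the function names and the example inputs are hypothetical and are not the values from Table 1.

```python
import math

def episode_length(steps_min):
    """3 x Steps_min, rounded up to the next multiple of ten (assumed reading)."""
    return math.ceil(3 * steps_min / 10) * 10

def episodes_per_epoch(g_min):
    """10 x G_min, rounded half-up to two significant digits (assumed reading)."""
    raw = 10 * g_min
    factor = 10 ** max(len(str(raw)) - 2, 0)
    return (raw + factor // 2) // factor * factor

# Hypothetical inputs, for illustration only.
print(episode_length(7), episodes_per_epoch(127))
```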

Performance Metrics
The following metrics were used to quantify the effect of feedback frequency. Lower values indicate better performance:
• Positioning error: mean Euclidean distance to the target divided by the target radius.
• Failure rate: percentage of missed targets during evaluation, which is equivalent to Failure rate = 1 − average cumulative reward.
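Both metrics are simple to compute from evaluation episodes, as the sketch below shows. The function names are hypothetical, and using the end-effector position at the end of each episode for the distance is an assumption; the failure rate is returned as a fraction rather than a percentage.

```python
import math

def positioning_error(final_positions, targets, goal_zone_radius):
    """Mean Euclidean distance to the target, in units of the goal zone radius."""
    distances = [math.dist(p, t) for p, t in zip(final_positions, targets)]
    return sum(distances) / len(distances) / goal_zone_radius

def failure_rate(episode_rewards):
    """Fraction of missed targets: 1 - average cumulative reward (reward is 0 or 1)."""
    return 1.0 - sum(episode_rewards) / len(episode_rewards)

# Hypothetical evaluation data, for illustration only.
print(positioning_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)], 5.0))
print(failure_rate([1, 0, 1, 1]))
```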
The performance learning curves are analysed with respect to the cumulative steps instead of epochs since the former better reflect the total amount of interaction with the environment. The slope of the failure rate is used as an indicator of the improvement speed.
We furthermore consider thresholded performance profiles. Here, the step count of the L agent that first reaches the corresponding failure rate threshold is used as reference. Then the best failure rate up to this step count is compared across different L values. For this analysis, we used a two-sided Wilcoxon rank sum test with respect to the L = 0.0 condition. Note that for visual clarity, the error bars represent the standard error of the mean and not a confidence interval. This temporal threshold strategy copes better with conditions that cannot achieve an arbitrary performance threshold than the time-to-threshold [25] strategy.
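The thresholded profile can be sketched as below. The data layout (one list of `(cumulative_steps, failure_rate)` pairs per L value, sorted by steps) and the function names are assumptions of this sketch, and the statistical test is deliberately left out.

```python
def first_step_reaching(curve, threshold):
    """Return the first cumulative step count at which the curve reaches the threshold."""
    for steps, rate in curve:
        if rate <= threshold:
            return steps
    return None  # this L value never reaches the threshold

def thresholded_profile(curves, threshold):
    """Compare the best failure rate of every L value up to a common reference step.

    The reference is the step count of the L value that reaches the threshold first.
    """
    reached = {}
    for ask_likelihood, curve in curves.items():
        steps = first_step_reaching(curve, threshold)
        if steps is not None:
            reached[ask_likelihood] = steps
    if not reached:
        return None, {}
    reference = min(reached.values())
    profile = {
        ask_likelihood: min((rate for steps, rate in curve if steps <= reference),
                            default=None)
        for ask_likelihood, curve in curves.items()
    }
    return reference, profile

# Hypothetical learning curves, for illustration only.
curves = {
    0.0: [(100, 0.9), (200, 0.5), (300, 0.2)],
    0.5: [(100, 0.8), (200, 0.3), (300, 0.1)],
}
print(thresholded_profile(curves, 0.3))
```

Because every L value is evaluated at the same reference step count, conditions that never reach the threshold still receive a well-defined score, which is the advantage over time-to-threshold noted above.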

Datasets
Following similar practices as in supervised learning, three datasets were used: a training, a validation, and a test set. Both the training set and the test set are of the same size $N_{train}$, while the validation set is one fifth the size of the training set, see Table 1. The datasets consist of pairs of initial positions for the agent and the target. These positions are generated randomly from a uniform distribution in joint space. The targets are represented in Cartesian coordinates and result from feeding random joint configurations into the forward kinematics model of the corresponding robotic arm. Target coordinates that lie within the goal zone of the corresponding initial position are rejected. One epoch is defined as training for all pairs in the training set once.

Hyperparameter Optimization
Hyperopt [26] was used to determine the best hyperparameters out of 100 hyperparameter sets for each experimental condition. Preliminary tests showed signs of significant performance improvement by the 10th (2 DoFs) or 20th (4 DoFs and 7 DoFs) epoch. Thus, during hyperparameter selection, the agents in the 2 DoFs conditions were trained for 10 epochs while those in the 4 DoFs and 7 DoFs conditions were trained for 20 epochs. In all cases, we used the corresponding training set. The best hyperparameter set was selected based on the lowest positioning error in the validation set at the respective final epoch.
Prior tests showed that optimizing for the lowest positioning error or fastest convergence speed leads to similar results. In real-world scenarios, it is arguably more important to have the robotic arm perform the defined task precisely, with minimal possible error, than to learn quickly but with low repeatability or precision. Thus, here the hyperparameters were optimized for the lowest positioning error.
The hyperparameter search can be done in at least two manners: 1) optimizing the hyperparameters for each Ask Likelihood (L) value independently, or 2) optimizing the hyperparameters only for L = 0.0 and evaluating the performance for increasing values of L. The latter was selected for two reasons. Firstly, hyperparameter optimization is computationally expensive. Secondly, this strategy allows for quantifying the gain of using a particular feedback frequency in an existing system of vanilla RL (L = 0.0).
Table 2 summarizes the hyperparameters and optimization boundaries.The last five hyperparameters corresponding to the neural network configuration are optimized independently for the Actor and Critic, but share the same ranges.

Training and Testing
The agents are trained on the same training set used for the hyperparameter optimization, but this time the agents' performance is evaluated on the test set. All agents are trained for a fixed number of epochs. Eleven agent versions for each condition were trained, including the baseline agents for L = 0.0 (non-interactive) and ten other agent sets with L values increasing in increments of 0.1, with the largest value capped at L = 0.99 as described in the Interactive RL Setup.

Results
In the following section, we present the temporal evolution of the failure rate on the test sets for all experimental conditions.The failure rate value at each point is the average over the 20 randomly initialized agents.
In addition, for each experimental condition, we compare the failure rate for different L values at various threshold levels to assess the optimal L value at different stages in training.

NAO 2 DoFs Experiment
Figure 3a shows the performance evolution for the NAO 2 DoFs condition. The circles indicate the best failure rate value achieved during training for each L value. The failure rate continuously improves with higher L values, even reaching perfect performance in late training for L ≥ 0.7, whereas the performance for low L starts to diverge around $10^4$ steps. The failure rate curves' rate of improvement is comparable for all L values during the first few epochs.
Figure 3b shows the thresholded failure rate performances in the NAO 2 DoFs condition. The cumulative step counts corresponding to the thresholds are marked by blue arrows in Figure 3a. At the first threshold levels, there is a trend toward a significantly lower failure rate as L values increase. However, the behaviour is dynamic, and no single L value remains the best throughout training. Within the tested time horizon, the final performance favours the highest L values.

NAO 4 DoFs Experiment
Figure 4a shows the performance evolution for the NAO 4 DoFs condition. Values up to L = 0.7 converge to a similar value. In contrast, the best performance of higher L is reached earlier, after which the failure rates start to diverge slowly. The improvement speed is initially faster for higher L values before they start to diverge.
Figure 4b shows the corresponding time thresholded failure rate analysis. Here the highest effect on the failure rates is observed in the first half of training at very high L values. In the later phase of training, at the 5% threshold, the optimal L shifts towards medium and high L values. Finally, a significant benefit is mostly absent for the strictest threshold of 3% and beyond. However, for L ≥ 0.9, the effect is significantly detrimental, as indicated by the red markers in Figure 4b.

KUKA 2 DoFs Experiment
Figure 5a shows the performance evolution for the KUKA 2 DoFs condition. The overall failure rate in this condition is relatively high. However, a similar trend to that of the NAO 2 DoF condition can be observed, i.e., the failure rate continuously improves with a higher L value. In contrast, the performance for low L values starts to diverge after the respective best performance is reached.
The improvement speed for low to medium L agents is initially higher. Whereas the improvement speed for higher L values is slightly lower, it is maintained almost constantly throughout the tested time horizon.
Figure 5b shows the time thresholded failure rate for the KUKA 2 DoFs condition. Unlike in both NAO conditions, here, early in training, the interaction does not yield any benefits. Interaction is even significantly detrimental for very high L values. Only towards the end of training does interaction significantly reduce the failure rate.

KUKA 4 DoFs Experiment
Figure 6a shows the performance evolution for the KUKA 4 DoFs condition. Again, the overall failure rate in this condition is relatively high. In contrast to all other experimental conditions, all L values lead to a monotonically improving failure rate. Lower L values initially show a faster rate of improvement but slow down as learning progresses. In contrast, very high L values display a lower rate of improvement, which is, however, almost constant throughout the tested time horizon.
Figure 6b shows the time thresholded failure rate for the KUKA 4 DoFs condition. Early in training, interaction leads to a significant reduction in failure rate primarily for intermediate L values. In contrast, the highest L values have a significantly detrimental effect on performance throughout the tested time horizon. However, from the data, it cannot be judged what the very long-term behaviour of the agents will be since the performance has not converged.

KUKA 7 DoFs Experiment
Figure 7a shows the performance evolution for the KUKA 7 DoFs condition. As for the 2 DoFs conditions, low to medium L agents reach a local optimum around $2 \times 10^6$ steps, after which the performance temporarily deteriorates. However, in contrast to the 2 DoFs conditions, the performance continues to improve beyond the initial local optimum. Higher L agents have a lower rate of improvement but do not experience any divergent behaviour, at least within the tested time horizon.
Figure 7b shows the time thresholded failure rate for the KUKA 7 DoFs condition. As in the previous KUKA conditions, low to medium L values lead to significantly better performance early in training. In contrast, very high L values lead to significantly worse failure rates. The detrimental effect becomes stronger the higher the L value. Only for longer training horizons do high L values start to become significantly beneficial, albeit not optimal. The long-term behaviour cannot be clearly judged here since the performance has not converged.
Table 3 shows a combined summary of the statistical analyses of the effects on the failure rate thresholds for all tested robotic tasks and L values. The effect size reported is the difference of means.

Discussion
Our thorough investigation of the Ask Likelihood's effect on the evolution of the failure rate over time allows us to make more nuanced statements on task-dependent effects of the interaction rate than previously reported. Furthermore, our experiments can unify seemingly contradictory statements on the best choice of L. In summary, across the different experimental conditions, we make three main observations, concerning 1) policy robustness, 2) the optimal long-term L, and 3) the optimal L trajectory.

1) Policy robustness:
With the exception of the KUKA 4 DoF condition, low L agents are prone to suffer from performance divergence after reaching an initial local optimum. High L agents in the same condition do not show this behaviour; see the KUKA 7 DoF, NAO 2 DoF, and KUKA 2 DoF conditions. The divergence could be caused by the limited time horizon of the hyperparameter optimization.
In all cases where divergence occurs, it sets in after the number of epochs used for the optimization. In this case, high L agents not suffering from divergence would be in line with Stahlhut et al. [7,16], who report that high L values lead to more robust policies that are less dependent on optimally tuned hyperparameters.
However, we note that the seemingly fast divergence is in part attributable to the log-log scale of the figures: the number of steps for which the agents stay close to their local optimum is relatively large compared to the initially fast convergence. An alternative explanation is that this divergence happens regularly but is rarely observed or reported since the training is stopped automatically when initial convergence is reached (early stopping). The reason for this divergence could be the phenomenon called capacity loss, which was only recently described. Lyle et al. [27] show that agents can lose the ability to adjust their value function approximator in light of new prediction errors. Capacity loss is attributed to a state representation collapse in the function approximators. This collapse seems most prevalent in temporal difference learning algorithms that use neural networks as function approximators, and with sparse and non-stationary rewards.
An exciting question is whether high L agents are more robust to capacity loss.However, this investigation is beyond the scope of this paper.

2) Optimal long-term L:
High L values are mostly beneficial in the long run, except for the NAO 4 DoF condition. In all other conditions, high L agents reach either the best or comparable to the best performance observed for other L values, albeit at later stages in training.
This observation agrees with Stahlhut et al. [7,16], who report increasing performance with increasing interaction frequency. However, in the NAO 2 DoF case, there is only a small benefit of the highest L value over the others in the long run. Thus, the gain can be considered small, in accordance with Bignold et al. [12], who argue that the increased effort of very frequent interaction does not justify the small performance gains.

3) Optimal L trajectory:
During training within one experimental condition, the best choice of L depends on the proficiency level and typically changes over time. For instance, in the NAO 4 DoF experiment, the optimal L changes from intermediate values in early training to low values in late training. However, this pattern does not generalize across tasks. For example, the optimal L changes from intermediate to high values in the NAO 2 DoF condition, in contrast to the NAO 4 DoF experiment.
This observation can encompass the following findings:
• early feedback is beneficial [18], as seen in all but the KUKA 2 DoF condition,
• intermediate feedback frequencies are optimal [17], as seen in the NAO 2 DoF and KUKA 7 DoF conditions across most of the training,
• and too much early feedback leads to a delayed onset of learning [17], as seen in all 3 KUKA conditions.
Furthermore, the shift of optimal L values during training leads us also to conclude that the interaction rate should be changed adaptively for optimal performance, as also suggested by Cruz et al. [18] and Stahlhut et al. [7,16].
As a proof of concept, we trained agents on the KUKA 2 DoF task, starting with L = 0.0, switching to L = 0.5 at epoch 14, and finally to L = 0.99 at epoch 18. Figure 8 shows that it is possible to achieve the early convergence of the low L agents in this task, combined with the long-term refinement of high L agents. The switch of L induces a short-term performance deterioration. However, measuring the performance as the area under the curve, the adaptive strategy is superior to the fixed L agents.
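The piecewise-constant schedule used in this proof of concept can be written as a small lookup. The function name `ask_likelihood_schedule` and the convention that a switch takes effect from the stated epoch onward are assumptions of this sketch; the switch points themselves are the ones given above.

```python
def ask_likelihood_schedule(epoch, switch_points=((0, 0.0), (14, 0.5), (18, 0.99))):
    """Return the Ask Likelihood L for a given epoch.

    `switch_points` is a sequence of (first_epoch, L) pairs sorted by epoch.
    """
    current = switch_points[0][1]
    for first_epoch, value in switch_points:
        if epoch >= first_epoch:
            current = value
    return current

print([ask_likelihood_schedule(e) for e in (0, 13, 14, 17, 18, 25)])
```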
An important question is how to choose the optimal adaptive L-strategy without having to train agents for various L-values beforehand. Whereas it seems to be a good rule of thumb to switch to high L in late training, the optimal values in early training are very diverse. Here, low L values are optimal for the KUKA 2 DoF condition, intermediate values for KUKA 4 DoF, KUKA 7 DoF, and NAO 2 DoF, and high values for the NAO 4 DoF, spanning a range of L = 0.2 to L = 0.8 across experiments.
We hypothesize that the optimal L in early training is influenced by how much the teacher simplifies the task. In particular, if the task for high L values becomes too easy in comparison to the L = 0.0 task, the agent might fail to generalize and explore too little. Note that the agent effectively faces the L = 0.0 situation during evaluation since it is not receiving feedback then. As an example, consider a fully interactive agent. Here, the teacher will rarely allow actions that increase the distance of the end effector to the goal. This scenario simplifies the task during training but also entails that sub-optimal state-action pairs are seldom encountered. Note that this situation primarily applies to mistake-correcting teachers as used in this article.
We quantify the reduction in task complexity by L as the average failure rate of a newly initialized agent on the training set with an interaction frequency L, but without policy updates. All failure rates are normalized to the performance of the baseline L = 0 agent. Each value is averaged over 20 randomly initialized agents. We compare the relative task complexities with the best choices of L at similar stages in early training. For this, we use the failure rate thresholds of 50% for KUKA 7 DoFs and KUKA 4 DoFs, 40% for KUKA 2 DoFs, 25% for NAO 4 DoFs, and 5% for NAO 2 DoFs (see Figures 3b, 4b, 5b, 6b, 7b). Note that it is not feasible to use the same value for all experiments since the initial performance across experiments varies between ≈50% and ≈90%.
Figure 9 shows the relative task complexity for all experiments and L values, along with the best choices of L at the mentioned thresholds and the significantly detrimental choices. Indeed, L values that lead to relatively low task complexities are prone to have a detrimental effect. In contrast, the most beneficial choices of early L values are those that lead to a relative task complexity between ≈0.78 and ≈0.95. Thus, drawing an initial L from that range for each task makes it considerably more likely to choose the initially optimal L value than naively sampling from the range of L = 0.2 to L = 0.8.
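The heuristic can be sketched as a two-step computation: normalize the failure rates of untrained agents by the L = 0 baseline, then keep the L values whose relative task complexity falls into the reported range. The function names and the example failure rates below are hypothetical; only the bounds 0.78 and 0.95 come from the text.

```python
def relative_task_complexity(untrained_failure_rates):
    """Normalize failure rates of untrained agents by the L = 0.0 baseline.

    `untrained_failure_rates` maps L to the mean failure rate measured with
    feedback frequency L but without policy updates; it must contain the key 0.0.
    """
    baseline = untrained_failure_rates[0.0]
    return {ask_likelihood: rate / baseline
            for ask_likelihood, rate in untrained_failure_rates.items()}

def candidate_initial_L(untrained_failure_rates, low=0.78, high=0.95):
    """Return the L values whose relative task complexity lies in [low, high]."""
    relative = relative_task_complexity(untrained_failure_rates)
    return sorted(ask_likelihood for ask_likelihood, complexity in relative.items()
                  if low <= complexity <= high)

# Hypothetical measurements, for illustration only.
measured = {0.0: 0.90, 0.3: 0.81, 0.6: 0.72, 0.99: 0.45}
print(candidate_initial_L(measured))
```

Since this measurement needs no policy updates, it is cheap compared with training agents for every L value, which is what makes it usable as a heuristic before training.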
Based on this observation, we argue against the claims that either high, intermediate, or low L values are optimal in early training.Instead, optimality seems to be better predicted by the task complexity reduction induced by L.

Conclusion
In this study, we conducted a thorough extension of previous research investigating the effect of feedback frequency on agent performance in continuous action and state spaces. The main contribution is the discovery that task complexity and performance threshold influence the interpretation of the best interaction rate with a teacher. Moreover, no single best solution exists across task conditions. Our results instead suggest that the optimal interaction rate changes over time and that the task complexity determines the specific optimal trajectory for L. These observations allow us to consolidate previously contradictory claims on the optimal interaction frequency. Furthermore, we described a heuristic to choose the initial feedback frequency based on a measure of the relative task complexity changes induced by the teacher's feedback on an untrained agent.

Future work:
A future goal is to probe the described heuristic further to predict the optimal trajectory before training, and potentially adjust it during training. Such a strategy has the potential to increase data efficiency significantly. Drawing such conclusions more generally requires more extensive experimentation across a wider range of tasks of different complexity.
It is also necessary to determine the deeper cause of the seemingly differential effects of task complexity reduction by teacher interaction and their relation to potential capacity loss in the agents' neural network function approximators.
Finally, it will be helpful to quantify the interaction of the feedback frequency effects with other factors, such as fixed feedback budgets and advice quality, to go towards applicable scenarios with realistic human feedback.

Fig. 2 Diagram representing the basic RL setup, with the interactive components colored in red.

Fig. 3 (a): Failure rate evolution for the NAO 2 DoFs experiment in log scale. The circle indicates the best performance for the corresponding L value over the whole training. The blue arrows show the number of environment steps needed for the fastest L agents to reach 10%, 5%, 2%, and 0.3% failure rate. (b): Time thresholded failure rates for the NAO 2 DoFs condition. Statistical significance with respect to the L = 0.0 condition was computed using a two-sided Wilcoxon rank sum test. The N show significance with respect to L = 0.0 at p < 0.05, while the # show significance with respect to L = 0.0 at p < 0.001.

Fig. 4 (a): Failure rate evolution for the NAO 4 DoFs experiment shown in log scale. The circle indicates the best performance for the corresponding L value. The blue arrows show the number of environment steps needed for the fastest L agents to reach 50%, 25%, 10%, 5%, and 3% failure rate. (b): Time thresholded failure rates for the NAO 4 DoFs condition. Statistical significance was computed using a rank sum test. The N show significance with respect to L = 0.0 at p < 0.05, while the # show significance with respect to L = 0.0 at p < 0.001. Red markers indicate significantly detrimental effects.

Fig. 5 (a) Failure rate evolution for the KUKA 2 DoFs experiment shown in log scale. The circle indicates the best performance for the corresponding L value. The blue arrows show the number of environment steps needed for the fastest L agents to reach 50%, 40%, 30%, and 20% failure rate. (b) Time thresholded failure rates for the KUKA 2 DoFs condition. Statistical significance was computed using a two-sided Wilcoxon rank sum test. The N show significance with respect to L = 0.0 at p < 0.05, while the # show significance with respect to L = 0.0 at p < 0.001. Red markers indicate significantly detrimental effects.

Fig. 6 (a) Failure rate evolution for the KUKA 4 DoFs experiment shown in log scale. The circle indicates the best performance for the corresponding L value. The blue arrows show the number of environment steps needed for the fastest L agents to reach 70%, 50%, 30%, 20%, and 15% failure rate. (b) Time thresholded failure rates for the KUKA 4 DoFs condition. Statistical significance was computed using a two-sided Wilcoxon rank sum test. The N show significance with respect to L = 0.0 at p < 0.05, while the # show significance with respect to L = 0.0 at p < 0.001. Red markers indicate significantly detrimental effects.

Fig. 7 (a) Failure rate evolution for the KUKA 7 DoFs experiment shown in log scale. The circle indicates the best performance for the corresponding L value. The blue arrows show the number of environment steps needed for the fastest L agents to reach 75%, 50%, 25%, 10%, and 5% failure rate. (b) Time thresholded failure rates for the KUKA 7 DoFs condition. Statistical significance was computed using a two-sided Wilcoxon rank sum test. The N show significance with respect to L = 0.0 at p < 0.05, while the # show significance with respect to L = 0.0 at p < 0.001. Red markers indicate significantly detrimental effects.

Fig. 8 Failure rate for adaptive L in the KUKA 2 DoFs task. The initial L = 0.0 is changed to L = 0.5 at epoch 14 and to L = 0.99 at epoch 18. The black and yellow curves at 50% opacity show the original performance for fixed L = 0.0 and L = 0.99 (compare Figure 5a). The adaptive agents combine the fast early training and the long-term convergence of the respective fixed L agents.

Fig. 9 Relative task complexity reduction induced by feedback frequencies for all experiments. Interaction has a differential effect on the complexity reduction, depending on the task. Black symbols show the best choice of L for all five experiments during early training (compare Figs. 3b to 7b, second failure rate thresholds). Red symbols show significantly detrimental choices for L at this failure rate threshold.

Table 1
Boundary conditions for the episodes. Steps_min: min. # of steps to cover the task space, Steps_max: max. # of steps the agent can take to reach the target, G_min: min. # of goals to cover the task space, and N_train: # of episodes per epoch.

Table 2
List of hyperparameters to be optimized and their interval of possible values.

Table 3
Effect size for different values of L across robot tasks, computed as the difference of means. Statistical significance was determined by a Wilcoxon rank sum test. The N show significance with respect to L = 0.0 at p < 0.05, while the # show significance with respect to L = 0.0 at p < 0.001. Red markers indicate significantly detrimental effects.