Adaptive intelligent vision-based control of a flexible-link manipulator

Present space robots such as planetary robots and flexible robots have structural flexibility in their arms and joints, which leads to errors in tip positioning owing to tip deflection. The flexible-link manipulator (FLM) is a non-collocated system with unstable and inaccurate performance, so tip-tracking of an FLM poses difficult control challenges. The purpose of this study is to design an adaptive intelligent tip-tracking control strategy for FLMs to deal with these challenges. A vision sensor is utilized in conjunction with a traditional mechanical sensor to measure the tip position directly. Image-based visual servoing (IBVS) is more efficient than other visual servoing control techniques. However, the IBVS scheme faces numerous difficulties that impair performance in real-time applications, including singularities in the interaction matrix, local minima in the trajectory, and visibility issues. To address these issues, a novel adaptive intelligent IBVS (AI-IBVS) controller for tip-tracking control of a two-link flexible manipulator (TLFM) is designed in this study. In particular, this paper addresses the IBVS issues along with retention of the visual features in the field-of-view (FOV). First, in order to retain the object within the camera FOV, an intelligent controller with off-policy reinforcement learning (RL) is proposed. Second, a composite controller for the TLFM is developed that combines the RL controller and the IBVS controller. Simulations have been conducted to examine the effectiveness and robustness of the proposed controller. The obtained results show that the AI-IBVS controller developed here possesses self-learning and decision-making capabilities for robust tip-tracking control of the TLFM. Further, a comparison with other similar approaches is presented.

Owing to its lower energy consumption during transportation, larger payload handling capacity, increased maneuverability, and faster operational speed, the flexible-link manipulator (FLM) has many advantages. However, compared to a rigid manipulator, the structural flexibility of FLM arms and joints causes inaccuracy in tip positioning [1]. Over the past four decades, research on FLM control has been active; the control of flexible-link manipulators (FLMs) is well reviewed in [2,3]. Because the FLM is nonlinear and non-collocated, it behaves as a nonminimum-phase system. Additionally, model truncation and modeling errors are evident, which affect system stability and also lead to inaccurate tip-tracking performance.
The primary cause of the non-collocation in the FLM is the placement of the sensor and actuator at different locations. The majority of the literature uses standard mechanical sensors, such as an accelerometer, encoder, or strain gauge, to measure tip-position information. However, electromagnetic interference occasionally causes these sensors to perform poorly in harsh environments and give a noisy response. Since the tip-point information is measured indirectly by these mechanical sensors, a model is required to relate the measurements to the tip deflection. Moreover, wave propagation along the beam causes the end-effector response to occur slightly later than the control input. To address this issue, sensor and actuator averaging methods were developed in [4]. However, the use of multiple sensors and actuators increases the weight of the flexible manipulator. Optical sensors can also be utilized to measure tip-point information, but they are very susceptible to noise. These challenges, which yield only an indirect estimate of the tip-point deflection, are overcome by the vision sensor. Research on high-performance control of flexible manipulators using visual servoing (VS) has grown recently; VS can significantly increase the accuracy of the tip-point information.
The eye-in-hand configuration (camera placed at the tip, observing the target object) is considered in this work because its positioning accuracy does not depend on the manipulator kinematics. Based on how the error is defined, visual servoing strategies fall into four categories. It has been established that image-based visual servoing (IBVS), which is more competent than other VS techniques, is one of the preferable strategies for controlling FLMs. Additionally, IBVS removes inaccuracies caused by sensor modeling and is robust to errors in camera calibration. However, the IBVS scheme faces numerous difficulties that impair the system's performance in real-time applications, including singularities in the interaction matrix, local minima in the trajectory, and visibility issues.
Singularity and local minima in IBVS are caused by improper pairings of visual features, which impair the FLM's ability to track the tip. Recent studies reveal that IBVS faces two significant difficulties: (1) choosing visual features so as to avoid singularities in the interaction matrix and (2) designing a control scheme using those chosen visual features such that the FLM tracks the target trajectory with the least tracking error. Designing and choosing appropriate visual features for IBVS is a challenging task. In [5], a shifted-moment-based visual feature is used to address the IBVS approach's issues with singularity in the interaction matrix and local minima in trajectories. The work described in [5] demonstrated robustness under a field-of-view (FOV) limitation, i.e., when the object is partially occluded or partially out of the FOV.
Usually, measured visual features are used as the control input for IBVS to compute the controller output. However, due to disturbances during movement, objects may occasionally depart the camera's FOV. Keeping the visual features within the camera's FOV then becomes difficult. Moreover, the stability and performance of the system are directly impacted by the visibility of the visual features; the work presented in [5] may fail if the object is fully out of the FOV. Given the success of image-moment-based visual servoing control schemes in many robotic applications, this work extends the approach to design an adaptive image-moment-based IBVS controller for robust tip-tracking control of the TLFM that addresses the visibility issue of IBVS.
Many approaches have been reported to prevent the aforesaid visibility issue of IBVS, for example, potential fields [6], navigation functions [7], and path planning [8]. The visibility issue has also been addressed by employing a pan-tilt camera [9], odometry with a vision system [10], and specific visual features [11]. The methods described in [6-11] lack self-learning and online decision-making capabilities, rendering them unsuitable for real-time applications (i.e., they cannot automatically adapt to changing control tasks). Moreover, these approaches cannot guarantee that all visual features remain in the FOV [12]. Therefore, a machine learning solution is necessary to solve this issue. In the realm of robotics, reinforcement learning (RL) [13] is a well-known method for adapting to changing control tasks and environments and for providing self-learning and decision-making capabilities. RL in robotics has been applied to the control of a flexible aircraft wing [14], a TLFM [15], a single-link flexible manipulator (SLFM) [16], and many other applications. The algorithm in [15] employs on-policy learning. In the design of the proposed intelligent controller, off-policy learning is used instead, as it is model-free, data-efficient, and faster than on-policy learning [17]. In order to keep objects in the FOV of the camera, an intelligent controller with off-policy reinforcement learning is proposed in this study.
In this line of research, similar studies that combine RL and VS for mobile robots are presented in [18-32]. For VS-based control of a 7-DOF redundant robot manipulator to reach a target position, a self-organizing map (SOM) network-based learning algorithm is given in [28]. In [29], an interesting method for controlling a mobile robot manipulator by fusing RL and IBVS is described; off-line training with traditional Q-learning is adopted for robust grasping of a spherical object. An improvement over [29] is presented in [30], in which neural-network RL (NN-RL) and IBVS are used for control of a robot manipulator; to enable online learning and adaptability to changing control tasks, the NN-RL algorithm is embedded in a hybrid control system. In [31], a model-free RL strategy is introduced for the robotic grasping of unknown objects. In [32], the learning outcome of a generative model is used directly in a real-time application, and an asymmetric actor-critic and variational auto-encoder-based RL algorithm is designed to achieve the desired target. However, results on the integration of RL and IBVS for tip-tracking control of the TLFM have not yet been reported in the literature, which motivates the present work. Therefore, an off-policy RL controller integrated with an IBVS controller is developed for accurate and robust tip-tracking control of the TLFM.
The objective of this paper is to develop vision-based tip-tracking control of the TLFM through a novel adaptive intelligent IBVS controller. The main contributions are as follows.
• An intelligent controller with off-policy reinforcement learning (RL) is developed to guarantee that the object remains within the camera FOV for accurate tip-tracking control of the TLFM.
• An adaptive intelligent IBVS (AI-IBVS) controller is implemented as a composite controller to enable self-learning and decision-making for robust tip-tracking control of the TLFM.
The remaining sections of the paper are structured as follows.

Dynamics of TLFM
The dynamics of the TLFM are given by [5]

M(z) z̈ + c(z, ż) + K z + D ż = u,  c = [c_1^T, c_2^T]^T,  (1)

where M, c_1, c_2, K, and D in (1) are, respectively, a positive-definite symmetric inertia matrix, the Coriolis and centrifugal force vectors, the stiffness matrix, and the damping matrix. The detailed theoretical TLFM model conversion and a comprehensive explanation of the matrices in (1) are given in Appendix A.
In state-space form, the dynamics of the TLFM (1) can be expressed as

ẋ(t) = f(x(t)) + g(x(t)) u(t),
y(t) = l(x(t)),  (2)

where x(t) ∈ R^{2n} represents the state vector, y(t) ∈ R^m represents the output vector (tip position), u(t) ∈ R^n denotes the control input, f(x(t)) ∈ R^n is the drift dynamics of the TLFM, g(x(t)) ∈ R^{n×m} is the input dynamics, and l(x(t)) is the output dynamics. A comprehensive explanation of the matrices in (2) is provided in Appendix A.

Assumption 1
The system (2) has the following properties:

Lemma 1 If f(x(t)) is Lipschitz and f(0) = 0 (Assumption 1), which is a typical assumption to ensure that the solution x(t) of the system (2) is unique for any finite initial condition, then property (3) of Assumption 1 is satisfied for the system (2). On the other hand, some physical systems do meet this condition even though property (4) is not appropriate for the considered nonlinear system (TLFM).

Problem formulation
The aim is to create a control input u(t) for the system (2) such that the state x(t) tracks a desired trajectory x_d(t) while stabilizing the TLFM (by controlling link vibration). The tracking error is defined as

e(t) = x(t) − x_d(t).  (3)

The control input u(t) for the robust tip-tracking control (RTTC) problem is composed of an RL component and an IBVS component (4), where u(t) denotes the TLFM's behavior policy that has to be modified. To bring the object within the FOV, the RL control input u_rl(t) is used to correct the tip position of the TLFM.
To accomplish the visual servoing operation, the IBVS control input u_sp(t) is used.
The formulation of the RTTC problem can be split into two subproblems for the TLFM when taking into account the overall dynamics of the system (2).

Problem 1
The control input is intended to correct the position of the TLFM's tip for the system (2) in order to maintain the object within the FOV. Consider the following cost function

J(t) = ∫_t^∞ ψ^{τ−t} [ x^T(τ) Q_1 x(τ) + u_rl^T(τ) R_1 u_rl(τ) ] dτ,  (5)

where R_1 = R_1^T > 0 and Q_1 ≥ 0 are weighting matrices, and 0 < ψ ≤ 1 is the constant used to discount future costs.
The Hamilton-Jacobi-Bellman (HJB) equation associated with (5) can be used to determine the input u_rl(t).

Remark 1
It is not possible to encode input constraints into the optimization problem by employing a non-quadratic performance function, since only the feedback part of the control input u_rl(t) is acquired by minimizing the cost function (5).

Remark 2
Note that the singular perturbation (SP) approach [33] uses the gap between the fast and slow variables to separate the overall dynamics into two reduced-order systems. The work in [5] presents the decomposition of the TLFM dynamic model into two time scales (slow and fast subsystems) by the SP approach.

Problem 2
The control input u_sp(t) for the system (2) is intended to (i) ensure perfect tracking and (ii) account for link vibration (for system stabilization). The control input u_sp(t) can be written as

u_sp(t) = u_s(t) + u_f(t),  (6)

where u_f(t) and u_s(t) are the control inputs for the fast and slow subsystems, respectively.

Remark 3
The RTTC problem for the slow subsystem is to make x(t) track the desired trajectory x_d(t) with minimum tracking error; the desired trajectory x_d(t) is achieved if e(t) → 0.
Therefore, a new formulation that provides both control inputs concurrently needs to be created. Owing to RL's ability to address the RTTC problem without requiring in-depth knowledge of the system dynamics, it has been successfully used in a variety of practical applications.

Solution to the robust tip-tracking control problem
In this section, two controllers for Problems 1 and 2 are designed. An actor-critic-based off-policy reinforcement learning controller is developed and the new two-time scale IBVS controller [5] is utilized to deal with Problems 1 and 2, respectively. The proposed composite controller is termed the adaptive intelligent IBVS (AI-IBVS) controller.

Reinforcement learning
In RL, action-value methods have three major limitations that cause problems in real-time applications and in their convergence. First, their target policies are deterministic, whereas many problems have stochastic optimal policies. Second, for a large action space, it is very difficult to find the greedy action with respect to the action-value function. Third, a small variation in the action-value function can result in a major deviation in the policy, which causes convergence issues in some real-time applications [34].
To overcome the limitations of action-value methods, actor-critic methods are utilized. The on-policy actor-critic policy gradient algorithm has been successfully used for learning in continuous action spaces in many robotics applications [35], but it does not take advantage of off-policy learning. Off-policy algorithms make it possible to collect data from a behavior policy while learning a target policy. Off-policy actor-critic algorithms are more advantageous for real-time applications than both action-value methods and on-policy actor-critic algorithms, because they represent the policy explicitly; as a result, the policy can be stochastic and large action spaces can be handled [34].
The memory structure of actor-critic techniques is independent, allowing them to represent the policy without regard to any value function. The actor is called the policy structure because it is used to update the control policy. The critic is called the estimated value function because it is used to criticize the actions made by the actor.
In recent years, neural networks (NNs) have been widely employed for the control design of uncertain nonlinear systems, since NNs can approximate well with little system knowledge. This ability helps to cope with the nonlinearity and uncertainty present in the TLFM; therefore, NNs are used for approximation in the present work. The proposed RL controller comprises two NNs: an actor NN for generating the control input by estimating the uncertain parameters or system information, and a critic NN for approximating the cost function. For a continuous function f(Z): R^k → R, the following NN with Gaussian activation functions is applied. It has been established that an NN is capable of estimating any continuous function over a compact set Ω_z ⊂ R^k to any desired precision as

f(Z) = W*^T S(Z) + ε_b,

where S(Z) is the Gaussian activation vector, ε_b is the bounded estimation error, and W* is the ideal constant weight.
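The linear-in-weights approximation above can be sketched numerically. The following is a minimal illustration, not the paper's implementation, of fitting a Gaussian-basis NN f̂(Z) = Ŵ^T S(Z) to a continuous function; the target function, basis width, and number of centers are all assumed for the example.

```python
import numpy as np

def gaussian_basis(z, centers, width=1.0):
    """Gaussian (RBF) activation vector S(Z) for input z."""
    d2 = np.sum((centers - z) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * width ** 2))

def nn_approx(z, W, centers, width=1.0):
    """Linear-in-weights NN approximation f_hat(Z) = W^T S(Z)."""
    return W @ gaussian_basis(z, centers, width)

# Fit the weights by least squares to approximate f(z) = sin(z) on [-pi, pi]
centers = np.linspace(-np.pi, np.pi, 15).reshape(-1, 1)
zs = np.linspace(-np.pi, np.pi, 200)
Phi = np.stack([gaussian_basis(np.array([z]), centers) for z in zs])
W = np.linalg.lstsq(Phi, np.sin(zs), rcond=None)[0]
err = np.max(np.abs(Phi @ W - np.sin(zs)))  # the bounded estimation error eps_b
f0 = nn_approx(np.array([0.0]), W, centers)  # approximation at z = 0
```

With a dense grid of centers the residual err stays small, illustrating the universal-approximation property invoked above.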

Off-policy RL algorithm
In order to develop the off-policy algorithm, an augmented system and value function need to be constructed. To determine the tracking error defined in (3), the desired trajectory is assumed to satisfy ẋ_d(t) = ζ(x_d(t)). With the augmented state Z(t) = [e^T(t), x_d^T(t)]^T, an augmented closed-loop system can be constructed as

Ż(t) = F(Z(t)) + G(Z(t)) u(t).  (11)

The value function in terms of the states of the augmented system is then

V(Z(t)) = ∫_t^∞ ψ^{τ−t} [ Z^T(τ) Q_T Z(τ) + u^T(τ) R_T u(τ) ] dτ,  (12)

where Q_T ≥ 0 and R_T > 0 are weighting matrices.
For the off-policy RL algorithm, the augmented system dynamics (11) are rewritten as

Ż(t) = F(Z(t)) + G(Z(t)) u_j(t) + G(Z(t)) [u_rl(t) − u_j(t)],  (14)

where u_j(t) denotes the policy that needs to be updated, while the behavior policy u_rl(t) is the one actually applied to the system dynamics to produce the data for learning. Differentiating the value function V_j along the dynamics (14), and using the Bellman relation ∇V_j^T [F(Z) + G(Z) u_j] = −r(Z, u_j) with r(Z, u_j) = Z^T Q_T Z + u_j^T R_T u_j, gives

V̇_j(Z) = −r(Z, u_j) + ∇V_j^T(Z) G(Z) [u_rl − u_j].  (15)

Integrating both sides of (15) yields the off-policy RL Bellman equation

V_j(Z(t+T)) − V_j(Z(t)) = ∫_t^{t+T} { −r(Z, u_j) + ∇V_j^T(Z) G(Z) [u_rl − u_j] } dτ.  (16)

Equation (16), which can be solved for V_j and the improved policy u_{j+1} from measured data, yields the following off-policy RL algorithm.

Algorithm 1
Off-Policy RL Algorithm to Find the Solution of the HJB
1: procedure
2: Given an admissible policy u_0
3: for j = 0, 1, 2, ...: given u_j, solve for the value V_j and u_{j+1} using the off-policy Bellman equation; on convergence, set V_{j+1} = V_j
4: Go to 3.
5: end procedure
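Algorithm 1's evaluate-improve loop can be illustrated on a scalar linear-quadratic problem, where policy evaluation reduces to a scalar Lyapunov equation and the fixed point is the Riccati solution. This is a simplified, undiscounted sketch of the iteration structure only, not the TLFM controller; the system and weights (a, b, q, r) are assumed.

```python
# Policy iteration in the structure of Algorithm 1, on the scalar system
# x_dot = a*x + b*u with running cost q*x^2 + r*u^2 (illustrative values).
a, b, q, r = 1.0, 1.0, 1.0, 1.0

def evaluate_policy(K):
    """Policy evaluation: solve the scalar Lyapunov (Bellman) equation
    2*(a - b*K)*P + q + r*K**2 = 0 for the value P of the policy u = -K*x."""
    return (q + r * K * K) / (-2.0 * (a - b * K))

K = 2.0  # admissible (stabilizing) initial policy u_0, since a - b*K < 0
for _ in range(50):
    P = evaluate_policy(K)     # step 3: solve for the value V_j
    K_next = b * P / r         # policy improvement: u_{j+1} = -(b*P/r)*x
    if abs(K_next - K) < 1e-12:
        break                  # convergence: set V_{j+1} = V_j and stop
    K = K_next
# K converges to the algebraic Riccati solution P* = 1 + sqrt(2)
```

Each pass evaluates the current policy and improves it, mirroring steps 3-4 of Algorithm 1; the off-policy version replaces the model-based evaluation with the data-based Bellman equation (16).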
The actor-critic structure is utilized to approximate the value function and control policy in order to implement the off-policy RL Algorithm 1. The design of the actor-critic structure is given in Sect. 3.2.

Design of actor-critic-based off-policy reinforcement learning controller
Problem 1 is resolved by developing an actor-critic-based off-policy reinforcement learning controller, whose structure is depicted in Fig. 1. In Fig. 1, the actor updates the desired control policy to minimize the cost function; the critic approximates the reward function/current state information and the cost function; and the behavior policy selects/generates the action data (control input) while learning about the target policy for the TLFM. The behavior policy may be unrelated to the target policy that is evaluated and improved.

Design of critic NN
Following the cost function (5), the approximation error of the cost function (the Bellman residual) can be expressed as

γ(t) = φ(t) − (1/ψ) Ĵ(t) + ∇Ĵ^T(Z_c) Ż_c,  (18)

where φ(t) represents the instantaneous cost, Ĵ(t) = Ŵ_c^T S_c(Z_c) is the critic estimate of the cost function, and ∇ is the gradient with respect to Z_c. Equation (18) is also known as the Bellman equation.
Critic weight (Ŵ_c) update: the critic weight update law can be designed as

Ŵ̇_c = −l_c ∂E_c/∂Ŵ_c,  (19)

where E_c is the squared Bellman error [17], defined as

E_c = (1/2) γ^T(t) γ(t).  (20)

Substituting (20) into (19), one obtains

Ŵ̇_c = −l_c γ(t) Λ,  (21)

where l_c > 0 represents the learning rate of the critic NN and Λ = −(S_c/ψ) + ∇S_c Ż_c.

Design of actor NN
The dynamics of the TLFM (1) can be rewritten in the strict-feedback form

ẋ_1(t) = x_2(t),
ẋ_2(t) = x_3(t) + b u(t),  (27)

where x_1 and x_2 are the position and velocity states, x_3(t) lumps the unknown drift dynamics, and b is the input coefficient. To achieve the control objective, the tracking error variables e_1(t) and e_2(t) are defined as

e_1(t) = x_1(t) − x_{1d}(t),  e_2(t) = x_2(t) − α_1(t),  (28)

where x_{1d}(t) is the desired trajectory and α_1(t) is a virtual backstepping control variable for e_1(t).
Using (27), the derivative of (28) can be written as

ė_1(t) = e_2(t) + α_1(t) − ẋ_{1d}(t),  (29)

and the virtual control variable is selected as

α_1(t) = −k_1 e_1(t) + ẋ_{1d}(t),  (30)

with k_1 > 0. Define a candidate Lyapunov function V_1 = (1/2) e_1^2(t); its time derivative can be expressed as

V̇_1 = e_1(t) ė_1(t) = −k_1 e_1^2(t) + e_1(t) e_2(t).  (31)

To realize e_2(t) → 0, define the candidate Lyapunov function V_2 = V_1 + (1/2) e_2^2(t), whose derivative with respect to time is

V̇_2 = −k_1 e_1^2(t) + e_1(t) e_2(t) + e_2(t) ė_2(t).  (32)

To realize V̇_2 < 0, we choose

ė_2(t) = −k_2 e_2(t) − e_1(t),  (33)

where k_2 > 0 is a constant design parameter. Then (32) can be expressed as

V̇_2 = −k_1 e_1^2(t) − k_2 e_2^2(t) ≤ 0.  (34)

From (33), the desired control law can be designed as

u_d(t) = (1/b) [ −k_2 e_2(t) − e_1(t) − x_3(t) + α̇_1(t) ].  (35)

However, realizing the control law (35) requires the modeling information x_3(t), which is difficult to obtain in practical engineering. In order to estimate the unknown information, an actor NN is introduced.
So, the control law u_rl(t) can be redefined as

u_rl(t) = (1/b) [ −k_2 e_2(t) − e_1(t) − Ŵ_a^T S_a(Z_a) + α̇_1(t) ],  (36)

where Ŵ_a^T S_a(Z_a) is the actor NN estimate of x_3(t), Ŵ_a = W*_a + W̃_a is the neural weight estimate, and W*_a and W̃_a are the ideal weight and the instantaneous weight error, respectively.
The instantaneous estimation error is expressed as

ε_a(t) = Ŵ_a^T S_a(Z_a) − x_3(t).  (37)

Then, the actor NN error e_a can be designed as

e_a(t) = S_a^T(Z_a) Ŵ_a + κ_I [ Ĵ(t) − J_d(t) ],  (38)

where κ_I is a positive constant and J_d(t) ∈ R^{N+1} is the desired cost.
Actor weight (Ŵ_a) update: the actor weight update law can be designed as

Ŵ̇_a = −l_a ∂E_a/∂Ŵ_a,  (39)

where E_a = (1/2) e_a^T(t) e_a(t). Substituting (38) into (39), we get

Ŵ̇_a = −l_a S_a(Z_a) [ e_a(t) + ε_a(t) ]^T,  (40)

where l_a is the actor NN's learning rate. As ε_a is unavailable, the update law is redefined as

Ŵ̇_a = −l_a S_a(Z_a) e_a^T(t).  (41)
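The two update laws can be sketched as discrete (Euler) gradient steps. The signals below are synthetic placeholders merely to show that the critic step drives the Bellman residual γ toward zero; the dimensions, learning rates, and regression vectors are assumed, not taken from the paper's simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
lc, la, psi, dt = 0.6, 0.9, 0.9, 0.01   # learning rates, discount, Euler step
Wc = rng.standard_normal(6)             # critic weights W_c_hat
Wa = rng.standard_normal(6)             # actor weights W_a_hat

def critic_step(Wc, phi, Sc, dSc, Zc_dot):
    """One Euler step of the critic law W_c_dot = -l_c * gamma * Lambda,
    with Lambda = -S_c/psi + (dS_c/dZ_c) @ Zc_dot."""
    Lam = -Sc / psi + dSc @ Zc_dot
    gamma = phi + Wc @ Lam              # Bellman residual of (18)
    return Wc - dt * lc * gamma * Lam, gamma

def actor_step(Wa, Sa, e_a):
    """One Euler step of the actor law W_a_dot = -l_a * S_a * e_a (scalar e_a)."""
    return Wa - dt * la * Sa * e_a

# Frozen synthetic regressors: repeated critic steps shrink the residual
phi, Sc = 0.5, rng.standard_normal(6)
dSc, Zc_dot = rng.standard_normal((6, 3)), rng.standard_normal(3)
for _ in range(2000):
    Wc, gamma = critic_step(Wc, phi, Sc, dSc, Zc_dot)
Wa = actor_step(Wa, rng.standard_normal(6), 0.3)
```

In the actual controller the regressors S_c, S_a and the residual are regenerated from the measured state at every step rather than held fixed.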

Stability analysis
Define a candidate Lyapunov function V_c as

V_c = (1/2) W̃_c^T W̃_c.  (42)

Taking the time derivative of (42) and substituting (21), we have

V̇_c = −l_c γ(t) W̃_c^T Λ.  (43)

As γ(t) → 0, Eq. (18) becomes

φ(t) = (1/ψ) Ĵ(t) − ∇Ĵ^T(Z_c) Ż_c.  (44)

Substituting φ(t) from (44) into (43), one obtains V̇_c in terms of the tracking error. This means that when the tracking error e(t) is zero, V̇_c is negative semidefinite, i.e., V̇_c ≤ 0, which ensures stability.
The following lemma can be used to demonstrate the closed-loop system's boundedness.

Lemma 2 [16] A candidate Lyapunov function V_r(t) is bounded if the initial condition V_r(0) ≥ 0 is bounded, V_r(t) is continuous, and the following inequality is satisfied:

V̇_r(t) ≤ −λ V_r(t) + κ,  (46)

where λ and κ are both positive constants.
Define a candidate Lyapunov function as

V_r = V_2 + (1/2) tr( W̃_a^T W̃_a ).  (47)

Differentiating (47) along the error dynamics, substituting the actor update law (41), and using the bound on the estimated cost Ĵ, one obtains

V̇_r ≤ −λ V_r + κ,  (51)

where λ and κ are positive constants depending on the design parameters, I represents an identity matrix, B_r is a positive constant, and ‖ε_b‖ ≤ b_ε and ‖S_a‖ ≤ b_s bound the estimation error and the activation vector, respectively. Further, the design parameters must satisfy a condition ensuring κ > 0.
As per Lemma 2, V r (t) is bounded. Now, by using the subsequent theorem, the RL controller's boundedness is established.
Theorem 1 Consider the TLFM with the proposed RL controller. The signals e_1(t), e_2(t), W̃_c, and W̃_a are bounded, since the initial conditions are bounded. Moreover, e_1(t), e_2(t), W̃_c, and W̃_a eventually remain within the compact sets Ω_{e_1}, Ω_{e_2}, Ω_{W_c}, and Ω_{W_a}, respectively, whose radii follow from (47) and the bound V_r(t) ≤ V_r(0) + κ/λ implied by Lemma 2.

Design of new two-time scale IBVS controller
A new two-time scale IBVS control scheme [5] is utilized to address Problem 2. Its goal is to ensure tracking and to stabilize the system (damp out the vibration) in order to fulfil the visual servoing task.

Model decomposition by two-time scale perturbation method
According to the SP technique, the design of a feedback control system for an under-actuated system can be divided into two subsystems: a fast subsystem for compensating tip deflection/vibration and a slow subsystem for measuring and controlling the tip position. The state variables of the TLFM dynamic model (1) can be expressed using SP theory as slow components (indicated by overbars) plus fast components, where ε_s = 1/√k̄ is the SP parameter and k̄ is the common stiffness coefficient scale factor. The fast parts of the variables z_1 and z_2 are η_1 and η_2, respectively. The slow subsystem is described by (61). The fast subsystem, expressed in terms of η_1 and η_2 in the fast time scale T = t/ε_s, is given by (63), where H = M^{-1}, and u_f and u_s are the fast and slow control signals, respectively.
With respect to (61) and (63), the slow and fast components of the tip-position variables and the deflection variables evolve separately. Consequently, using composite control theory, the TLFM's control input can be written as

u(t) = u_s(t) + ū_f(t),  (64)

where ū_f and u_s are the fast and slow control inputs, respectively, with ū_f(x_1, 0, 0) = 0, i.e., the fast control signal is not needed during trajectory tracking with the slow subsystem (61).
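The composite-control idea can be illustrated on a toy two-time-scale model: a rigid (slow) coordinate q driven toward a setpoint by a PD slow input, and a stiff flexible (fast) coordinate z damped by a fast input, with the composite input u = u_s + u_f. All gains, the SP parameter, and the model itself are illustrative assumptions, not the TLFM dynamics of (1).

```python
# Toy composite control: slow PD tracking plus fast vibration damping.
dt, eps = 1e-3, 0.1          # Euler step and SP parameter eps_s
kp, kv, kf = 4.0, 4.0, 1.0   # slow PD gains, fast damping gain
q, dq, z, dz = 0.0, 0.0, 0.0, 0.0
q_ref = 1.0                  # desired slow (tip) setpoint

for _ in range(10000):       # 10 s of simulation
    u_s = -kp * (q - q_ref) - kv * dq   # slow control input
    u_f = -kf * dz                      # fast input damps the flexible mode
    u = u_s + u_f                       # composite control u = u_s + u_f
    ddq = u                             # rigid (slow) dynamics
    ddz = -z / eps**2 - dz / eps + u    # stiff (fast) flexible dynamics
    q, dq = q + dt * dq, dq + dt * ddq
    z, dz = z + dt * dz, dz + dt * ddz
# q settles at the setpoint while the flexible deflection z is damped out
```

Note how the fast input vanishes once the deflection rate dz is damped, consistent with ū_f(x_1, 0, 0) = 0 above.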

Slow subsystem controller
Shifted-moment-based IBVS is used to create u_s(t) for the slow subsystem. According to [36], two moment-based visual features are required to control the 2-DOF of the TLFM.
To adjust the 2-DOF of the TLFM and decrease sensitivity to data noise, low-order shifted-moment-based visual features are applied. These are polynomials of orders 2 and 3 constructed from shifted moments [37], given in (65) and (66).
By combining three different types of moment invariants (invariant to translation, to 2D rotation, and to scale), two shifted-moment visual features are chosen from the invariants in (65) and (66). The interaction matrix L_s for the two shifted-moment-based visual features that regulate the 2-DOF of the TLFM is given in (67). The analytical form of the interaction matrix corresponding to every moment can be calculated from a binary or segmented image.
The purpose of the shifted-moment-based IBVS controller is to ensure that the real visual features approach the desired visual features asymptotically. For the slow subsystem, the control input is designed to guarantee accurate tracking using the IBVS approach; Eq. (69) can be derived in a similar fashion as in [5]. In (69), L_s = L_{μs_{ij}} is the interaction matrix relating the shifted moments (67) of the tip to the position variables [5].
To achieve the objective of the shifted-moment-based IBVS controller, the problem is formulated in the following steps:
1. Initially, shifted-moment-based features are extracted from the pre-processed captured image.
2. The interaction matrix is estimated from the features extracted in the previous step.
3. The camera/tip velocity or acceleration for the robot controller is calculated from the estimated interaction matrix related to the visual features.
4. The camera/tip is then moved toward the desired position until the image-feature error is minimized. When the features align with the desired ones, the visual servoing task is finished.
Figure 2 shows the IBVS flow control algorithm, in which s* is the desired image features and s is the current value of the image features. For the closed-loop system (61), an IBVS-based shifted-moment control strategy must be constructed so that the output trajectory closely tracks the reference output trajectory. As stated in [38], the slow control input is designed as

u_s(t) = −λ L̂_s^+ (s − s*),  (70)

where λ > 0 is a gain and L̂_s^+ is the pseudo-inverse of the estimated interaction matrix.
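Steps 1-4 above reduce, in their simplest form, to the classical IBVS iteration u = −λ L̂⁺(s − s*). The sketch below uses an assumed constant 2×2 interaction matrix for two features; it is not the shifted-moment matrix (67), only an illustration of the loop in Fig. 2.

```python
import numpy as np

# Minimal IBVS iteration: drive the feature error e = s - s_star to zero.
lam, dt = 1.0, 0.05
L = np.array([[1.2, 0.3],
              [-0.4, 0.9]])      # assumed constant interaction matrix
s = np.array([3.0, -2.0])        # current image features
s_star = np.array([1.0, 1.0])    # desired image features

for _ in range(400):
    e = s - s_star
    u_s = -lam * np.linalg.pinv(L) @ e   # velocity command (slow input)
    s = s + dt * L @ u_s                 # feature kinematics: s_dot = L @ u
# with an exact interaction matrix, e decays exponentially at rate lam
```

In practice L̂ is only an estimate re-computed from the moments at each frame, so convergence is asymptotic rather than exactly exponential.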

Fast subsystem controller
Here, the fast subsystem of the TLFM is controlled by an LQR controller. A state observer is typically required in fast controllers to estimate the immeasurable modal coordinates. The best option for closed-loop stability and robustness against time delay is a Kalman filter based on a fast model that contains the first three modes and a fast feedback that dampens the first mode only [38]. For the fast subsystem, consider the following cost function

J_f = ∫_0^∞ ( η^T Q_2 η + u_f^T R_2 u_f ) dT,  (71)

where Q_2 and R_2 are positive-definite symmetric matrices. Minimizing the cost function (71) yields the fast subsystem control input

u_f = −K_f η,  (72)

where K_f is the feedback gain. Equation (72) can be derived in a similar fashion as in [5].
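An LQR gain of the form (72) can be computed by solving the continuous algebraic Riccati equation; below this is done with the Hamiltonian-matrix method in plain NumPy for a single lightly damped flexible mode. The mode parameters and weights (omega, zeta, Q2, R2) are placeholders, not the TLFM's modal data.

```python
import numpy as np

# One flexible mode eta = [deflection, deflection rate] of the fast subsystem.
omega, zeta = 10.0, 0.02
A = np.array([[0.0, 1.0], [-omega**2, -2 * zeta * omega]])
B = np.array([[0.0], [1.0]])
Q2 = np.diag([100.0, 1.0])        # state weighting
R2 = np.array([[0.1]])            # input weighting

# Solve the Riccati equation via the stable invariant subspace of the
# Hamiltonian matrix, then form the feedback gain u_f = -Kf @ eta.
Rinv = np.linalg.inv(R2)
H = np.block([[A, -B @ Rinv @ B.T],
              [-Q2, -A.T]])
w, V = np.linalg.eig(H)
stable = V[:, w.real < 0]                              # stable eigenvectors
P = np.real(stable[2:, :] @ np.linalg.inv(stable[:2, :]))
Kf = Rinv @ B.T @ P                                    # LQR gain of (72)
Acl = A - B @ Kf                                       # closed-loop fast matrix
res = A.T @ P + P @ A - P @ B @ Rinv @ B.T @ P + Q2    # Riccati residual (~0)
```

The closed-loop matrix Acl has strictly negative real-part eigenvalues, i.e., the fast feedback adds damping to the flexible mode as intended.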
The new two-time scale IBVS control law

u_sp(t) = u_s(t) + u_f(t)  (73)

is derived from (70) and (72) to solve Problem 2.

Proposed adaptive intelligent IBVS controller for TLFM
The new two-time scale IBVS controller presented in Sect. 3.3, which summarizes the work in [5], has the following practical problems: (1) it cannot guarantee the retention of the visual features within the camera FOV, and (2) increased controller gain increases the input torque, which causes the visual features to move out of the FOV more quickly, resulting in system instability and inaccurate performance. In this section, a novel adaptive intelligent IBVS (AI-IBVS) controller for robust tip-tracking control of the TLFM is designed to address this visibility issue. The proposed AI-IBVS controller is depicted in Fig. 3 and discussed in Sect. 3. To increase the reliability of vision-based tip-tracking control of the TLFM, the RL-based AI-IBVS controller combines the proposed RL controller (36) and the new two-time scale IBVS controller (73). The RL controller brings the visual features into the FOV by choosing the best control input, while the new two-time scale IBVS controller moves the tip of the TLFM toward the reference target. In particular, the AI-IBVS controller learns and chooses the best control input u(t) for the robot in the current state. The TLFM's RL controller receives the optimal control input to direct the visual features into a desirable or safe region of the image plane. After the TLFM takes an action, the reward is used to update the actor-critic weights for that action under the current world state; the reward is computed from the updated position of the visual features on the image plane.
The image plane in Fig. 4 is arranged as a discrete 16 × 12 grid with 40 pixels per cell. It is divided into three areas: desirable, safe, and undesirable. If the image features are present in the desirable/safe region, the new two-time scale IBVS controller is employed; otherwise, the RL controller is employed. As a result, the proposed AI-IBVS controller ensures the presence of the visual features inside the FOV.
When a vision sensor captures an image, the location of the visual features on the image plane is translated into grid-world coordinates using the formulation below. For each state, the RL controller is only expected to take one of two actions: a tip/camera rotational velocity ω_x or ω_y with a default value of 2 degrees per second. One of these actions is used in each stage or iteration depending on the location of the visual feature in the image.

Algorithm 2 AI-IBVS Algorithm (steps 7-17)
7: update critic weights (21)
8: update actor weights (41)
9: generate action data (control input)
10: u_rl(t) is computed from (36)
11: until (X, Y) ∈ the desirable/safe area
12: else if (X, Y) ∈ the desirable/safe area then
13: repeat
14: interaction matrix estimation
15: estimation of error vector
16: u_sp(t) is computed from (73)
17: until the visual servoing task is achieved.
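The grid-world bookkeeping used by Algorithm 2 can be sketched as follows. The floor-division mapping and the region predicate are assumed equivalents of the paper's grid mapping and undesirable-region definition (76) for a 640 × 480 image with 40-pixel cells; the exact formulas are not reproduced from the paper.

```python
def to_grid(x, y, cell=40):
    """Map a pixel location (x, y) to (X, Y) in the 16x12 grid world."""
    return x // cell, y // cell

def undesirable(x, y):
    """True when the feature lies in the undesirable border region, cf. (76)."""
    return x < 80 or x > 560 or y < 80 or y > 400

# Example: the initial object centroid of Task-1 at pixel (608, 224)
X, Y = to_grid(608, 224)        # grid cell (15, 5)
use_rl = undesirable(608, 224)  # True: the RL controller acts first
```

Once the feature's grid cell leaves the undesirable border, control switches to the two-time scale IBVS branch of the algorithm.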

18: end if
The environment rewards the TLFM after it takes an action. The reward (75) is assigned based on the placement of the visual features, where (X, Y) is the new grid-world coordinate of the features on the image plane after the TLFM acts. It is evident from (75) that the reinforcement signal rewards actions that keep the visual features inside the FOV by forcing them into the desirable part of the image plane, and punishes actions that leave them in the undesirable area. The AI-IBVS Algorithm 2 is used to accomplish the TLFM's vision-based tip-positioning task.

Results and discussion
In this section, the performance of the proposed AI-IBVS controller is analyzed through simulation studies. The proposed controller is evaluated using the Machine Vision Toolbox for MATLAB [39]. The physical TLFM parameters used in the simulation studies are listed in Table 1. Task-1 and Task-2 refer to tip positioning with symmetrical and non-symmetrical objects, respectively.

Training procedure
In the actor-critic-based off-policy RL controller, the critic NN and actor NN are fully connected NNs with an input layer, one hidden layer, and an output layer. Given that the size of the feature column is five, the input layer has six neurons. The two neurons in the output layer correspond to the two RL controller actions for each state. There are six neurons in the hidden layer. The learning rate l_c of the critic NN is set to 0.6 and l_a of the actor NN is set to 0.9.
For both the actor and critic NNs, six activation functions are present in the hidden layer and two in the output layer. Both networks are trained with the backpropagation algorithm, using a hyperbolic tangent (nonlinear) activation function in the hidden layer and a linear activation function in the output layer. The hyperbolic tangent activation function is differentiable and can therefore be readily employed in the backpropagation (derivative-based) learning algorithm. The output of the actor-critic network is the RL control input u_rl(t) for the TLFM. The RL control inputs of hub-2 for task-1 and task-2 are shown in Figs. 5 and 6, respectively.
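As a concrete illustration of the network shape described above (six inputs, six tanh hidden units, two linear outputs), a minimal forward pass and one backpropagation step can be sketched in NumPy. The random initial weights, learning rate, and targets below are illustrative placeholders, not the paper's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# 6-6-2 network: tanh hidden layer, linear output layer.
W1 = rng.standard_normal((6, 6)) * 0.1   # input -> hidden weights
W2 = rng.standard_normal((2, 6)) * 0.1   # hidden -> output weights

def forward(x):
    """Forward pass: tanh hidden activations, linear output."""
    h = np.tanh(W1 @ x)
    return W2 @ h, h

def backprop_step(x, target, lr=0.1):
    """One gradient-descent step on squared error. The tanh derivative
    (1 - h**2) appears in the backpropagated hidden-layer gradient,
    which is why a differentiable activation is required."""
    global W1, W2
    y, h = forward(x)
    e = y - target                       # output-layer error
    dh = (W2.T @ e) * (1.0 - h ** 2)     # backpropagated hidden error
    W2 -= lr * np.outer(e, h)            # output-layer update
    W1 -= lr * np.outer(dh, x)           # hidden-layer update
```

Repeated calls to `backprop_step` drive the squared output error down, mirroring how the actor and critic weights are updated in (41) and (21).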

Tip-tracking performance
The effectiveness of the proposed controller is evaluated for two distinct object shapes: a symmetrical object (rectangle) and a non-symmetrical object (whale). At the initial position of the visual servoing task, the object is not in the FOV. In this work, the TLFM uses the AI-IBVS controller to perform the tip-tracking task for both objects with a small undesirable area. The undesirable region is described as r < 80 or r > 560 or c < 80 or c > 400 (76). Figure 7 depicts the undesirable area, which is the outer part of the white bounding box; the remaining space is considered safe and desirable.

Tip-tracking performance for task-1
Figures 7 and 8 depict the desired position and initial position of task-1, respectively. Because the object centroid on the image is initially in an undesirable location, specifically at (608, 224), the RL controller is employed to correct the TLFM position. The history of the pixel coordinates of a visual feature is shown in Fig. 9. As seen in Fig. 9, the RL controller takes only six steps to bring the visual feature inside the image plane's safe area, i.e., within the FOV.
A new two-time-scale IBVS controller becomes active to finish the visual servoing task once the object enters the FOV. With the invariants r_s5 and r_s6 obtained from (66), the interaction matrix (67) is computed for the required position. Table 2 gives the initial and desired values of the selected image features. The observed condition number is 2.49, which is satisfactory.
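The condition numbers reported here can be computed as the ratio of the largest to the smallest singular value of the interaction matrix; a value near 1 indicates good conditioning, while a very large value signals proximity to a singularity. The sketch below is generic, and the 2×2 matrix used for demonstration is a hypothetical placeholder, not the matrix obtained from (67).

```python
import numpy as np

def condition_number(L):
    """Condition number of an interaction matrix: ratio of the largest
    to the smallest singular value."""
    s = np.linalg.svd(L, compute_uv=False)
    return s[0] / s[-1]

# Hypothetical interaction matrix, for illustration only.
L_demo = np.array([[1.0, 0.2],
                   [0.1, 0.8]])
```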

Tip-tracking performance for task-2
Figures 11 and 12 show the desired position and initial position of task-2, respectively.
Because the object centroid on the image is initially in an undesirable location, specifically at (585, 220), the RL controller is employed to adjust the TLFM position. The history of the pixel coordinates of a visual feature is shown in Fig. 13. As seen in Fig. 13, the RL controller takes only five steps to bring the visual feature into the image plane's safe area, i.e., within the FOV.
A new two-time-scale IBVS controller becomes active to accomplish the visual servoing task once the object enters the FOV. With the invariants r_s4 and r_s6 derived from (66), the interaction matrix (67) is computed for the required position. Table 2 gives the initial and desired values of the selected image features. The observed condition number is 3.89, which is satisfactory. The image feature errors are shown in Fig. 14. As can be seen in Fig. 14, for task-2 the feature errors converge to zero after 42 s.
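For both tasks, the IBVS stage drives the image-feature error to zero using the estimated interaction matrix. As a minimal sketch of this step, the classical IBVS velocity law v = -λ L̂⁺ e is shown below; the paper's control input (73) is built on this principle but is not necessarily identical, and the gain λ here is illustrative.

```python
import numpy as np

def ibvs_velocity(L_hat, e, lam=0.5):
    """Classical IBVS command: v = -lam * pinv(L_hat) @ e, where L_hat is
    the estimated interaction matrix and e the image-feature error.
    Driving e to zero completes the visual servoing task."""
    return -lam * np.linalg.pinv(L_hat) @ e
```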
The task-1 and task-2 results indicate that, when the visual feature is in an undesirable area or outside the FOV, the AI-IBVS controller quickly corrects the tip position of the TLFM, moving the visual feature across a significant distance into the safe area as quickly as possible so that the visual servoing task can be completed.
In addition, a detailed study of the coordinate vectors relative to the coordinate frames is included in Appendix B, which highlights the position and orientation (pose) of the object coordinate frame with respect to the base coordinate frame.

Comparison
The important differences between the proposed control scheme and other schemes [29–32] are as follows. First, the control schemes in [29–32] are not intended for flexible manipulators. Second, the hybrid scheme presented in [29] requires the robot manipulator to transition between two controllers, which is not advisable because it can damage the joints. Third, in [29,31], a typical Q-learning algorithm with offline training is implemented in the hybrid system; in [30], two RL algorithms with NNs are separately constructed; and in [32], an asymmetric actor-critic and a variational auto-encoder-based RL algorithm are designed, making those control schemes complex. The proposed AI-IBVS controller possesses self-learning and decision-making capabilities and provides a balanced performance in completing the visual servoing task similar to [29–32].

Conclusion
In this work, an adaptive intelligent IBVS (AI-IBVS) controller for a two-link flexible manipulator (TLFM) is developed. The challenges of IBVS and the retention of visual features in the FOV are specifically addressed. A judicious selection of shifted-moment-based visual features has been made in the new two-time-scale IBVS controller to address the singularity and local-minima problems of IBVS. Furthermore, to retain the object within the camera FOV, an intelligent controller with reinforcement learning (RL) is proposed. Moreover, a composite controller for the TLFM is developed by combining the RL controller and the IBVS controller. Simulations have been performed to investigate the performance and robustness of the proposed controller. The results demonstrate that the proposed controller can successfully complete the visual servoing task by quickly correcting the tip position to bring the object within the FOV. The proposed control scheme will be implemented and adapted on a real-time flexible manipulator in future studies.
Author contributions All authors contributed to the study's conception and design. Material preparation, data collection, and analysis were performed by Umesh Kumar Sahu, Dipti Patra, and Bidyadhar Subudhi. The first draft of the manuscript was written by Umesh Kumar Sahu and extended by Dipti Patra and Bidyadhar Subudhi. Umesh Kumar Sahu contributed the simulation evaluation. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Data availability Not applicable.
Code availability Not applicable.

Conflict of interest The authors declare no conflict of interest.
Ethics approval Not applicable.

Consent for publication Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A Dynamics of TLFM
The dynamics of an FLM constitute a distributed-parameter system owing to the distributed link flexure, which makes tip positioning and tracking of a TLFM very difficult. It is assumed that the TLFM moves in the horizontal plane and that the links have uniform material properties and constant cross-sectional area [40]. The schematic diagram of the TLFM with a tip-mounted camera is shown in Fig. 15, where X_b O_b Y_b is the fixed coordinate frame with the joint of link-1 located at the world coordinate frame X_w O_w Y_w. X_2 O_2 Y_2 and X̂_b Ô_b Ŷ_b are the rigid-body and flexible-body moving coordinate frames, respectively, of the ith link, fixed at the joint between link-1 and link-2. τ_i represents the applied torque of the ith link, θ_i represents the joint angle of the ith joint, and y_i(l_i, t) denotes the deflection along the ith link. When the tip position is taken as the output, the complete system behaves as a non-minimum phase system. The actual output vector y_pi is considered as the output for the ith link. Hence, the redefined output can be written as follows, where l_i is the length of the ith link.
The flexible links are modeled as Euler–Bernoulli beams with deformation y_i(l_i, t) for the ith link satisfying the link partial differential equation, where ρ_i and (EI)_i represent the density and flexural rigidity of the ith link, respectively. The finite-dimensional expression for y_i(l_i, t) can be presented using the AMM [1], where φ_ij and δ_ij denote the jth mode shape and modal coordinate of the ith link, respectively, and n is the number of assumed modes. The dynamics of the TLFM are derived using the energy principle and the Lagrangian formulation together with the AMM. The total Lagrangian L can be defined as follows, where q_i = [θ_i, θ̇_i, δ_i, δ̇_i] is the ith vector of generalized coordinates. In (A4), the total Lagrangian L, i.e., the difference between the total kinetic energy and the total potential energy of the TLFM, is substituted, and the equations are solved for the generalized coordinates q_i. The dynamics of the TLFM are expressed in (1). In the matrices and vectors of (1), M_rr and M_ff denote the positive definite submatrices related to the rigid and flexible variables, respectively, and M_rf = M_fr represents the coupling between the rigid and flexible displacement variables. The stiffness terms are k_ij = ω_ij² m_i, where ω_ij is the natural frequency of the jth mode of the ith link and m_i is the mass of the ith link. The damping matrix is D = diag{d_ij} for the jth mode of the ith link. θ_i and θ̇_i are the joint angle and velocity of the ith joint, respectively; δ_i and δ̇_i are the modal displacement and velocity of the ith link, respectively; and τ_i is the actual applied torque of the ith link.
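Collecting the terms described above, the partitioned structure of the TLFM dynamics takes the standard flexible-manipulator form sketched below. This is a sketch consistent with the submatrices M_rr, M_rf, M_ff, the stiffness terms k_ij, and the damping matrix D named in the text; the exact entries, and the Coriolis/centrifugal vector h(q, q̇), are those given in the paper's (1).

```latex
\begin{bmatrix} M_{rr} & M_{rf} \\ M_{fr} & M_{ff} \end{bmatrix}
\begin{bmatrix} \ddot{\theta} \\ \ddot{\delta} \end{bmatrix}
+ \begin{bmatrix} 0 & 0 \\ 0 & D \end{bmatrix}
\begin{bmatrix} \dot{\theta} \\ \dot{\delta} \end{bmatrix}
+ \begin{bmatrix} 0 & 0 \\ 0 & K \end{bmatrix}
\begin{bmatrix} \theta \\ \delta \end{bmatrix}
+ h(q, \dot{q})
= \begin{bmatrix} \tau \\ 0 \end{bmatrix},
\qquad K = \operatorname{diag}\{k_{ij}\},\quad k_{ij} = \omega_{ij}^2 m_i
```

Note that the torque τ acts only on the rigid (joint) coordinates, while the flexible modal coordinates δ are excited solely through the coupling M_rf, which is the non-collocation at the root of the tip-tracking difficulty.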
The matrices and vectors of the state-space model of the TLFM presented in (2) are as follows.

Appendix B Pose of Coordinate Frames
With reference to Fig. 15, the object coordinate frame {o} is described with respect to the base coordinate frame {b}. In the pose representation, the superscript denotes the reference coordinate frame and the subscript denotes the frame being described [39]. The pose of {o} relative to {b} can be expressed as a composition of relative poses, where ⊕ is used to indicate composition. For the TLFM, the pose of the object relative to the base coordinate frame is expressed in terms of R(·) and T(·), which represent the rotational and translational motion of a coordinate frame, respectively.
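The pose composition ⊕ described above can be illustrated with planar homogeneous transforms, where composition reduces to matrix multiplication. The functions R(·) and T(·) below are minimal stand-ins for the rotational and translational motions named in the text (not the toolbox's own functions), and the numeric pose is a hypothetical example.

```python
import numpy as np

def R(theta):
    """3x3 homogeneous matrix for a pure planar rotation by theta (rad)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def T(x, y):
    """3x3 homogeneous matrix for a pure planar translation (x, y)."""
    return np.array([[1.0, 0.0, x],
                     [0.0, 1.0, y],
                     [0.0, 0.0, 1.0]])

# Composition of relative poses (the 'oplus' operator) is the matrix
# product: an example pose of {o} relative to {b}, built from a
# translation followed by a rotation.
b_T_o = T(1.0, 0.5) @ R(np.pi / 2)
```

A point expressed in {o} is then mapped into {b} by multiplying its homogeneous coordinates by `b_T_o`, mirroring how the object pose is referred back to the base frame of the TLFM.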