A tractor-trailer parking control scheme using adaptive dynamic programming

This paper studies the online learning control of a truck-trailer parking problem via adaptive dynamic programming (ADP). The contribution is twofold. First, a novel ADP method is developed for systems with parametric nonlinearities. It learns the optimal control policy of the linearized system at the origin, while the learning process utilizes online measurements of the full system and is robust with respect to nonlinear disturbances. Second, a control strategy is formulated for a commonly seen truck-trailer parallel parking problem, and the proposed ADP method is integrated into the strategy to provide online learning capabilities and to handle uncertainties. A numerical simulation is conducted to demonstrate the effectiveness of the proposed methodology.


Introduction
Parking a truck-trailer is a problem frequently studied in the fields of automated and autonomous trucking, robotics, and nonlinear control (see, for example, [2,10,25,29,30]). In particular, the backward steering control of wheeled multi-vehicle systems has been studied using neural networks [23], fuzzy logic [11,35], and other learning algorithms. Different from adaptive cruise control, lane-keeping, lane-changing, and other control actions that typically happen on highways or secondary roads, truck-trailer parking maneuvers mostly occur in closed off-highway environments, such as cargo yards, distribution centers, or intermodal facilities. Thus, truck-trailer parking maneuvers have a few distinctive features. First, the vehicle speed is low, and the effects of tire slip can be ignored. Second, compared with lane-keeping tasks, much higher lateral accuracy is required for trailer parking maneuvers. Third, backing up a truck-trailer system involves dealing with a naturally unstable equilibrium [2]. Fourth, quite a few different types of uncertainties, such as wheelbase length, load balance, and worn chassis, can cause the truck-trailer dynamics to deviate from nominal models.
To address uncertainties and to apply data-driven approaches that gradually improve controller performance, this paper resorts to the theory of adaptive dynamic programming (ADP), a class of approximate methods for solving optimal control problems (see [5-7,27,31,36,41-44] and the references therein). ADP avoids the inherent curse of dimensionality of classical dynamic programming [4], and it is usually considered a class of reinforcement learning methods [33] in that it learns the optimal solution through iterations between approximating the value function and improving the action [24]. ADP has been extensively studied for Markov decision processes (see, for example, [6,27]), as well as for dynamic systems (see the review papers [21,39]). Stability issues in ADP-based control system design are addressed in [3,20,38]. A robustification of ADP, known as robust-ADP or RADP, has been developed by taking dynamic uncertainties into account [15,17]. Another related work on robust dynamic programming is reported in [8]. Sum-of-squares programming [28] has been introduced to ADP with a relaxation technique to achieve global asymptotic stabilization of nonlinear polynomial systems in [14]. ADP methods for different types of games have been studied in [32,37,40]. ADP-based tracking control designs can be found in [9,13,45], just to name a few. Some recent developments of ADP in control systems can be found in [16] and the references therein.
When dealing with static uncertain nonlinearities, neural networks and other universal approximators [12,26] are widely adopted in ADP to approximate the cost function and the control policy. However, ADP with universal approximators has at least two shortcomings. First, a large number of basis functions is usually required, which may incur a heavy computational burden and slow adaptation in the learning system. Second, when the target function to be approximated is treated as a black box, it is not trivial to prevent the approximation error from being amplified across iterations, especially in online implementations; in some cases, even a small approximation error can cause instability.
In practice, many engineering systems do not need to be treated as black boxes, because certain knowledge about the system, although limited, can be obtained prior to designing ADP-based controllers. Indeed, a high percentage of engineering systems, such as the truck-trailer system studied in this paper, can be parametrized with a known, small set of basis functions and uncertain parameters whose ranges can also be quantified. In this way, no heavy computation is needed during policy evaluation or policy improvement. Also, the potential approximation error is theoretically eliminated. Thus, the two shortcomings of universal approximation approaches in ADP-based online learning are addressed, as long as the system in question can be parametrized. This paper develops such an approach with detailed analysis.
In summary, the major contributions of this paper are twofold. First, a novel ADP methodology is proposed to learn the optimal solution of the uncertain linearized system, while at the same time to handle parametrized nonlinear uncertainties during online learning. Second, the proposed ADP method is incorporated into a truck-trailer parking control strategy, designed for a commonly seen truck-trailer parallel parking problem.
The remainder of this paper is organized as follows. The next section formulates the problem and introduces some basic results on optimal control, after which a novel ADP method for nonlinear systems with parametric uncertainties is developed. The subsequent section details a specific truck-trailer control problem, with analysis of its dynamics. Then a human-inspired control strategy to achieve parallel parking in the presence of parametric uncertainties is developed; this control strategy integrates the proposed ADP method. In the penultimate section, numerical simulation results validating the efficiency and effectiveness of the proposed method are summarized. The final section gives concluding remarks and points out potential topics for future work.
Notation Throughout this paper, we use $\mathbb{R}$ and $\mathbb{Z}_+$ to denote the sets of real numbers and non-negative integers, respectively. Vertical bars $|\cdot|$ represent the Euclidean norm for vectors, or the induced matrix norm for matrices. We use $\otimes$ to indicate the Kronecker product, and $\mathrm{vec}(A)$ is defined as the $mn$-vector formed by stacking the columns of $A \in \mathbb{R}^{n \times m}$ on top of one another, i.e., $\mathrm{vec}(A) = [a_1^T\ a_2^T\ \cdots\ a_m^T]^T$, where $a_i \in \mathbb{R}^n$ are the columns of $A$. A control law is also called a policy. A feedback gain matrix $K \in \mathbb{R}^{m \times n}$ is said to be stabilizing for the linear system $\dot{x} = Ax + Bu$ if the feedback matrix $A - BK$ is Hurwitz.

Problem formulation
This paper studies uncertain nonlinear systems that can be represented in the following form:

$$\dot{x} = A(x) + Bu, \tag{1}$$

where $x \in \mathbb{R}^n$ is the system state, $u \in \mathbb{R}^m$ is the control input, $A(x) \in \mathbb{R}^n$ is a smooth and uncertain state-dependent vector, and $B \in \mathbb{R}^{n \times m}$ is an uncertain constant matrix. The system is assumed to be controllable at the origin.

Remark 1 Without loss of generality, we can assume

$$A(x) = Ax + \Delta A\,\sigma(x), \tag{2}$$

where $A \in \mathbb{R}^{n \times n}$ and $\Delta A \in \mathbb{R}^{n \times q}$ are uncertain constant matrices, and $\sigma(x) \in \mathbb{R}^q$ is a known vector of linearly independent functions of $x$, vanishing at the origin.

The control objective is to design an ADP-based control system that learns, through online data, the optimal control policy minimizing the performance index

$$J(u) = \int_0^{\infty} \left( x^T Q x + u^T R u \right) \mathrm{d}t \tag{3}$$

for the system (1) linearized at the origin. It is assumed that there exists a constant matrix $C$ with suitable dimensions such that the weight matrix $Q \in \mathbb{R}^{n \times n}$ satisfies $Q = C^T C$ and the pair $(A, C)$ is observable. The other weight matrix $R \in \mathbb{R}^{m \times m}$ is required to be symmetric and positive definite.

Remark 2
$A$ and $B$ are referred to as uncertain constant matrices, in the sense that their precise values are not required to be known. In practice, it is always reasonable to have a good estimate of the range of uncertain parameters, and a stabilizing state-feedback gain $K_0$, although not necessarily optimal, can be assumed.

Remark 3
The formulated problem is strongly related to the robust-ADP [15] problem but is also slightly different. In this paper, no dynamic uncertainty is considered, and the goal is to learn the optimal control policy for the linearized model. The online learning process has to be robust against the nonlinear perturbation term $\Delta A\,\sigma(x)$. Further, the learned control policy is optimal for the linearized model, and the closed-loop system comprised of the original system (1) and the control policy is locally asymptotically stable at the origin.

Linear optimal control and policy iteration
By linear optimal control theory [22], solutions to the problem described in "Problem formulation" can be found by solving the well-known algebraic Riccati equation (ARE)

$$A^T P + P A + Q - P B R^{-1} B^T P = 0 \tag{4}$$

if $A$ and $B$ are accurately known. In addition, under the assumptions mentioned above, (4) has a unique symmetric positive definite solution $P = P^*$, and the optimal control policy is of the form

$$u = -K^* x, \tag{5}$$

where the optimal feedback gain matrix $K^*$ is determined by

$$K^* = R^{-1} B^T P^*. \tag{6}$$

One of the numerical methods for solving (4) was developed in [18] and is summarized in Theorem 1 below. This method is related to policy iteration as in reinforcement learning [34], since it starts with a stabilizing feedback control policy, and during each iteration the associated LQR cost is computed and then used for improving the policy.
Theorem 1 Let $K_0 \in \mathbb{R}^{m \times n}$ be any stabilizing feedback gain matrix, and let $P_k$ be the symmetric positive definite solution of the Lyapunov equation

$$(A - B K_k)^T P_k + P_k (A - B K_k) + Q + K_k^T R K_k = 0, \tag{7}$$

where the gains $K_k$, with $k = 1, 2, \ldots$, are defined recursively by

$$K_k = R^{-1} B^T P_{k-1}. \tag{8}$$

Then the following properties hold:

(1) $A - B K_k$ is Hurwitz;
(2) $P^* \le P_{k+1} \le P_k$;
(3) $\lim_{k \to \infty} K_k = K^*$, $\lim_{k \to \infty} P_k = P^*$.

The iteration described in Theorem 1 has guaranteed convergence. However, it requires perfect knowledge of the system matrices $A$ and $B$. A novel ADP methodology that implements the same iteration from online data measurements, without using the system matrices, is developed next.
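For reference, the iteration (7)-(8) admits a direct model-based implementation. The following MATLAB sketch assumes exact knowledge of (A, B) and uses lyap from the Control System Toolbox; the function name, tolerance argument, and iteration cap are our choices, not from the paper.

```matlab
function [K, P] = kleinman_pi(A, B, Q, R, K0, tol)
% Model-based policy iteration of Theorem 1 (Kleinman's algorithm).
K = K0;                               % K0 must be stabilizing
for k = 1:100
    Ac = A - B*K;                     % closed-loop matrix
    P  = lyap(Ac', Q + K'*R*K);       % policy evaluation, Eq. (7)
    Knew = R \ (B'*P);                % policy improvement, Eq. (8)
    if norm(Knew - K) < tol
        K = Knew;  return
    end
    K = Knew;
end
end
```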

Adaptive dynamic programming and parametric uncertainties
In this section, a novel approach to learn the linear optimal controller that solves the problem in "Problem formulation" will be developed. This approach makes use of the data generated from the nonlinear plant (1), without the need to identify any uncertain system parameter.
To begin with, let $K_k$ be a stabilizing control gain matrix, and let $P_k$ denote the symmetric, positive definite, and unique solution to the Lyapunov equation (7). Next, apply the following control policy

$$u = -K_k x + e, \tag{9}$$

with $e$ an exploration noise. Then, along the trajectories of the closed-loop system comprised of (1) and (9), it yields that

$$\frac{\mathrm{d}}{\mathrm{d}t}\left(x^T P_k x\right) = x^T (A^T P_k + P_k A)\, x + 2 u^T B^T P_k x + 2 \sigma^T\!(x)\, \Delta A^T P_k x. \tag{10}$$

Together with (7), it follows that

$$\frac{\mathrm{d}}{\mathrm{d}t}\left(x^T P_k x\right) = -x^T (Q + K_k^T R K_k)\, x + 2 (u + K_k x)^T B^T P_k x + 2 \sigma^T\!(x)\, \Delta A^T P_k x. \tag{11}$$

Next, combining with (8), and defining $L_k = P_k \Delta A$, we have

$$\frac{\mathrm{d}}{\mathrm{d}t}\left(x^T P_k x\right) = -x^T (Q + K_k^T R K_k)\, x + 2 (u + K_k x)^T R K_{k+1} x + 2 \sigma^T\!(x)\, L_k^T x. \tag{12}$$

Now, given any finite time interval $[t, t+\delta t]$, we can integrate both sides of (12) with respect to time on the interval to obtain

$$x^T P_k x \Big|_{t}^{t+\delta t} = \int_t^{t+\delta t} \left[ -x^T (Q + K_k^T R K_k)\, x + 2 (u + K_k x)^T R K_{k+1} x + 2 \sigma^T\!(x)\, L_k^T x \right] \mathrm{d}\tau. \tag{13}$$

It is easy to see that the pair $(P_k, K_{k+1})$ satisfying (7) and (8) must satisfy (13), which illustrates a way of solving for $(P_k, K_{k+1})$ by linear regression. Indeed, defining

$$\bar{u} = u + K_k x \tag{14}$$

and using the Kronecker product identity $a^T M b = (b \otimes a)^T \mathrm{vec}(M)$, (13) can be rewritten as

$$\left( x \otimes x \right)^T \Big|_{t}^{t+\delta t} \mathrm{vec}(P_k) - 2 \left( \int_t^{t+\delta t} (x \otimes \bar{u})^T \mathrm{d}\tau \right) (I_n \otimes R)\, \mathrm{vec}(K_{k+1}) - 2 \left( \int_t^{t+\delta t} (\sigma(x) \otimes x)^T \mathrm{d}\tau \right) \mathrm{vec}(L_k) = -\int_t^{t+\delta t} x^T (Q + K_k^T R K_k)\, x \, \mathrm{d}\tau. \tag{15}$$

It is not difficult to notice that if the same process of deriving (15) is applied to multiple time intervals, we can then obtain a set of equations in the form of (15) to solve for $P_k$, $K_{k+1}$, and $L_k$.
To see this, let

$$\delta_{xx} = \left[ (x \otimes x)\big|_{t_1}^{t_1+\delta t},\ \ldots,\ (x \otimes x)\big|_{t_{l_k}}^{t_{l_k}+\delta t} \right]^T, \tag{16}$$

$$I_{x\bar{u}} = \left[ \int_{t_1}^{t_1+\delta t} x \otimes \bar{u}\, \mathrm{d}\tau,\ \ldots,\ \int_{t_{l_k}}^{t_{l_k}+\delta t} x \otimes \bar{u}\, \mathrm{d}\tau \right]^T, \tag{17}$$

$$I_{x\sigma} = \left[ \int_{t_1}^{t_1+\delta t} \sigma(x) \otimes x\, \mathrm{d}\tau,\ \ldots,\ \int_{t_{l_k}}^{t_{l_k}+\delta t} \sigma(x) \otimes x\, \mathrm{d}\tau \right]^T, \tag{18}$$

$$\Theta_k = \left[ \delta_{xx},\ -2 I_{x\bar{u}} (I_n \otimes R),\ -2 I_{x\sigma} \right], \quad \Xi_k = -\left[ \int_{t_1}^{t_1+\delta t} x^T (Q + K_k^T R K_k)\, x\, \mathrm{d}\tau,\ \ldots,\ \int_{t_{l_k}}^{t_{l_k}+\delta t} x^T (Q + K_k^T R K_k)\, x\, \mathrm{d}\tau \right]^T, \tag{19}$$

where $0 < t_1 < t_2 < \cdots < t_{l_k}$, with $l_k$ a sufficiently large integer. Then we have

$$\Theta_k \begin{bmatrix} \mathrm{vec}(P_k) \\ \mathrm{vec}(K_{k+1}) \\ \mathrm{vec}(L_k) \end{bmatrix} = \Xi_k. \tag{20}$$

Note that if the linear equation (20), together with $P_k = P_k^T$, has a unique solution, then solving it amounts to solving both (7) and (8). Hence, let us impose the following assumption.
Assumption 3 Given a stabilizing $K_k$ and an exploration noise $e(t)$, there exists a sufficiently large integer $l_k > 0$ such that

$$\mathrm{rank}(\Theta_k) = \frac{n(n+1)}{2} + nq + nm. \tag{21}$$
Lemma 1 Under Assumption 3, given a stabilizing $K_k$, the $P_k = P_k^T$ and $K_{k+1}$ computed from (20) must satisfy both (7) and (8).
Proof First, as shown in the derivation of (15), any $(P_k, K_{k+1}, L_k)$ satisfying (7) and (8), with $L_k = P_k \Delta A$, is a solution of (20). Second, under Assumption 3, all the columns of $\Theta_k$ other than the $n(n-1)/2$ duplicated ones (arising from the symmetric entries of $\mathrm{vec}(P_k)$) are linearly independent. That means, if we restrict $P_k$ to be symmetric, the solution of (20) is unique. Hence, the $P_k$ and $K_{k+1}$ of that unique solution must satisfy (7) and (8).

Now we are ready to give an online policy iteration scheme. As in other policy-iteration-based schemes, a stabilizing feedback gain matrix $K_0$ is assumed.

Algorithm 1

(1) Initialization: Find a stabilizing feedback gain matrix $K_k$ with $k = 0$.
(2) Online Data Collection: Apply the control policy $u = -K_k x + e$ of (9) to system (1). Then construct the linear regression matrices $\Theta_k$ and $\Xi_k$ as in (16)-(19), incrementally increasing $l_k \in \mathbb{Z}_+$ until the rank condition (21) is satisfied.

(3) Policy Evaluation and Improvement: Solve for $P_k$, $K_{k+1}$, and $L_k$ from (20). Then go to Step (2) with $k$ replaced by $k + 1$.
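Under the data arrangement of (16)-(19), Step (3) reduces to a single least-squares solve. Below is a minimal MATLAB sketch; the function and variable names are ours, not identifiers from the paper's repository, and the data matrices are assumed to have been accumulated during Step (2).

```matlab
function [P, Knext, L] = adp_update(Dxx, Ixu, Ixs, c, K, R)
% One policy evaluation/improvement step, i.e., solving Eq. (20).
%   Dxx : l-by-n^2, rows of kron(x,x)' at t_i+dt minus at t_i
%   Ixu : l-by-n*m, rows of int kron(x, u+K*x)' dtau over [t_i, t_i+dt]
%   Ixs : l-by-n*q, rows of int kron(sigma(x), x)' dtau over [t_i, t_i+dt]
%   c   : l-by-1, int of x'*(Q+K'*R*K)*x dtau over [t_i, t_i+dt]
[m, n] = size(K);  q = size(Ixs, 2)/n;
Theta = [Dxx, -2*Ixu*kron(eye(n), R), -2*Ixs];  % regressor in (20)
z = pinv(Theta)*(-c);                 % min-norm least-squares solution
P = reshape(z(1:n^2), n, n);
P = (P + P')/2;                       % enforce the symmetry P = P'
Knext = reshape(z(n^2+(1:n*m)), m, n);          % improved gain K_{k+1}
L = reshape(z(n^2+n*m+(1:n*q)), n, q);          % estimate of P_k*DeltaA
end
```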
The convergence of Algorithm 1 is guaranteed under Assumption 3 and is summarized in the theorem below.

Theorem 2 Under Assumption 3 and given a stabilizing $K_0$, we have

$$\lim_{k \to \infty} P_k = P^*, \quad \lim_{k \to \infty} K_{k+1} = K^*, \quad \lim_{k \to \infty} L_k = L^*, \tag{22}$$

where $K_{k+1}$, $P_k$, and $L_k$ are obtained from Algorithm 1, for $k = 0, 1, 2, \ldots$, $P^*$ is the optimal solution of (4), $K^*$ is given by (6), and

$$L^* = P^* \Delta A. \tag{23}$$

Proof Under Assumption 3, the iterations in Algorithm 1 are equivalent to those in (7) and (8). Then, by Lemma 1 and Theorem 1, the results hold. Thus, the proof is complete.
In practical implementation, one can introduce a predefined threshold $\epsilon > 0$ and check whether $|K_{k+1} - K_k| \le \epsilon$, to determine if the exploration noise $e$ and online learning are still needed. Thus, the exploration/exploitation trade-off can be balanced. Indeed, a larger $\epsilon$ may lead to a shorter exploration time and therefore allows the system to implement the noise-free control policy sooner. On the other hand, a smaller $\epsilon > 0$ allows the learning system to better improve the control policy, but a longer learning time may be needed to achieve the desired convergence.
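In code, the resulting switch between exploration and exploitation can be as simple as the following sketch (epsilon and explore_noise are our placeholder names, not from the paper's code):

```matlab
if norm(Knext - K) <= epsilon
    u = -Knext*x;                     % converged: apply noise-free policy
else
    u = -Knext*x + explore_noise(t);  % keep exploring and learning
end
```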

Problem description
The truck-trailer system considered in this paper is one with an on-axle hitch, which lies on the rear axle of the truck [1]. A typical truck of this type is referred to as a terminal tractor, also known as a yard truck. It is an off-highway semi-tractor intended to move semi-trailers within a cargo yard, warehouse facility, or intermodal facility. One typical use case of a terminal tractor is shown in Fig. 1, where a trailer needs to be parked into a parallel parking spot. Due to space limitations, there are obstacles on each side, and the width of the aisle available for making maneuvers is usually no more than 20 m. There can be other sporadic obstacles in the aisle, such as over-length trailers and other temporarily parked tractors.

A truck-trailer model
We consider a truck-trailer system as shown in Fig. 2, which can be represented in the following form [19]:

$$\dot{x} = v \cos\theta, \tag{24}$$
$$\dot{y} = v \sin\theta, \tag{25}$$
$$\dot{\theta} = \frac{v}{D} \tan\delta, \tag{26}$$
$$\dot{\gamma} = -\frac{v}{L} \sin\gamma - \frac{v}{D} \tan\delta, \tag{27}$$

where $x$ and $y$ are the coordinates of the reference point (rear wheel axle) of the truck, i.e., the location of the truck kingpin; $v$ is the longitudinal velocity measured at the reference point; $\delta$ is the steering angle; $\theta$ is the orientation/heading of the truck; $\gamma$ is the relative angle between the truck and the trailer (note that $\gamma + \theta$ gives the orientation of the trailer); $D$ is the wheelbase of the truck and $L$ is the wheelbase of the trailer. Without loss of generality, our control objective is to drive the truck and trailer to the origin $(x, y, \theta, \gamma) = 0$. Indeed, if the target position is not at the origin of the current coordinate system, we can always create a new coordinate frame at the target with the desired truck heading as the new x-axis and establish the transformation between the two coordinate frames.
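A minimal MATLAB sketch of the kinematics (24)-(27), suitable for forward-Euler integration as in the simulation section (the function name and argument order are ours, not from the paper's repository):

```matlab
function dz = truck_trailer_ode(z, v, delta, D, L)
% Kinematic on-axle truck-trailer model (24)-(27).
% z = [x; y; theta; gamma], v = speed, delta = steering angle.
theta = z(3);  gamma = z(4);
dz = [ v*cos(theta);                           % Eq. (24)
       v*sin(theta);                           % Eq. (25)
       (v/D)*tan(delta);                       % Eq. (26)
      -(v/L)*sin(gamma) - (v/D)*tan(delta) ];  % Eq. (27)
end
```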

Longitudinal and lateral dynamics and control
The truck-trailer dynamics (24)-(27) are highly nonlinear and under-actuated. To simplify the problem, we separate the consideration of the longitudinal and lateral controllers. For longitudinal control, the speed is set as

$$v = -v_0\, \mathrm{sgn}(x), \tag{28}$$

where $v_0 > 0$ is a constant, and sgn is the sign function. In addition, as soon as the truck-trailer hits an obstacle, which is inflated with a safety margin, the speed is immediately set to zero. As for the lateral dynamics, it can be observed that if the steering control policy depends only on the four state variables, then the overall path geometry of the truck-trailer system is independent of the longitudinal velocity, as long as it is nonzero and does not change sign. Furthermore, the linearized system of (24)-(27) at the origin is uncontrollable. However, since the movement along the x-axis can mostly be taken care of by the longitudinal control, the lateral control policy only needs to focus on the lateral error dynamics (25)-(27), of which the linearized system at the origin is controllable. Indeed, with $u = \tan\delta$ and the speed normalized to $v = 1$, the lateral dynamics for forward movement become

$$\dot{y} = \sin\theta, \tag{29}$$
$$\dot{\theta} = \frac{u}{D}, \tag{30}$$
$$\dot{\gamma} = -\frac{\sin\gamma}{L} - \frac{u}{D}, \tag{31}$$

and those for backward movement ($v = -1$) are

$$\dot{y} = -\sin\theta, \tag{32}$$
$$\dot{\theta} = -\frac{u}{D}, \tag{33}$$
$$\dot{\gamma} = \frac{\sin\gamma}{L} + \frac{u}{D}. \tag{34}$$

It is easily understandable that driving a truck-trailer backward is much more difficult than driving it forward. This can be seen by comparing the $\gamma$-subsystems in (31) and (34). Indeed, with $u \equiv 0$, the linearized $\gamma$-subsystem has a negative eigenvalue $-1/L$ for forward maneuvers, but a positive eigenvalue $1/L$ for backward maneuvers. In other words, if a truck is moving forward along a straight line, the hitch angle $\gamma$ will converge to 0. However, if it moves backward along a straight line, $\gamma$ will quickly diverge.
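The controllability and instability claims can be checked numerically. A short MATLAB sketch for the linearization of the backward dynamics (32)-(34), using placeholder values for D and L (ctrb requires the Control System Toolbox):

```matlab
D = 3.0;  L = 6.0;                    % placeholder wheelbase values
% Linearization of (32)-(34) about the origin, state [y; theta; gamma]
A = [0 -1 0;
     0  0 0;
     0  0 1/L];
B = [0; -1/D; 1/D];
rank(ctrb(A, B))                      % returns 3: the pair is controllable
eig(A)                                % contains the unstable eigenvalue 1/L
```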

Steering control
Based on the analysis made in "Longitudinal and lateral dynamics and control", the forward and backward steering control policies are designed differently.
First, in the presence of the uncertain parameter $D$, of which the range is known, it is practical to design a linear control policy

$$u = -K_f \left[\, y \ \ \theta \ \ \gamma \,\right]^T \tag{35}$$

that locally stabilizes the subsystem comprised of (29) and (30). Due to the stable nature of the $\gamma$-subsystem, the overall system (29)-(31) together with the control policy (35) is locally asymptotically stable. Of course, one can make the control policy depend on $\gamma$, but the performance improvement is unlikely to be significant in the parallel parking problem. Second, for backward movement, the steering control policy to be designed is a linear feedback controller that stabilizes (32)-(34) and at the same time minimizes a given cost function (3). This problem cannot be solved directly using the conventional LQR approach, due to the uncertain parameters $D$ and $L$. Thus, we apply the ADP approach developed in this paper to find the control policy for backward steering.

A strategy inspired from human drivers
A standard parking strategy we observed from human drivers involves at least two intermediate spots, as shown in Fig. 3.

Fig. 3 Intermediate targets to achieve the parallel parking maneuver

To begin with, a human driver would first aim at a target to the left of the goal, making the trailer point approximately at the spot on the right side of the target spot. Once the truck-trailer gets sufficiently close to the final target after backing, the intermediate spot is then chosen to be the one at the center front of the goal. In other words, an experienced human driver would first drive towards intermediate target 1, at which the end of the trailer points towards the spot on the right side of the desired spot, and then drive backwards to reach the final target.
If the error at the end of the backward movement is still large, the forward movement towards intermediate target 1 is repeated. If, instead, the reference point of the truck has only a small lateral error, the next action to take is a forward movement towards intermediate target 2, followed by a backward movement to reach the final target. This back-and-forth adjustment can be repeated until the desired accuracy is met.

Here, we follow the general strategy a human driver would take, but design feedback controllers to automate the steering and speed control. In addition, the ADP-based methodology is incorporated into the steering control for backward movements.

High-level control strategy involving ADP
We assume all four state variables are instantaneously measurable. Indeed, the location of the truck and its orientation can be accurately measured by real-time kinematic (RTK) GPS. The hitch angle can be measured by a physical encoder, or by means of computer vision (see [10], for example).

Next, to solve the parallel parking problem, we propose a high-level control strategy as shown in Fig. 4. From the starting position, the truck-trailer first drives forward towards target 1, until the longitudinal criterion is reached or the safety margin between the truck and the obstacle is met. Then the truck starts to back up the trailer towards the dock door. When the backup is complete, the parking error is evaluated to decide whether a pull-up adjustment is needed and which target should be aimed for. The steering control policy is always fixed when the truck is moving forward. For backward driving, the steering control policy is updated via the proposed ADP method whenever a backing maneuver is finished.

Change of coordinates for non-origin targets
The proposed control design methodology is based on the assumption that the truck-trailer always needs to be stabilized at the origin. Therefore, when making the truck-trailer reach a target that is not at the origin of the current coordinate system, we need to compute the error signals in a new coordinate frame originating at the target, such that the desired control input can be correctly computed. Indeed, for a target pose $(x_d, y_d, \theta_d)$, we can simply perform the coordinate transformation

$$\bar{x} = \cos\theta_d\, (x - x_d) + \sin\theta_d\, (y - y_d), \quad \bar{y} = -\sin\theta_d\, (x - x_d) + \cos\theta_d\, (y - y_d), \quad \bar{\theta} = \theta - \theta_d, \quad \bar{\gamma} = \gamma, \tag{36}$$

and the feedback control gains are applied to these converted error signals.
Note that the steering angle $\delta$ and the control input $u$ remain the same under different coordinate frames.
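A minimal MATLAB sketch of the transformation (36) (the function name and the target argument layout are ours):

```matlab
function e = to_target_frame(z, target)
% Express z = [x; y; theta; gamma] in the frame of the target pose
% target = [xd; yd; thetad], whose x-axis is the desired truck heading.
c = cos(target(3));  s = sin(target(3));
e = [ c*(z(1) - target(1)) + s*(z(2) - target(2));   % xbar
     -s*(z(1) - target(1)) + c*(z(2) - target(2));   % ybar
      z(3) - target(3);                              % thetabar
      z(4) ];                                        % gamma is unchanged
end
```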

Simulation setup
The simulation is programmed and conducted in MATLAB R2020a. All the ordinary differential equations are solved using the forward Euler method with a fixed time step of 0.2 s. The simulation code is fully accessible in the GitHub repository https://github.com/yu-jiang/padp.
The truck wheelbase is set to 3.0 m and the trailer wheelbase is set to 11.0 m. Note that both values are unknown to the controller. The Q and R weight matrices are set to $Q = I_3$ and $R = 100$.
We assume the initial steering control policies, or the feedback gains, are computed for a nominal truck-trailer system with $D = 3.0$ m and $L = 6.0$ m. For backward steering, the initial gain is obtained by solving the LQR problem for the linearized backward dynamics with weight matrices $Q = I_3$ and $R = 1000$. As for forward steering, we only focus on the truck dynamics, i.e., (29) and (30) with $\gamma \equiv 0$, since the trailer subsystem is locally stable by itself when moving forward. We then set $Q = I_2$ and $R = 1000$, solve for the first two elements of the control gain, and let the third element be zero, which yields the forward controller gain. We simulated the "no-learning" as well as the "learning with ADP" scenarios, both starting with the same control policies. The "no-learning" scenario always keeps the same control policies, while the "learning with ADP" scenario keeps updating the backward steering control policy as soon as a backward maneuver is complete. In both scenarios, we set five as the maximum number of trials allowed.
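For illustration, the initial gains can be reproduced from the nominal model $D = 3.0$ m, $L = 6.0$ m as sketched below (lqr requires the Control System Toolbox; the exact numerical gains reported in the paper are not restated here):

```matlab
D = 3.0;  L = 6.0;                     % nominal (not true) wheelbases
% Backward steering: LQR on the linearized dynamics (32)-(34)
Ab = [0 -1 0; 0 0 0; 0 0 1/L];
Bb = [0; -1/D; 1/D];
Kb0 = lqr(Ab, Bb, eye(3), 1000);       % initial backward gain
% Forward steering: truck-only subsystem (29)-(30), third element zero
Af = [0 1; 0 0];
Bf = [0; 1/D];
Kf0 = [lqr(Af, Bf, eye(2), 1000), 0];  % forward gain, gamma entry = 0
```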

Simulation results
The simulation results are shown in Figs. 5 and 6. An animation of the full simulation can be found at https://youtu.be/CFPBQ_DP4Nc.
Each forward and backward movement combination in this simulation is referred to as a trial. In the first trial, one can see that in both cases (i.e., no learning, and ADP-based learning), the performance is the same. This is because the ADP-based control strategy uses the same initial control policy as the no-learning case. However, starting from the second trial, the ADP-based control strategy performs online learning and gradually modifies the control policy towards the optimal solution. After four trials, the ADP-based control strategy has parked the trailer into the spot with insignificant lateral error. Due to the control strategy, the intermediate point then switched to the front of the spot, and the last trial achieved the desired parking accuracy. On the other hand, without online learning and always under the initial control policy, the truck-trailer system did not make any notable progress after five trials, compared with the initial condition before trial one.
Finally, after five iterations, the learned feedback control gain matrix for backward steering was compared, for validation purposes, against the ideal feedback gain $K_b^*$ computed using the precise system matrices. Similarly, the estimated cost matrix after five iterations, $P^{(5)}$, was compared against the ideal optimal cost $P^*$. The learned gain and cost matrices closely approximate their ideal counterparts, and the approximation error is expected to be further reduced as the iterative process continues.

Conclusions and future work
In conclusion, this paper has presented a novel and practical ADP approach for nonlinear systems with parametric uncertainties. The proposed methodology makes use of online data directly measured from the nonlinear plant and can learn the optimal linear controller with respect to the system linearized at the origin. The methodology has then been integrated into a control strategy to achieve precise truck-trailer parking in the presence of parametric uncertainties. Several related topics deserve further investigation in the future. First, it is interesting to extend the proposed ADP method to more general truck-trailer maneuvers, such as alley-dock backing, in which curves are to be tracked. Second, since this paper only incorporated ADP into the backward steering control, it would be very useful to see if ADP-like ideas can be developed to dynamically choose the intermediate points and actively avoid obstacles. Finally, conducting real-world experiments with trucks and trailers would further demonstrate the effectiveness of the proposed method.

Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.