Collision probability reduction method for tracking control in automatic docking / berthing using reinforcement learning

Automation of berthing maneuvers in shipping is a pressing issue as the berthing maneuver is one of the most stressful tasks seafarers undertake. Berthing control problems are often tackled via tracking a predefined trajectory or path. Maintaining a tracking error of zero under an uncertain environment is impossible; the tracking controller is nonetheless required to bring vessels close to desired berths. The tracking controller must prioritize the avoidance of tracking errors that may cause collisions with obstacles. This paper proposes a training method based on reinforcement learning for a trajectory tracking controller that reduces the probability of collisions with static obstacles. Via numerical simulations, we show that the proposed method reduces the probability of collisions during berthing maneuvers. Furthermore, this paper shows the tracking performance in a model experiment.


Introduction
Autonomous vessels are becoming increasingly important due to crew shortages, cost reduction, and safety considerations. In particular, the automation and autonomy of berthing maneuvers are significant issues because berthing is one of the most stressful maneuvers seafarers undertake.
For a solution to the berthing control problem to be useful in practical situations, the solution must be generated in real time. However, this is challenging due to computational constraints and the complexity and uncertainty of maneuverability. Therefore, the berthing control problem has often been tackled in two stages: trajectory planning and trajectory tracking. In this paper, a trajectory is defined as a time series of state variables that is subject to spatial and temporal constraints taking the vessel dynamics into account. The generation of a trajectory is called trajectory planning, and the tracking of a given trajectory is called trajectory tracking. In particular, the desired trajectory of a berthing maneuver is called the berthing trajectory. By preparing the berthing trajectory in advance, the berthing control problem can be solved in real time solely by tracking the prepared trajectory.
However, tracking the predefined berthing trajectory is not straightforward because the vessel speed changes significantly, the effect of wind disturbance increases at low speed, and complex maneuvers, including backward and crabbing motion, are frequently required. Therefore, an essential task is to develop a tracking controller that can safely track a given berthing trajectory.

Related Research
Considerable research has been conducted on trajectory tracking control of surface vessels [1-8]. For instance, much work exists on the application of nonlinear model predictive control to trajectory tracking control [4,7], and on tracking control based on backstepping methods for robust control in situations where uncertainties due to disturbances exist [3,5,6]. Studies have also been undertaken on dynamic positioning (DP) control, focusing on situations where the vessel speed is near zero [2].
Research on applying reinforcement learning (RL) to surface vessel control has recently increased [9-14]. RL is an area of machine learning and can be used to find the optimal action maximizing a reward function in an environment containing uncertainty. Although RL requires training, it does not require real-time optimization computation because it uses function approximation via a neural network (NN). Martinsen et al. implemented an algorithm based on the deep deterministic policy gradient method to obtain a controller that minimizes the tracking errors on straight and curved paths in situations involving unknown currents [10,11]. Subsequently, Martinsen et al. proposed an RL-based control scheme for trajectory tracking control that achieves both DP control and path following [12].
However, berthing trajectories typically consist of complex motions such as turning, stopping, backward motion, spinning, and crabbing. Therefore, training on trajectories that consist of any combination of these motions is required.
Additionally, collision avoidance is required in berthing maneuvers. The purpose of berthing is to bring the vessel close to the desired berth, which in itself represents an obstacle, so even a small tracking error can cause a collision. Moreover, it is impossible to keep tracking errors equal to zero in an uncertain environment.
Some studies on collision avoidance of surface vessels have already been undertaken [13,14]. However, those studies treated path following and collision avoidance as a dual-objective problem. In this case, path following and collision avoidance are in a trade-off relationship, and the vessel may not be able to reach the berthing point because it avoids collision with the berth. Therefore, in berthing maneuvers, it is effective to avoid the tracking errors that increase the probability of collisions.

Scope of this study
The purpose of this study is to develop a methodology that avoids the tracking errors that may cause collisions with static obstacles. This paper proposes a training method for a trajectory tracking controller that reduces the probability of collision with static obstacles. The main contributions of this paper are as follows:
1. Random trajectories generated from maneuvering simulations were introduced to permit trajectory tracking controllers to track complex trajectories.
2. A method to generate static pseudo-obstacles depending on the desired trajectory was proposed.
The proposed method requires a training environment including a maneuvering model and the response characteristics of the steering systems. However, it is unnecessary to prepare a large number of berthing trajectories and obstacle information for training; only the berthing trajectory and obstacle information in the target port are required.
The remainder of the paper is organized as follows: Section 2 describes the training method of the trajectory tracking controller. Section 3 describes how the trajectory tracking controller can be applied to berthing maneuvers. Section 4 reports and analyzes the numerical simulation and model experiment results of berthing maneuvers. Finally, Section 5 presents the conclusions of the study.

Method
The subject ship in this study is the model ship shown in Fig. 1. The length and breadth of this model ship are defined as $L$ and $B$, respectively. This ship is equipped with a single propeller, VecTwin rudders, and a bow thruster. The actuator state is defined as $\boldsymbol{u} \equiv (\delta_p, \delta_s, n_p, n_{bt})^T \in \mathbb{R}^4$, where $\delta_p$ represents the rudder angle on the port side, $\delta_s$ represents the rudder angle on the starboard side, $n_p$ is the propeller revolution number, and $n_{bt}$ is the bow thruster revolution number. In this study, the response characteristics of the steering rudder are taken into account. The control commands are defined as $\boldsymbol{u}_{cmd} \equiv (\delta_{p,cmd}, \delta_{s,cmd}, n_{p,cmd}, n_{bt,cmd})^T \in \mathbb{R}^4$.
The upper and lower limits of the actuator state variables are defined in Table 1. Note that in this study, the propeller revolution number is constant at 10 rps because the VecTwin rudder system enables various motions by changing the rudder angles while maintaining a constant propeller revolution number.
In this study, the motion of the vessel in the harbor is modeled by the surge-sway-yaw three-degrees-of-freedom equations of motion. The coordinate systems are the earth-fixed coordinates, represented by $o_0 - x_0 y_0$, and the ship-fixed coordinates, represented by $o - xy$, which have their origin at midship. The coordinate systems used in this study are shown in Fig. 2. Here, the vessel state vector is defined as $\boldsymbol{x} \equiv (x_0, u, y_0, v_m, \psi, r)^T \in \mathbb{R}^6$, where $\psi$ is the heading angle, $u$ is the surge velocity, $v_m$ is the sway velocity at midship, and $r$ is the yaw angular velocity. For convenience, the vessel state vector $\boldsymbol{x}$ is divided into the pose vector, $\boldsymbol{\eta} \equiv (x_0, y_0, \psi)^T \in \mathbb{R}^3$, and the velocity vector, $\boldsymbol{v} \equiv (u, v_m, r)^T \in \mathbb{R}^3$. The observed vessel state vector is defined as $\hat{\boldsymbol{x}} \in \mathbb{R}^6$. The vector of true wind speed and direction is defined as $\boldsymbol{w} \equiv (U_T, \gamma_T)^T \in \mathbb{R}^2$, where $U_T$ represents the true wind speed, and $\gamma_T$ is the true wind direction. In this study, the true wind speed and direction are assumed to depend on time but not on space.

Optimization of the trajectory tracking controller
In this study, the trajectory tracking problem of minimizing the tracking error is formulated as a maximization problem related to a reward. This section describes the formulation used here.
This paper focuses on the trajectory tracking of the pose vector, $\boldsymbol{\eta}$. Here, the desired pose vector is defined as $\boldsymbol{r} \in \mathbb{R}^3$.
The time series of the desired pose vector can be represented as follows:
$$R \equiv \{\boldsymbol{r}_k\}_{k=0,1,\dots,N}, \tag{1}$$
where $k$ denotes the discrete time step, and $N$ is the total number of steps in the desired trajectory. The time of the $k$-th step is defined as $t_k$, and the time step between $t_k$ and $t_{k+1}$ is defined as $\Delta t$. The time step between successive decisions in the RL algorithm is also equal to $\Delta t$.
In this study, the geometric information related to static obstacles is assumed to be given. The static obstacles are represented by multiple polygons, which can be defined as
$$O \equiv \{S_i\}_{i=1,\dots,N_{obs}}, \tag{2}$$
where $S_i$ represents the $i$-th obstacle polygon, and $N_{obs}$ represents the total number of obstacle polygons.
In RL, the state variable for decision-making is often represented by $\boldsymbol{s}$ and the action variables by $\boldsymbol{a}$; these conventions are adopted in this work. In this study, the trajectory tracking controller makes decisions depending on the observed vessel state, $\hat{\boldsymbol{x}}$, the actuator state, $\boldsymbol{u}$, the desired trajectory, $R$, and the static obstacles, $O$. Thus, the state variable and the action variables at the $k$-th step are defined by Eq. (3), in which $\boldsymbol{f}_s$ is a function that generates the input of the controller; this function is described in Section 2.3. The action can then be determined as $\boldsymbol{a}_k = \boldsymbol{\mu}(\boldsymbol{s}_k)$, where $\boldsymbol{\mu}$ is the policy function. This function is the trajectory tracking controller and is represented by a NN.
This study considers trajectory tracking as an episodic task in RL. Thus, an objective function with a finite-time horizon can be defined as
$$J(\boldsymbol{\mu}) = \mathbb{E}_{\tau \sim \rho_{\boldsymbol{\mu}}}\left[\sum_{k=0}^{N-1} \gamma^{k}\, r(\cdot)\right], \tag{4}$$
where $\gamma$ is the discount rate, and $r(\cdot)$ is the reward function, which depends on the tracking error, the control effort, the geometric relationship between the vessel and the obstacles, and the elapsed time of the episode, $t_k$. The reward function $r(\cdot)$ is described in detail in Section 2.4. $\tau$ is the time series of the state and action variables throughout the episode and is defined as $\tau = (\boldsymbol{s}_0, \boldsymbol{a}_0, \dots, \boldsymbol{s}_{N-1}, \boldsymbol{a}_{N-1}, \boldsymbol{s}_N)$. $\rho_{\boldsymbol{\mu}}$ can then be written as
$$\rho_{\boldsymbol{\mu}}(\tau) = p(\boldsymbol{s}_0) \prod_{k=0}^{N-1} p(\boldsymbol{s}_{k+1} \mid \boldsymbol{s}_k, \boldsymbol{a}_k), \tag{5}$$
where $p(\boldsymbol{s}_0)$ represents the probability distribution of the initial state, $\boldsymbol{s}_0$, observed from the environment, and $p(\boldsymbol{s}_{k+1} \mid \boldsymbol{s}_k, \boldsymbol{a}_k)$ represents the state transition probability distribution. These distributions were implemented in the numerical simulations presented in this work and are described in detail in Section 2.2.
The objective function described in Eq. (4) represents the expected cumulative reward obtained in an uncertain environment. In this study, the trajectory tracking controller that maximizes this objective function is sought. Although solving this optimization problem is difficult, various suitable optimization methods have been proposed in the area of artificial intelligence. In this study, the twin-delayed deep deterministic policy gradient (TD3) algorithm [15] was used as the optimization method.
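As a small illustration of the expected cumulative reward, the following Python sketch computes the discounted return of one episode and a Monte Carlo average over several episodes. The function names are hypothetical, and the per-step rewards are assumed to be already available as plain lists; this is not the TD3 training code used in the study.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over one episode (k = 0, 1, ...)."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of the expected discounted return.

    `episodes` is a list of per-episode reward lists; episodes may have
    different lengths because of early termination.
    """
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)
```

Averaging over sampled episodes stands in for the expectation over trajectories; the more episodes are simulated, the closer this average should come to the objective.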

Training Environment
This section describes the training environment used in this work. In training, the desired trajectory and the static obstacles are generated for each episode. After that, maneuvering simulations are conducted. The methods used for the generation of the desired trajectory and static obstacles are described in Section 2.2.1. The environment of the maneuvering simulation is described in detail in Section 2.2.2.

Desired trajectory and pseudo-obstacle
In training, the desired trajectories, $R$, are generated stochastically for each episode, and the static pseudo-obstacles, $O$, are generated depending on the desired trajectories. This section describes the generation methods used for these variables.
The manner in which the desired trajectories were generated is described first. The target of this research is to train the trajectory tracking controller for berthing maneuvers. It would be reasonable to train on a set of desired berthing trajectories in the target harbor. However, collecting many berthing trajectories is a time-consuming task. Moreover, training based on a small number of specific trajectories is undesirable because it may be necessary to change the desired trajectory due to practical considerations. Therefore, in this paper, trajectories obtained from a maneuvering simulation subject to random control inputs are introduced.
The trajectories generated for training are required to include complex motions such as turning and backward motion. In this maneuvering simulation, control inputs were determined randomly at each time step of the simulation. The distribution followed by the control inputs was determined so that the time average of the thrusts generated by the actuators was around zero. Thus, the bow thruster revolution number was drawn from the uniform distribution over the interval listed in Table 1. The port and starboard rudder angles were drawn from the normal distributions $\mathcal{N}(75, 30^2)$ and $\mathcal{N}(-75, 30^2)$, respectively. Here, the mean values of these distributions were set to the values at which the net forces and the moment are nearly zero.
The initial state of the maneuvering simulation was determined randomly. In this study, the initial vessel velocity was drawn from the uniform distribution whose interval is listed in Table 2. The initial vessel pose was set to zero. In this maneuvering simulation, wind disturbance was neglected.
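The random control inputs described above can be sketched in Python as follows. The means and standard deviations are those stated in the text (units assumed to be degrees); the function name, the clipping to actuator limits, and the limit values passed by the caller are assumptions of this sketch, since the intervals of Table 1 are not reproduced here.

```python
import random


def clip(value, lower, upper):
    """Restrict value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


def sample_random_commands(n_steps, port_limits, stbd_limits, bt_limits, seed=None):
    """Draw one random command (delta_p, delta_s, n_bt) per time step."""
    rng = random.Random(seed)
    commands = []
    for _ in range(n_steps):
        # N(75, 30^2) and N(-75, 30^2) as stated in the text; clipping to
        # the actuator limits is an assumption of this sketch.
        delta_p = clip(rng.gauss(75.0, 30.0), *port_limits)
        delta_s = clip(rng.gauss(-75.0, 30.0), *stbd_limits)
        # Bow thruster revolution: uniform over its admissible interval.
        n_bt = rng.uniform(*bt_limits)
        commands.append((delta_p, delta_s, n_bt))
    return commands
```

Feeding such a command sequence into the maneuvering simulation, with the initial velocity drawn from the interval of Table 2, would yield one random desired trajectory per episode.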
Table 2: Initial velocity interval of the uniform distribution used to generate the desired trajectory.

The generation method of the static obstacles is described next. In this study, the static obstacles used in the training procedure were automatically generated according to the desired trajectory. Here, we impose that the desired trajectory does not contact any obstacle. We also require that the generated obstacles consist of line segments that are all longer than the vessel length. A schematic of the proposed scheme is presented in Fig. 3, and the scheme can be summarized in the following steps:
1. A grid is set based on the earth-fixed coordinates $o_0 - x_0 y_0$. Here, one of the grid points is set to coincide with the origin of the coordinate system. The grid spacing along both axes is set to $2L$, and the number of grid elements is set such that the grid covers the region where the desired trajectory exists.
2. The area where obstacles must not exist is then determined according to the desired trajectory. In this study, obstacles must not exist in the area through which the vessel passes when the motion of the desired trajectory is performed. Here, the vessel shape is represented by an ellipse whose semi-major axis is $0.75L$ and whose semi-minor axis is $B$. Additionally, obstacles must not exist in a circular area of radius $1.9L$ centered at the origin of the coordinate system; this condition avoids collisions due to initial tracking errors.
3. All grid cells through which the vessel does not pass are defined as static obstacles.
Fig. 3: The generation method of static pseudo-obstacles used in this work.
An example of the generated trajectory and pseudo-obstacles is shown in Fig. 4.
Fig. 4: An example of a desired trajectory with the static pseudo-obstacles used in training.
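The grid-based procedure of this section can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in the study: the occupied cells are found by sampling points inside the vessel ellipse, which may miss a cell that is only clipped at a corner, and the function names and the `margin` parameter (how many cells of obstacles surround the free region) are hypothetical.

```python
import math


def sample_ellipse(pose, semi_major, semi_minor, n_ang=24, n_rad=4):
    """Sample points inside an ellipse centred at (x, y) and rotated by psi."""
    x0, y0, psi = pose
    points = [(x0, y0)]
    for i_rad in range(1, n_rad + 1):
        scale = i_rad / n_rad
        for i_ang in range(n_ang):
            theta = 2.0 * math.pi * i_ang / n_ang
            bx = scale * semi_major * math.cos(theta)  # body-fixed x
            by = scale * semi_minor * math.sin(theta)  # body-fixed y
            points.append((x0 + bx * math.cos(psi) - by * math.sin(psi),
                           y0 + bx * math.sin(psi) + by * math.cos(psi)))
    return points


def pseudo_obstacle_cells(trajectory, L, B, margin=1):
    """Return (free_cells, obstacle_cells) as sets of integer grid indices."""
    h = 2.0 * L  # step 1: grid spacing 2L, one grid point at the origin

    def cell_of(x, y):
        return (math.floor(x / h), math.floor(y / h))

    free = set()
    # Step 2: cells touched by the vessel ellipse along the desired trajectory.
    for pose in trajectory:
        for x, y in sample_ellipse(pose, 0.75 * L, B):
            free.add(cell_of(x, y))
    # Step 2 (continued): circular area of radius 1.9L around the origin.
    for x, y in sample_ellipse((0.0, 0.0, 0.0), 1.9 * L, 1.9 * L):
        free.add(cell_of(x, y))
    # Step 3: every remaining cell of the covering grid becomes an obstacle.
    i_min = min(i for i, _ in free) - margin
    i_max = max(i for i, _ in free) + margin
    j_min = min(j for _, j in free) - margin
    j_max = max(j for _, j in free) + margin
    obstacles = {(i, j)
                 for i in range(i_min, i_max + 1)
                 for j in range(j_min, j_max + 1)
                 if (i, j) not in free}
    return free, obstacles
```

Each obstacle cell is a square of side $2L$, so every boundary segment is longer than the vessel, which is consistent with the requirement stated above.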

Maneuvering simulation
The simulation environment was implemented considering a low-speed maneuvering model, steering response characteristics, wind disturbance, and artificial noise; the inclusion of all these elements leads to a realistic environment. This section describes the models used to generate each of these elements in detail.
First, the maneuvering model used in this study is described. The mathematical maneuvering group (MMG) model is widely used as a maneuvering model [16]. In this study, a model based on the MMG model for low-speed maneuvering was used. The MMG model used here consists of the following submodels. For the hull hydrodynamic forces, Yoshimura's unified model [17] was used. The forces induced by the propeller were modeled following the work of Kang [18]. For the forces induced by the rudders, Kang's model [18], which considers correlation effects between the port and starboard rudders, was applied. To model the forces induced by the bow thruster, Kobayashi's model [19], which considers the forward speed effect, was applied. The forces induced by the wind were modeled using Fujiwara's regression model [20]; this model estimates the wind force coefficients from the hull and superstructure geometry. All model parameters were derived from captive model tests, with the exception of the wind force coefficients. The MMG model used here is denoted by Eq. (6), in which the overdot denotes the derivative with respect to time, $t$.
Table 3: Uniform distribution intervals of the initial tracking error from the berthing trajectory.

Second, the response characteristics of the actuator state, $\boldsymbol{u}$, are described. The rudder steering system with which the subject model ship was equipped has the characteristic of approaching a command value at a constant speed. Therefore, in this study, the response characteristics of rudder steering were modeled using Eq. (7), where $K$ is the rudder steering speed, which was set to 20 deg/s. This value is based on measurements of the actual model ship system. The response characteristics of the bow thruster revolution number were neglected; the bow thruster revolution number is given as $n_{bt} = n_{bt,cmd}$.
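A constant-speed approach to the command value can be sketched as a rate limiter. The Python function below is a minimal illustration under the assumption that the rudder moves at exactly $K = 20$ deg/s until it reaches the command and then stays there; the function name is hypothetical.

```python
def step_rudder(delta, delta_cmd, dt, rate=20.0):
    """Advance a rudder angle [deg] toward its command at constant speed.

    `rate` is the steering speed K [deg/s]; the update is exact for a
    constant command over the interval dt.
    """
    max_change = rate * dt
    diff = delta_cmd - delta
    if abs(diff) <= max_change:
        return delta_cmd  # command reached within this time step
    return delta + max_change if diff > 0.0 else delta - max_change
```

For example, with a time step of 0.1 s the rudder can move by at most 2 deg per step, so a command 10 deg away is reached after five steps.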
The wind process was generated using the method proposed by Maki et al. [21]. In this method, a one-dimensional filter equation that uses the wind speed spectrum approximated via Hino's spectrum [22] is solved numerically using the Euler-Maruyama method. This method can generate the wind process if the time-averaged true wind speed and direction are given. Here, the time-averaged true wind speed and direction are defined as $\bar{U}_T$ and $\bar{\gamma}_T$, respectively. These variables are determined stochastically for each episode.
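To illustrate the Euler-Maruyama update only, the following Python sketch simulates a fluctuating wind speed around a given mean. It deliberately replaces the Hino-spectrum filter of Ref. [21] with a plain Ornstein-Uhlenbeck process; the parameters `theta` and `sigma` and the function name are hypothetical and are not taken from the paper.

```python
import math
import random


def simulate_wind_speed(mean_speed, n_steps, dt, theta=0.1, sigma=0.2, seed=None):
    """Euler-Maruyama simulation of a mean-reverting wind-speed fluctuation.

    The fluctuation z follows dz = -theta * z dt + sigma dW (a stand-in
    for the spectrum-based filter); the wind speed is mean_speed + z.
    """
    rng = random.Random(seed)
    fluct = 0.0
    speeds = []
    for _ in range(n_steps):
        d_w = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment over dt
        fluct += -theta * fluct * dt + sigma * d_w
        speeds.append(max(0.0, mean_speed + fluct))  # wind speed is non-negative
    return speeds
```

In the training environment, the mean wind speed would be redrawn for each episode, and an analogous process could be used for the wind direction.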
In this study, the maneuvering model described in Eq. (6) was solved numerically using the Runge-Kutta method, and the response characteristics described in Eq. (7) were solved analytically. Here, the time step of the numerical solutions is defined as $\Delta t_{sim}$; this time step was set to 0.1 s, which is shorter than the time step $\Delta t$. The initial value of the vessel state is given such that the tracking error with respect to the desired trajectory follows a uniform distribution over the interval listed in Table 3. The initial value of the actuator state is given as (0.0, 0.0, 0.0, 0.0). Furthermore, process noise was added to the MMG model, and observation noise was added to the observed vessel state. The process noise and the observation noise were defined to follow normal distributions, and their covariance matrices are represented by $\boldsymbol{\Sigma}_{sys} \in \mathbb{R}^{6 \times 6}$ and $\boldsymbol{\Sigma}_{obs} \in \mathbb{R}^{6 \times 6}$, respectively. The covariance matrix of the observation noise was determined from the nominal observation accuracy given in the equipment specifications.
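The sub-stepping of the simulation within one decision interval can be sketched as follows. The text does not state the order of the Runge-Kutta method, so the classical fourth-order scheme is assumed here; `f` stands for any state derivative function (for example, a maneuvering model), and all names are hypothetical.

```python
def rk4_step(f, x, u, dt):
    """One classical fourth-order Runge-Kutta step for dx/dt = f(x, u)."""
    def shifted(scale, k):
        return [xi + scale * ki for xi, ki in zip(x, k)]

    k1 = f(x, u)
    k2 = f(shifted(0.5 * dt, k1), u)
    k3 = f(shifted(0.5 * dt, k2), u)
    k4 = f(shifted(dt, k3), u)
    return [xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def advance_one_decision_step(f, x, u, dt_decision, dt_sim=0.1):
    """Integrate over one decision interval using sub-steps of dt_sim."""
    n_sub = max(1, round(dt_decision / dt_sim))
    for _ in range(n_sub):
        x = rk4_step(f, x, u, dt_sim)
    return x
```

In this sketch the input `u` is held constant over the sub-steps; in the environment described above, the actuator state would instead be updated at each sub-step according to the steering response, and process noise would be added.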

Input of the controller
This section describes the function $\boldsymbol{f}_s$ used in Eq. (3). In this study, the trajectory tracking controller makes decisions depending on the vessel state, the tracking error, and the geometric relationship between the vessel and the obstacles. In the following, these elements determining the state, $\boldsymbol{s}$, are described.
The tracking error is defined as the error between the vessel pose vector, $\boldsymbol{\eta}$, and the desired state, $\boldsymbol{r}$; this error is expressed in the ship-fixed coordinate system $o - xy$ and is denoted by $\boldsymbol{e}_{i,j}$ in Eq. (8), where $i$ and $j$ represent the time steps of the desired state and the vessel pose, respectively. The geometric relationship between the vessel and the obstacles is defined as the error between the vessel position and the obstacle point nearest to the desired position; this quantity, $\tilde{\boldsymbol{e}}_{i,j}$, is given by Eq. (9), which uses the position component of $\boldsymbol{\eta}_j$. The nearest obstacle point, $\boldsymbol{o}_{near,i}$, is defined by Eq. (10) using the position component of $\boldsymbol{r}_i$. In this study, $\boldsymbol{s}_k$ is defined to consist of the tracking error, $\boldsymbol{e}_{i,j}$, the distance to the static obstacle, $\tilde{\boldsymbol{e}}_{i,j}$, the vessel velocity, $\boldsymbol{v}_k$, and the actuator state, $\boldsymbol{u}_k$. $\boldsymbol{s}_k$ is thus expressed by Eq. (11), where $T_1, T_2, \dots$ represent arbitrary time horizons that are assumed to be positive multiples of $\Delta t$. Thus, this study also uses the error between the current vessel state and future desired states as an input of the controller. Therefore, the function $\boldsymbol{f}_s$ is defined via Eqs. (8) to (11).
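The two geometric quantities of this section can be sketched in Python as follows. The sign convention (desired minus actual), the wrapping of the heading error, and the function names are assumptions of this sketch; obstacle polygons are assumed to be given as lists of vertices in earth-fixed coordinates.

```python
import math


def wrap_to_pi(angle):
    """Wrap an angle [rad] to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def tracking_error_ship_frame(desired_pose, vessel_pose):
    """Pose error (desired minus actual) rotated into the ship-fixed frame."""
    xd, yd, psid = desired_pose
    x, y, psi = vessel_pose
    dx, dy = xd - x, yd - y
    e_x = math.cos(psi) * dx + math.sin(psi) * dy
    e_y = -math.sin(psi) * dx + math.cos(psi) * dy
    return (e_x, e_y, wrap_to_pi(psid - psi))


def closest_point_on_segment(p, a, b):
    """Point on the segment a-b that is closest to p."""
    vx, vy = b[0] - a[0], b[1] - a[1]
    length2 = vx * vx + vy * vy
    if length2 == 0.0:
        return a
    t = ((p[0] - a[0]) * vx + (p[1] - a[1]) * vy) / length2
    t = max(0.0, min(1.0, t))
    return (a[0] + t * vx, a[1] + t * vy)


def nearest_obstacle_point(desired_xy, polygons):
    """Obstacle boundary point nearest to the desired position."""
    best, best_d2 = None, float("inf")
    for poly in polygons:
        for a, b in zip(poly, poly[1:] + poly[:1]):  # closed polygon edges
            q = closest_point_on_segment(desired_xy, a, b)
            d2 = (q[0] - desired_xy[0]) ** 2 + (q[1] - desired_xy[1]) ** 2
            if d2 < best_d2:
                best, best_d2 = q, d2
    return best
```

The controller input would then stack such errors for the current and several future desired states together with the vessel velocity and the actuator state.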

Reward function
This section describes the reward function used in this study.
The reward function was designed to obtain a trajectory tracking controller that minimizes tracking errors and control effort. Furthermore, we propose a method to preferentially minimize tracking errors that may lead to a collision with static obstacles.
Fig. 5: Illustration of the relationship between the vessel and static obstacle (figure labels: Imaginary Obstacle Line; Des. Traj.).
One measure of tracking error is the norm of the error between the vessel pose, $\boldsymbol{\eta}$, and the desired state, $\boldsymbol{r}$. However, this measure requires the time-consuming task of adjusting various weights because the units of position and heading angle differ. Therefore, in this study, the measure of the tracking error was defined based on the errors of the bow and stern positions, as expressed in Eq. (12), where $\boldsymbol{g}_{bow} \in \mathbb{R}^2$ and $\boldsymbol{g}_{stern} \in \mathbb{R}^2$ denote the bow and stern positions, respectively; these variables are defined in Eq. (13). We also define a measure of the distance between the vessel and the static obstacles. Since the purpose of berthing is to bring the vessel close to the berth, a measure that prevents the vessel from approaching the berthing point is undesirable. For this reason, a measure that involves no trade-off between reducing tracking errors and avoiding collisions with the obstacles is defined here.
In this study, we introduce the imaginary obstacle line shown in Fig. 5; this line passes through $\boldsymbol{o}_{near,k}$, and its angle of inclination is equal to the desired heading angle. The normal vector on the $x_0 y_0$ plane orthogonal to the imaginary obstacle line is also introduced. This normal vector is defined as $\boldsymbol{n}_k \in \mathbb{R}^2$ and is calculated according to Eq. (14), where $\boldsymbol{e}_z = (0, 0, 1)^T$ is a unit vector orthogonal to the $x_0 y_0$ plane. The distance between the position of the desired state and the imaginary obstacle line, $l_k$, is expressed by Eq. (15), and the distances from the actual bow and stern positions to the imaginary obstacle line, $l_{bow,k}$ and $l_{stern,k}$, are expressed by Eq. (16). The indicators that increase as the vessel comes closer to the obstacles than the desired position can then be expressed as $l_k - l_{bow,k}$ and $l_k - l_{stern,k}$. Thus, the measures of nearness between the vessel and the static obstacles relative to the desired position are defined by Eq. (17). Note that the cases where the vessel position is farther from the obstacles than the desired position are ignored, since we focus only on reducing the tracking errors that increase the probability of collisions. Moreover, tolerances for the measures described in Eqs. (12) and (17) are introduced. These tolerances are defined in Eq. (18), where $e_0$ and $c_0$ are the initial tolerances, $e_\infty$ and $c_\infty$ are the tolerances in the limit of infinite elapsed episode time, and $b_e$ and $b_c$ are the attenuation factors of the tolerances.
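The nearness measures built on the imaginary obstacle line can be sketched as follows. Several details are assumptions of this sketch and are not taken from the paper: the bow and stern are placed half a ship length ahead of and behind midship, the distances are signed so that the desired position lies on the positive side of the line, and all function names are hypothetical.

```python
import math


def bow_stern_positions(pose, L):
    """Bow and stern positions, assumed to lie L/2 from midship."""
    x, y, psi = pose
    hx, hy = 0.5 * L * math.cos(psi), 0.5 * L * math.sin(psi)
    return (x + hx, y + hy), (x - hx, y - hy)


def nearness_measures(desired_pose, vessel_pose, o_near, L):
    """Return (c_bow, c_stern): how much closer bow and stern are to the
    imaginary obstacle line than the desired position (never negative)."""
    xd, yd, psid = desired_pose
    # Normal of a line inclined at the desired heading: e_z x (cos, sin, 0).
    n = (-math.sin(psid), math.cos(psid))

    def signed_distance(p):
        return n[0] * (p[0] - o_near[0]) + n[1] * (p[1] - o_near[1])

    l_des = signed_distance((xd, yd))
    side = 1.0 if l_des >= 0.0 else -1.0  # desired position on the positive side
    l_des *= side
    bow, stern = bow_stern_positions(vessel_pose, L)
    l_bow = side * signed_distance(bow)
    l_stern = side * signed_distance(stern)
    # Positions farther from the line than the desired position are ignored.
    return max(0.0, l_des - l_bow), max(0.0, l_des - l_stern)
```

Because the measures are clipped at zero, moving away from the obstacle is not rewarded; only the part of the tracking error that points toward the obstacle is counted.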
If the measures given in Eqs. (12) and (17) exceed their tolerances, no further rewards are given and the episode is stopped. In other words, the episode is terminated if any of the inequalities in Eq. (19) is not satisfied. The episode is also terminated if a collision occurs. The collision detection area is taken to be the vessel area defined in Section 2.2.1; if this area contacts an obstacle, the episode is terminated.
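A time-decaying tolerance and the corresponding termination check can be sketched as follows. The exponential form below is an assumption made for illustration (the text only names the initial tolerance, the limiting tolerance, and the attenuation factor), as are the function names; under this form, halving the decaying part after a given time fixes the attenuation factor as $\ln 2$ divided by that time.

```python
import math


def tolerance(t, tol_0, tol_inf, b):
    """Tolerance decaying from tol_0 at t = 0 toward tol_inf (assumed form)."""
    return (tol_0 - tol_inf) * math.exp(-b * t) + tol_inf


def attenuation_for_half_life(half_life):
    """Attenuation factor for which the decaying part halves in half_life."""
    return math.log(2.0) / half_life


def episode_continues(t, error_measures, nearness_measures, e_params, c_params):
    """True while every measure stays within its current tolerance.

    e_params and c_params are (tol_0, tol_inf, b) tuples for the tracking
    error measures and the nearness measures, respectively.
    """
    e_tol = tolerance(t, *e_params)
    c_tol = tolerance(t, *c_params)
    return (all(m <= e_tol for m in error_measures)
            and all(m <= c_tol for m in nearness_measures))
```

A tolerance that shrinks with elapsed time allows a large initial tracking error while still demanding accurate tracking later in the episode.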
The reward function can thus be defined as in Eq. (20), where $\lambda$ is a positive constant, $u_{c,i}$ represents the control input with the lowest control effort, and $u_{std,i}$ is a constant that normalizes each variable. Here, if $\lambda$ is large, the reward becomes negative. In such a case, the controller may obtain more reward when the episode is terminated early. For this reason, in this study, $\lambda$ is assumed to be small compared to the other terms.

Application to berthing maneuvers
In this paper, the proposed methods were evaluated in terms of tracking the berthing trajectory. This section describes the generation method of the berthing trajectories and the development of tracking control methods for the berthing trajectories. The target port of this study is the Inukai pond, which is located at Osaka University. The port geometry is shown in Fig. 6. In this study, the desired trajectories for berthing maneuvers were generated according to the trajectory planning method proposed by Miyauchi et al. [23]. In this algorithm, the trajectory was explored in the framework of optimal control theory [24,25] with the use of the covariance matrix adaptation evolution strategy (CMA-ES) (see, for example, Ref. [26]).
Here, the obtained trajectory is defined as in Eq. (21), where $N'$ is the total number of time steps. Note that the time step of the desired trajectory obtained may differ from the decision-making time step $\Delta t$ defined in Eq. (1). The time at which the $i$-th step occurs is defined as $t'_i$, and the time step between $t'_i$ and $t'_{i+1}$ is defined as $\Delta t'$. We prepared 44 trajectories with 4 different terminal conditions and 11 different initial conditions; these conditions are listed in Table 4. In the trajectory planning, the limits of the actuator commands were changed from those listed in Table 1 to generate trajectories with sufficient margins of control forces. The limits of the actuator states used are shown in Table 5. This idea has been proposed by Kose et al. [27]. The obtained trajectories are shown in Fig. 7.

Selection of the desired state
In this study, we propose a method for passing the desired pose vector to the controller. In conventional trajectory tracking, the desired state progresses with time to imitate the motion of the desired trajectory. However, this approach may unnecessarily increase the vessel speed. In a berthing maneuver, the tracking control must not exceed the planned vessel speed, because the vessel may be unable to stop at the berthing point from a higher speed. Therefore, the desired pose vector was chosen from the desired trajectory according to the vessel position: when the vessel pose is $\boldsymbol{\eta}_k$, the desired states are given by Eq. (22), where the index $i_k$ is defined by Eq. (23) using only the position components of $\boldsymbol{r}'$. $I$ represents the search range and should be defined so as to avoid shortcuts when desired trajectories intersect. Note that $T_1, T_2, \dots$ must also be positive multiples of $\Delta t'$.
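A position-based selection of the desired state can be sketched as a nearest-point search restricted to a window of the trajectory. The window definition (from the previously selected index forward by the search range) and the function names are assumptions of this sketch; the look-ahead offsets are given in trajectory steps.

```python
def select_desired_index(trajectory_xy, vessel_xy, i_prev, search_range):
    """Index of the trajectory point nearest to the vessel position,
    searched only in [i_prev, i_prev + search_range]."""
    last = len(trajectory_xy) - 1
    lo = min(i_prev, last)
    hi = min(i_prev + search_range, last)

    def dist2(i):
        dx = trajectory_xy[i][0] - vessel_xy[0]
        dy = trajectory_xy[i][1] - vessel_xy[1]
        return dx * dx + dy * dy

    return min(range(lo, hi + 1), key=dist2)


def desired_states(trajectory, i_k, lookahead_steps):
    """Desired states at index i_k and at later indices, clamped at the end."""
    last = len(trajectory) - 1
    return [trajectory[min(i_k + n, last)] for n in [0] + list(lookahead_steps)]
```

Restricting the search to a forward window keeps the selected index from jumping to a later part of the trajectory that happens to pass close to the current position, which is the shortcut the search range is meant to prevent.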

Pseudo-obstacles in the target harbor
Owing to the characteristics of NNs, tracking performance is degraded in situations that differ from those experienced in training. The size of the obstacle-free space in the target port may differ from that in training. In this case, the scale of $\tilde{\boldsymbol{e}}_{k,k}, \tilde{\boldsymbol{e}}_{k+T_1/\Delta t,k}, \tilde{\boldsymbol{e}}_{k+T_2/\Delta t,k}, \dots$ included in Eq. (11) may differ between the training environment and the target port. Therefore, pseudo-obstacles were also generated in the target port to prevent tracking performance degradation.
The method described in Section 2.2.1 would lead to the generation of pseudo-obstacles that would significantly affect the berthing maneuver. Therefore, the elliptical area that was used to represent the vessel shape was changed to a circular area whose radius is $1.9L$. Since the generated pseudo-obstacles are dummies, collisions with pseudo-obstacles are ignored. An example generated by the method described here is shown in Fig. 8.
Fig. 8: A sample of the generated pseudo-obstacles in the target harbor.

Results
This section presents the training results of the proposed method and the results related to the berthing maneuver. We present the tracking results of the berthing trajectories obtained from both simulations and model experiments.
In this study, for comparison with the proposed method, we prepared a trajectory tracking controller that minimizes the tracking error and control effort in the absence of static obstacles. Here, $\tilde{\boldsymbol{e}}_{k,k}, \tilde{\boldsymbol{e}}_{k+T_1/\Delta t,k}, \tilde{\boldsymbol{e}}_{k+T_2/\Delta t,k}, \dots$ included in Eq. (11) are set to zero. The third and fourth terms of the proposed reward function described in Eq. (20) are also set to zero. This trajectory tracking controller is referred to as Ctrl-w/o-OBST, whereas the trajectory tracking controller obtained by the proposed method is referred to as Ctrl-w/-OBST.
The training results of Ctrl-w/o-OBST and Ctrl-w/-OBST are shown in Section 4.1, and the tracking results of the berthing trajectories using Ctrl-w/o-OBST and Ctrl-w/-OBST are shown in Section 4.2.

Training results of trajectory tracking controller
This section shows the training results of the proposed method. Ctrl-w/-OBST and Ctrl-w/o-OBST were trained five times each. The training was undertaken for a total of $3.0 \times 10^7$ s of simulation time.
The parameters used during training are summarized here. The environmental parameters are listed in Table 6, and the parameters describing the tracking controller and the NNs are listed in Tables 7 and 8, respectively. Here, the output dimension of the policy function is different from the dimension of the actuator command because the propeller revolution number is constant. The parameters of the objective function and the reward function are listed in Table 9.
Table 6 (excerpt): $\bar{U}_T$ is given by a Weibull distribution whose shape and scale parameters are 2.0 and 1.0, respectively; $\bar{\gamma}_T$ is given by a uniform distribution.
Table 7: Numerical values used for the parameters present in Eq. (11).
Here, the values of $e_0$ are set to be almost twice as large as the initial tracking error. The values of $b_e$ and $b_c$ are set such that the allowed range is halved after 50 seconds.
The parameters of the NNs acquired during the training were saved, and all saved parameters were then evaluated. In the evaluation, 20 episodes were simulated in the same environment as was used in the training episodes, and the average cumulative rewards were calculated. The average cumulative rewards and the 95% confidence interval for the five training runs are shown in Fig. 9. We see that the results do not deviate significantly from the 95% confidence interval. Therefore, the set of NN parameters with the highest average cumulative reward among the five training runs was selected as the controller. Note that the cumulative rewards of Ctrl-w/-OBST and Ctrl-w/o-OBST cannot be compared directly because the reward functions are different.
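The average cumulative reward and its 95% confidence interval can be computed as in the following Python sketch. How the interval in Fig. 9 was actually computed is not stated, so the normal approximation with $z = 1.96$ is an assumption; the function name is hypothetical.

```python
import math
import statistics


def mean_with_ci95(values, z=1.96):
    """Sample mean and a normal-approximation 95% confidence interval."""
    n = len(values)
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width
```

With only five training runs, a Student-t quantile (2.776 for four degrees of freedom) passed as `z` would give a wider and more conservative interval.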
Fig. 10: An example of a result obtained from the simulation of a berthing maneuver using the Ctrl-w/o-OBST algorithm. The terminal condition of the desired trajectory is $\boldsymbol{\eta}_{des,1}$, and the average wind speed, $\bar{U}_T$, is 0.0 m/s.
Fig. 11: An example of results obtained from the simulation of a berthing maneuver using the Ctrl-w/-OBST algorithm. The terminal condition of the desired trajectory is $\boldsymbol{\eta}_{des,2}$, and the average wind speed, $\bar{U}_T$, is 0.0 m/s.

Tracking results related to the berthing trajectories
This section shows the tracking results related to berthing trajectories. Here, for the evaluation, a shorter decision-making interval of $\Delta t = 1.0$ s was used, the time step of the berthing trajectory was set to $\Delta t' = 0.2$ s, and the search range used in Eq. (23) was set to $I = \Delta t / \Delta t'$. The simulation and model experiment results are shown in the following sections.

Simulation
We compare the results obtained using Ctrl-w/-OBST and Ctrl-w/o-OBST in terms of collision probability during the berthing maneuvers to demonstrate the effectiveness of the method proposed here. The collision probability is defined here as the probability of colliding with a static obstacle before the end of a given simulation.
Fig. 12: An example of a result obtained from the simulation of a berthing maneuver using the Ctrl-w/-OBST algorithm. The terminal condition of the desired trajectory is $\boldsymbol{\eta}_{\mathrm{des},1}$ or $\boldsymbol{\eta}_{\mathrm{des},3}$, and the average wind speed, $\bar{U}_T$, is 1.0 m/s.
Fig. 13: An example of results obtained from the simulation of a berthing maneuver using the Ctrl-w/-OBST algorithm. The terminal condition of the desired trajectory is $\boldsymbol{\eta}_{\mathrm{des},2}$ or $\boldsymbol{\eta}_{\mathrm{des},4}$, and the average wind speed, $\bar{U}_T$, is 1.0 m/s.
(a) Trajectories and the time series of control inputs
In this study, 100 trials of berthing maneuvers were conducted to calculate the collision probability. The simulation was terminated if a collision occurred or the elapsed simulation time reached 250 s. The collision detection was conducted using the detection area defined in Section 2.2.1. The initial value of the vessel state for each trial is given such that the tracking error with the given berthing trajectory follows a uniform distribution over the interval listed in Table 5. Collision probabilities were calculated for a given average wind speed to investigate the dependence of the collision probability on wind disturbance. Here, considering the wind pressure and the upper limit of the thrust that the subject ship can generate in bollard conditions, the collision probabilities were calculated with average wind speeds below 1.5 m/s. The obtained collision probabilities are listed in Table 10.
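The trial procedure above amounts to a Monte Carlo estimate: run a fixed number of trials, stop each one at the first collision or at the time limit, and report the fraction that collided. The sketch below shows only this bookkeeping. The names `estimate_collision_probability` and `toy_trial` are hypothetical, and `toy_trial` is a random stand-in with an arbitrary hazard rate, not the ship simulator or controller used in this study.

```python
import random


def estimate_collision_probability(run_trial, n_trials=100, t_max=250.0, seed=0):
    """Fraction of trials that end in a collision.

    `run_trial(rng, t_max)` is assumed to return True if a collision
    occurred before `t_max` seconds and False otherwise.
    """
    rng = random.Random(seed)
    collisions = sum(1 for _ in range(n_trials) if run_trial(rng, t_max))
    return collisions / n_trials


def toy_trial(rng, t_max, dt=1.0, hazard_per_s=0.001):
    """Hypothetical stand-in for one simulated berthing maneuver.

    At every step a collision happens with a small fixed probability;
    a real trial would instead step the vessel dynamics and test the
    collision detection area against the obstacles.
    """
    t = 0.0
    while t < t_max:
        if rng.random() < hazard_per_s * dt:
            return True  # collision: the trial is terminated early
        t += dt
    return False  # time limit reached without a collision


if __name__ == "__main__":
    print(estimate_collision_probability(toy_trial))
```

With 100 trials, an estimate p has a standard error of sqrt(p(1 - p)/100), so differences of a few percentage points between two settings should be read with some caution.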
In the case where the Ctrl-w/o-OBST was used, we see that the collision probability is relatively high in the berthing trajectories whose terminal conditions are $\boldsymbol{\eta}_{\mathrm{des},1}$ or $\boldsymbol{\eta}_{\mathrm{des},3}$, whereas the collision probability is low for the berthing trajectories whose terminal conditions are $\boldsymbol{\eta}_{\mathrm{des},2}$ or $\boldsymbol{\eta}_{\mathrm{des},4}$. To further investigate the tracking performance of the Ctrl-w/o-OBST method, a tracking result without the initial tracking error is shown in Fig. 10. This result shows that the slight heading error causes a collision near the berthing point. Therefore, in berthing maneuvers, the collision probability increases due to small tracking errors even if the performance of the tracking controller is high.
(a) Trajectories and the time series of control commands
In the case where the Ctrl-w/-OBST was used, we see that the collision probability is relatively high in the berthing trajectories whose terminal condition is $\boldsymbol{\eta}_{\mathrm{des},2}$. This is because, as shown in Fig. 11, there are many cases in which the collision detection area on the bow side contacts the obstacle during turning. A slight tracking error causes a collision during turning because these berthing trajectories are generated considering the collision detection area described in Section 2.2.1. The collision probabilities were therefore recalculated with the semi-major axis of the collision detection ellipse set to 0.5L. The obtained collision probabilities are listed in the brackets in Table 10. Under these conditions, the collision probability was found to be similar to those obtained for the other terminal conditions.
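To illustrate why the size of the detection ellipse matters, the following sketch tests whether an elliptical detection area touches a line-segment obstacle. It is not the detection procedure of Section 2.2.1: it assumes a filled ellipse centered on the vessel with its major axis along the heading, and the function name and all numerical values are hypothetical. The test maps the ellipse to a unit circle and checks the distance from the origin to the transformed segment.

```python
import math


def ellipse_hits_segment(center, heading, semi_major, semi_minor, p0, p1):
    """Return True if the segment p0-p1 touches or crosses the ellipse.

    Assumptions: the ellipse is filled, centered at `center`, and its
    major axis points along `heading` (radians, measured from the x-axis).
    """
    cos_h, sin_h = math.cos(heading), math.sin(heading)

    def to_unit_circle_frame(p):
        # Translate to the ellipse center, rotate into the ellipse axes,
        # then scale each axis so the ellipse becomes the unit circle.
        dx, dy = p[0] - center[0], p[1] - center[1]
        x_body = cos_h * dx + sin_h * dy
        y_body = -sin_h * dx + cos_h * dy
        return x_body / semi_major, y_body / semi_minor

    ax, ay = to_unit_circle_frame(p0)
    bx, by = to_unit_circle_frame(p1)
    abx, aby = bx - ax, by - ay
    seg_len_sq = abx * abx + aby * aby

    # Parameter of the point on the segment closest to the origin.
    if seg_len_sq == 0.0:
        t = 0.0
    else:
        t = max(0.0, min(1.0, -(ax * abx + ay * aby) / seg_len_sq))

    cx, cy = ax + t * abx, ay + t * aby
    return cx * cx + cy * cy <= 1.0
```

For a hypothetical vessel with L = 3.0 and a semi-minor axis of 0.5, a wall at a lateral distance of 0.6 from the centerline is not hit, `ellipse_hits_segment((0, 0), 0.0, 1.5, 0.5, (-5, 0.6), (5, 0.6))` being `False`, whereas a wall at 0.4 is. Because the map is affine, a straight segment stays straight, so the unit-circle test is exact for this geometry.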
Comparing the results obtained using Ctrl-w/-OBST and Ctrl-w/o-OBST, we can see that the collision probabilities of Ctrl-w/-OBST are smaller than those obtained for Ctrl-w/o-OBST, except when the terminal condition is $\boldsymbol{\eta}_{\mathrm{des},2}$. In the case where the terminal condition is $\boldsymbol{\eta}_{\mathrm{des},2}$, the collision probability for Ctrl-w/-OBST, once recalculated with the 0.5L semi-major axis, is comparable to those obtained for other terminal conditions. This indicates that the proposed method preferentially avoids tracking errors that lead to collisions with obstacles, and that it is effective for berthing maneuvers.
However, although the proposed method reduced the collision probability, the probability remains too high for practical use. This study evaluated only whether a collision would occur within 250 s and did not consider the effects of mooring time and fenders on the berth. Therefore, more detailed studies that define successful berthing operations and develop controllers meeting the corresponding safety requirements are needed.
Finally, samples of the tracking results obtained using the Ctrl-w/-OBST are shown in Figs. 12 and 13. In these samples, the initial error along the $y_0$-axis was 3.0 m, the average wind speed was 1.0 m/s, and the average wind direction was toward a berth at the berthing point. These results show that the controller was able to deal with an initial tracking error of about L. They also show that the same controller could track the desired trajectories with four types of terminal conditions.

Model experiment
In this study, the Ctrl-w/-OBST was evaluated via model experiments. In the model experiments, the berthing trajectories with terminal conditions of $\boldsymbol{\eta}_{\mathrm{des},1}$ and $\boldsymbol{\eta}_{\mathrm{des},2}$ were used. The tracking results related to these berthing trajectories are shown in Figs. 14 and 15, respectively.
In the case of Fig. 14, the wind speed increased as the model ship approached the berthing point, and reached 2.0 m/s. As a result, the wind disturbance caused the model ship to deviate from the desired trajectory, although the model ship reached the berthing point without collision. The probability of a collision would increase if the wind direction were different. Therefore, at this wind speed, a decision to stop the berthing maneuver is required. Further research into the wind speed that permits safe berthing maneuvers is required. On the other hand, from the point of view of tracking control, this result shows that the controller is able to deal with tracking errors even during low-speed maneuvering.
Fig. 15 shows the results of the model experiment conducted under relatively calm conditions. This result shows that the Ctrl-w/-OBST could track a berthing trajectory, including turning, and reduce the vessel speed to a value close to the allowable velocity, $\boldsymbol{v}_{\mathrm{tol}}$, near the berthing point.

Conclusion
This paper proposed a training method based on RL for a trajectory tracking controller that reduces the probability of collision with static obstacles. This paper summarized the application of the obtained trajectory tracking controller to berthing maneuvers. To demonstrate the effectiveness of the proposed method, this paper showed the results of both simulations and model experiments related to the tracking of berthing trajectories in a target harbor. This work also showed that the proposed method reduces the probability of collision during berthing maneuvers. Furthermore, this paper showed that the controller obtained by the proposed method was capable of following berthing trajectories and reducing the speed to a value close to the allowable velocity near the berthing point while avoiding collisions.
In the proposed method, the controller was trained using obstacles consisting of line segments longer than the vessel length. However, actual harbors may have more complex shapes. Therefore, research related to training methods for obstacles consisting of shorter line segments is necessary. In this study, the controller was evaluated by considering the probability of collisions. However, collisions below a certain speed may be acceptable due to the presence of fenders. Therefore, further research on the definition of successful berthing is also necessary. Additionally, a decision to stop the berthing maneuvers is required for autonomous berthing when the effect of the wind disturbance is significant. Addressing these problems is left for future work.

Fig. 2: The coordinate systems used in this work.
s s k = e e e T k,k , e e e T

Fig. 9: Average cumulative reward of the 20 episodes used for the evaluation.
(b) The time series of the pose and velocity vector.
(b) The time series of the vessel pose and velocity.

Table 4: Initial and terminal conditions, as well as tolerances used in trajectory planning.

Table 5: Limit of actuator state variables used in trajectory planning.

Table 6: Parameters describing the training environment.

Table 8: Details of procedure used in the NNs. Values in brackets express the inputs into Ctrl-w/o-OBST.

Table 10: The collision probability during berthing for each wind speed and terminal condition.