1 Introduction

When navigating a modern ship, several navigation tools are used. These include traditional tools such as charts and a compass, and measurements of the ship’s course and speed, as well as more advanced tools such as AIS, radar and ECDIS. Decision support tools which utilize artificial intelligence are also being developed, tested and used. In addition to gathering and displaying information, these tools make more or less intelligent decisions, combining information from different sensors and information sources, for example to detect, track and classify a target, and to determine its position, course and speed. The implementation of such functionality is sometimes aimed at helping the human operator enhance her situational awareness. Some systems also provide suggestions for deviating from the planned route by changing speed and/or course. Autonomous navigation systems can suggest new paths to avoid known static obstacles or areas known to often be congested, as well as maneuvers to avoid imminent collisions at close range [1]. Evasive maneuvers can be presented as an overlay on radar or ECDIS with waypoints, or by combining arrows and animations with “real” video or a simulator-generated environment. Route deviations proposed by the system can be displayed to the operator for information only, or the evasive maneuvers can be automatically executed by the system, enabling the ship to navigate on its own, without human involvement.

The focus in this study is on ships that are navigated autonomously, supervised by human operators. The idea in such ship operations is that the supervising human operator should intervene if (and only if) necessary and take manual control of the ship, for example from a remote operation center (ROC). We do, however, expect that this can be demanding for a human operator. Hence, the main objectives of this paper are (1) to propose a simulator-based methodology for investigating and quantifying operators’ performance and their ability to intervene when (and only when) appropriate, and (2) to demonstrate the methodology on a set of real traffic scenarios. In addition, we provide a detailed description of the traffic scenarios, including proposed collision avoidance maneuvers assessed by three independent experts. We hope this dataset can prove valuable for future studies; see “Appendix” for details.

In Sect. 2, we provide more context and background to the study. A detailed description of the proposed methodology is presented in Sect. 3. Section 4 presents our demonstration and results. Finally, a discussion and concluding remarks are offered in Sects. 5 and 6.

2 Background

An autonomous navigation system (ANS) should ideally only propose safe maneuvers, and the situational awareness conveyed by the system should be correct and complete. Unfortunately, even decision support functions can, if not carefully designed, distract the operator and contribute to accidents. To ensure safety, any decision support system or autonomous navigation system must be rigorously tested, and the system’s limitations, uncertainties and capabilities must be correctly conveyed to its users [2]. The verification of autonomous navigation systems is fundamentally different from a traditional verification process based on physical understanding and theory [3], and due to the large input space of real-world scenarios, unguided testing cannot hope to cover more than a tiny fraction of the unusual, but still possible, scenarios [4, 5]. As maritime navigation is a complex and complicated activity, it can be difficult for a computer to interpret the rules of the International Regulations for Preventing Collisions at Sea (COLREG), and it is very challenging to guarantee that a system will operate adequately in every situation.

Due to humans’ ability to learn and adapt to new and unexpected situations, they can play an important role in such complex technological systems to ensure safe and efficient operation [6]. One potential solution is therefore to let a human operator supervise the situation and intervene when necessary. With new functionality and increasing autonomy, it can, however, be challenging to keep the human operators in the information loop [7, 8], and this can lead to poor situational awareness. If the operator does not understand the system’s reasoning, it can be difficult to decide whether or not to intervene. Boredom can be another challenge in a supervision context [8]. This can be avoided if the system is capable of notifying the supervisor before it enters a challenging situation. In our experiment, once the system detects a challenging situation, it notifies the human operator. In a defined time window after the notification, the system will not perform any action, allowing the operator some time to assess the situation before an action is initiated automatically by the system. It is difficult to predict a future traffic situation, especially far into the future: if the system gives its notification very early, its prediction will be less accurate, leading to many false alarms and/or missed alarms. In order to achieve both adequate accuracy and sufficient time for the operator to assess the situation, we decided to use a time window of one to two minutes in our experiments.
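To make the timing concrete, the following minimal sketch outlines this notify-then-wait protocol. It is our illustration only: the hook functions and the 90 s window value are hypothetical stand-ins for the simulator and navigation system interfaces, not part of any cited system.

```python
import time

def supervise_once(detect_conflict, notify_operator, operator_intervened,
                   execute_maneuver, window_s=90):
    """One supervision cycle: detect, notify, wait, then act.
    All four callables are hypothetical hooks into the simulator/ANS."""
    if not detect_conflict():
        return "no action"
    notify_operator()                       # alert the human operator
    deadline = time.monotonic() + window_s  # hold off for the full window
    while time.monotonic() < deadline:
        if operator_intervened():
            return "manual control"         # operator takes over; system stands down
        time.sleep(1.0)                     # poll once per second
    execute_maneuver()                      # window expired without intervention
    return "automatic maneuver"
```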

Fig. 1 Decision support system. Two alternative suggested evasive maneuvers: (1) course change, (2) speed change

2.1 Human–machine interface

Information from support systems can be presented in different ways to humans on board the ship or in a remote operation center. Video displays or head-up displays can be used, highlighting important areas or targets, for example with bounding boxes, markers or image segmentation. The information can also be used to produce an augmented electronic chart giving the operator a bird’s-eye view. Nonetheless, when investigating the use of such augmented reality (AR) technology in the field of maritime navigation, Laera et al. [9] report that “primitive information such as course, compass degrees, boat speed and geographic coordinates continue to be fundamental information to be represented even with AR maritime solutions”.

In most traffic situations, there exist multiple viable solutions, and the technology provider must choose whether the system should display several alternative solutions, or only the one considered best. As an example, the system illustrated in Fig. 1 provides two alternatives: (1) course change only, and (2) speed change only. Another challenge is determining which objects are relevant to display and highlight. If the system detects a large number of ships at berth, should they all be highlighted or marked with bounding boxes? Detecting a small buoy can be difficult for a human operator, and hence a system which highlights it is useful. But what if the system fails to detect a large ship nearby or a large bridge pillar; will this affect the safety of the operation?

2.2 Degrees of autonomy

The physical location of the operator, whether she is physically present on the ship or located remotely, is decisive when the Maritime Safety Committee (MSC) of the International Maritime Organization (IMO) [10] organizes autonomy into four degrees: (1) crewed ship with automated processes and decision support, (2) remotely controlled ship with seafarers on board, (3) remotely controlled ship without seafarers on board, and (4) fully autonomous ship. In this study, however, our focus is on the type of tasks the human operator shall undertake when operating or supervising a ship, with no emphasis on the physical location of the human operator. We therefore find the categorization presented in Fig. 2 to be useful. This categorization distinguishes between ships that are actively navigated by humans, with access to traditional navigation tools only (Fig. 2a) or to sophisticated functionality and decision support tools (Fig. 2b), and ships that are navigated autonomously, with human involvement (Fig. 2c) or without human involvement (Fig. 2d). Whether the human operator is physically present on board the ship, or supervises the operation from an onshore location, is not taken into consideration in this categorization.

Fig. 2 Levels of autonomy. Alternative categorization focusing on the responsibility of the human operator

2.3 Testing

Quantifying the performance of a complicated and complex task such as marine navigation is inherently difficult [2]. Safe navigation is a task which lacks a clear specification [11], and the International Convention on Standards of Training, Certification and Watchkeeping for Seafarers (STCW) neither provides sufficient detail on all the competencies necessary for safe navigation nor the methods for assessing them [12]. One possibility is to let trained domain experts perform an experience-based assessment. If we assess the navigation quality of ships that are manually navigated (categories A and B), each realization of a scenario will have identical initial conditions, but depending on the actions of the navigator, the traffic situation can be very different a few minutes into the scenario. In this study we concentrate on ships in category C, ships that are navigated autonomously, supervised by a human operator. Since the autonomous system behaves identically in each realization until an operator intervenes, this makes it possible to compare multiple realizations of identical scenarios. Furthermore, this experimental setup also allows us to compare different human–machine interfaces, for example to investigate whether providing more information about the system’s reasoning increases the performance of the human operator.

3 Methodology

We propose a simulator-based approach for testing and assessing human operators’ abilities and performance in supervising an autonomously navigated ship. A set of experienced deck officers are assigned the same set of different scenarios, in random order. Prior to entering a scenario, the candidates are provided with general information about the scenario.

When the ship enters a challenging situation, the human operator (the test subject) is notified, and enters the ship’s bridge. The operator is immediately informed that the autonomous navigation system has detected a potentially conflicting situation. The system also provides information about how it will resolve the situation, and communicates an evasive maneuver. The human operator must quickly assess the traffic situation and the proposed collision avoidance maneuver, and after a short period of time (one to two minutes) decide whether or not to intervene. A summary of the experiment process is shown in Fig. 3. Each candidate is tested in multiple scenarios.

Fig. 3 Summary of the experiment process. *Results regarding the participants’ trust in the system, based on the questionnaires, are presented in Madsen et al. [13]

In this setup, it is not given that the operator will have sufficient time to attain adequate situational awareness [14]. Still, such situations can be realistic whenever the operator controls multiple ships (as can be the case in a ROC) or must perform additional tasks (as can be the situation in a ROC as well as on manned ships). Note that in some of the scenarios, the autonomous navigation system detects a potentially conflicting situation but decides to maintain course and speed. This can, for example, occur in situations where the own-ship is the stand-on vessel and the other ships should perform evasive maneuvers.

3.1 When is an intervention needed?

When designing or selecting test scenarios, it must be decided whether an intervention is needed, and when the intervention must be initiated. As marine navigation is a complex and complicated task which lacks a clear specification [11], this can, however, be challenging. To establish the correct action (to intervene or not), we let multiple experts assess the scenarios independently. These experts should be provided with all relevant information about the scenario, including, for example, the possibility to replay the scenario. Nonetheless, it can be challenging to decide whether or not an intervention is needed or appropriate in a given situation. Perhaps, for example, a subtle change in speed or course is favorable, but not strictly necessary. If the experts disagree on whether or not an intervention is appropriate, the scenario should not be included in the study.

In a subset of the simulator scenarios, the correct behavior of the human operator is to intervene to avoid accidents or dangerous maneuvers, while in the remaining scenarios an intervention is not needed. The ratio of scenarios where intervention is needed to the total number of scenarios can be interpreted as a proxy for the error rate of the autonomous navigation system (and one minus this ratio as a proxy for its accuracy). If, for example, the human operator is supervising multiple ships, it is important not to intervene when this is not necessary. Furthermore, in some of the scenarios, the proposed maneuver is indeed a good maneuver, and it will be very difficult to solve the situation in a better way.
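Stated as a formula, in our own (hypothetical) notation, with \(m\) scenarios in total of which \(m_{\textrm{int}}\) require an intervention:

$$\begin{aligned} \text {error-rate proxy} = \frac{m_{\textrm{int}}}{m}, \qquad \text {accuracy proxy} = 1 - \frac{m_{\textrm{int}}}{m}. \end{aligned}$$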

3.2 Test statistics

We execute one realization for each candidate in each scenario. We measure the time of intervention, and perform a structured interview aiming to determine the reasoning behind the candidates’ actions. The results are compared to the experts’ assessment. This lets us determine whether the human operator intervenes in due time when needed (true positive), fails to intervene in due time when needed (false negative), intervenes when not necessary (false positive), or avoids intervening when not necessary (true negative).

To communicate our results, we calculate the sensitivity (true-positive rate, TPR) and the specificity (true-negative rate, TNR). The sensitivity is the probability of a candidate intervening when this is appropriate or needed, and the specificity is the probability of not intervening when intervention is not appropriate. These metrics are calculated as follows:

$$\begin{aligned} {\texttt {sensitivity}} ={\textrm{TPR}}= \frac{{\textrm{TP}}}{{\textrm{TP}}+{\textrm{FN}}}, \end{aligned}$$
(1)

and

$$\begin{aligned} {\texttt {specificity}} ={\textrm{TNR}}= \frac{{\textrm{TN}}}{{\textrm{TN}}+{\textrm{FP}}}. \end{aligned}$$
(2)

Generally, we aim for both high sensitivity and high specificity. In this particular setting, it is perhaps reasonable to argue that sensitivity is the most important, that is, to achieve a high probability that a candidate intervenes when needed (or, inversely stated, to reduce the probability of not intervening when needed). However, a candidate who intervenes all the time will achieve maximum sensitivity, which will often lead to very low specificity. A combined metric that rewards both sensitivity and specificity is the following accuracy metric:

$$\begin{aligned} {\texttt {accuracy}}={\textrm{ACC}}= \frac{{\textrm{TP}}+{\textrm{TN}}}{{\textrm{TP}}+{\textrm{FP}}+{\textrm{TN}}+{\textrm{FN}}}. \end{aligned}$$
(3)
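As an illustration, the sketch below scores one candidate’s decisions against the experts’ ground truth and evaluates Eqs. (1)–(3). The data structures and the example numbers are hypothetical; this is not the study’s actual tooling.

```python
def score(candidate_intervened, intervention_needed):
    """Lists of bools, one entry per scenario; an intervention is
    counted only if made in due time."""
    tp = fp = tn = fn = 0
    for acted, needed in zip(candidate_intervened, intervention_needed):
        if acted and needed:
            tp += 1          # intervened in due time when necessary
        elif acted and not needed:
            fp += 1          # intervened when not necessary
        elif not acted and not needed:
            tn += 1          # correctly avoided intervening
        else:
            fn += 1          # failed to intervene when necessary
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")   # Eq. (1)
    specificity = tn / (tn + fp) if tn + fp else float("nan")   # Eq. (2)
    accuracy = (tp + tn) / (tp + fp + tn + fn)                  # Eq. (3)
    return sensitivity, specificity, accuracy

# Hypothetical example: 8 scenarios, of which 4 require intervention.
needed = [True, False, True, False, True, False, True, False]
acted  = [True, True,  True, False, False, False, True, False]
print(score(acted, needed))  # -> (0.75, 0.75, 0.75)
```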

3.3 Predefined suggested maneuvers

The experiments are conducted in a simulator environment to ensure that scenarios can be conducted repeatedly [15]. The experimental setup presented in this paper can be utilized to test systems that automatically generate evasive maneuvers. Evasive maneuvers can also be predefined manually, which is the case in our demonstration. Hence, we do not implement an online autonomous navigation system which bases its decisions on input from sensor data. Because of this, it is sufficient that the simulator we utilize replicates the real world with adequate detail for a human navigator, an assumption which is not necessarily valid if the autonomous navigation system bases its decisions on input from the simulator [16]. Online autonomous navigation systems can also be utilized, conditional on a sufficient interface between the simulator and the autonomous navigation system.

Fig. 4 Kongsberg full mission navigation simulator. (1) ECDIS with system proposed maneuver, (2) ECDIS with planned route, (3) conning display, (4) binoculars for visual orientation, (5) radar

4 Demonstration and results

For our demonstration of the proposed methodology, we conducted experiments in a Kongsberg full mission navigation simulator, see Fig. 4. A set of \(n=7\) second-year deck officers were assigned the same set of \(m=8\) different scenarios, in random order. In the experiments, the human operator is aware of the operation area of the ship, that is, Storfjorden, a fjord in the north-western part of Norway. All scenarios take place in the same part of the fjord, an area where the fjord is between three and four nautical miles wide. The own-ship is an 80 m container feeder with good maneuverability, steered by azimuth thrusters.

Whenever the ship enters a challenging situation, the human operator is notified and enters the ship bridge. The system attempts to resolve the situation with an evasive maneuver, but in some of the scenarios, the system proposes dangerous or sub-optimal solutions. The role of the human operator is, therefore, to closely monitor the system and intervene if necessary. The operator is informed about when the evasive maneuver will start; in our experiments this is between one and two minutes after the operator is notified.

The operator gains insight into the system’s planned maneuver through the human–machine interface (HMI). In our experiments, the HMI consists of a chart where both the original route and the system’s proposed deviation are shown. In practice, this is implemented in the simulator by displaying waypoints on an additional ECDIS display and using track mode on the autopilot. Figure 5 shows an example of the HMI. In this example, the own-ship is sailing west (its route shown as a straight solid line), but because the own-ship should give way to the southbound target ship, a starboard maneuver is proposed by the system. The proposed maneuver is shown as a dotted red line drawn through a set of waypoints (red circles).

Fig. 5 Human–machine interface: illustrates how the system’s planned evasive maneuver is communicated to the human operator

4.1 Scenario description

The 8 scenarios are based on real situations, selected from a set of 1010 ferry crossings in Storfjorden in the north-western part of Norway. All selected scenarios include challenging situations, and in the real world several of these situations resulted in deviations from COLREG. See [17, 18] for a detailed description of that dataset.

Fig. 6 Illustration of Scenario 1. The position of the own-ship is shown as double black circles. The proposed route is shown as red dashed lines. AIS symbols (green triangles) depict target ships, and the dashed green lines show their direction. The length of these lines indicates the target ship’s speed. See “Appendix” for descriptions of the full set of scenarios

Brief descriptions of all the included scenarios are provided in “Appendix”. Illustrations showing how the system uses a chart to present its proposed evasive maneuver to the operator are also provided. In all of the scenarios, the northbound and southbound ships are coastal ferries, with departures every 30 min. The ferry routes are marked in the nautical chart. An example of Scenario 1 is shown in Fig. 6. In this scenario, the own-ship is in a crossing situation and should give way to the target on its starboard side (the northbound target). There is also a target approaching on the port side, which should give way to the own-ship. There is no risk of collision with the westbound ship. The system’s decision complies with COLREG: it passes astern of the northbound target. Therefore, the correct action by the human operator is to not intervene in this scenario.
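The give-way logic invoked in this scenario can be illustrated with a deliberately simplified sketch (ours, not the experiment’s navigation system). It classifies a crossing geometry from the relative bearing of the target; the 112.5° boundaries follow the conventional sidelight sectors, and risk-of-collision and overtaking checks are omitted.

```python
# Simplified COLREG rule 15 illustration: in a crossing situation,
# the power-driven vessel which has the other on her own starboard
# side shall keep out of the way.

def crossing_role(own_heading_deg: float, true_bearing_to_target_deg: float) -> str:
    rb = (true_bearing_to_target_deg - own_heading_deg) % 360.0  # relative bearing
    if 0.0 < rb < 112.5:
        return "give-way"   # target on own starboard side
    if 247.5 < rb < 360.0:
        return "stand-on"   # target on own port side
    return "not a rule 15 crossing geometry"

# Hypothetical numbers loosely matching Scenario 1: own-ship heading
# west (270 deg), northbound target bearing 300 deg true.
print(crossing_role(270.0, 300.0))  # -> "give-way"
```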

Table 1 Interventions per candidate, 0: no intervention, 1: intervention

4.2 Interventions

With \(n=7\) candidates each performing \(m=8\) different scenarios, we conducted a total of 56 experiments. Table 1 shows whether or not the candidates intervened in each scenario. It is notable that some candidates tended to intervene more frequently than others; candidate 1 intervened in all but one scenario, while candidate 4 only intervened twice. To quantify the performance, we compare the candidates’ actions with the experts’ solutions. The results are provided in Table 2, showing the number of

  • True positives (TP): candidate intervenes in due time when necessary,

  • False positives (FP): candidate intervenes when not necessary,

  • True negatives (TN): candidate avoids intervening when not necessary, and

  • False negatives (FN): candidate fails to intervene in due time when necessary.

The sensitivity, specificity and accuracy are shown in columns 6–8 of the table. By coincidence, the number of TPs equals the number of TNs, and the number of FPs equals the number of FNs. Therefore, in our results, the TPR, TNR and accuracy are identical.
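This equality follows directly from the definitions: substituting \({\textrm{TN}}={\textrm{TP}}\) and \({\textrm{FP}}={\textrm{FN}}\) into Eqs. (1)–(3) gives

$$\begin{aligned} {\textrm{TNR}} = \frac{{\textrm{TN}}}{{\textrm{TN}}+{\textrm{FP}}} = \frac{{\textrm{TP}}}{{\textrm{TP}}+{\textrm{FN}}} = {\textrm{TPR}}, \qquad {\textrm{ACC}} = \frac{{\textrm{TP}}+{\textrm{TN}}}{{\textrm{TP}}+{\textrm{FP}}+{\textrm{TN}}+{\textrm{FN}}} = \frac{2\,{\textrm{TP}}}{2({\textrm{TP}}+{\textrm{FN}})} = {\textrm{TPR}}. \end{aligned}$$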

Table 2 Summary of the results

4.3 Performance

Table 3 Results for each scenario

Table 3 shows the results per scenario. We observe a large spread in accuracy across the different scenarios. For example, in Scenario 8, the accuracy is 1 because all the candidates intervened, and this was needed. In Scenario 5, however, 5 candidates intervened although this was not needed according to the experts, leading to an accuracy of 2/7 ≈ 0.29. We should, however, note that in this scenario, one of the target ships, a ferry, deviated from COLREG rule 15 (crossing situation) to give way to a larger cruise vessel, although the ferry was the stand-on vessel in that particular situation.

The sensitivity (true-positive rate, TPR), the specificity (true-negative rate, TNR) and the accuracy (ACC) achieved by the seven candidates are shown in Fig. 7. Candidate 7 achieves the highest accuracy, with both high TPR and high TNR. Candidate 1 achieves the highest possible TPR. Note, however, that a candidate who intervenes in every scenario will always achieve a sensitivity of 1, the maximum value (as sensitivity is the probability of intervening when appropriate). But intervening in all scenarios leads to poor specificity. This is illustrated by candidate 1, who intervenes in 7 out of 8 scenarios. This candidate achieves a sensitivity of 1, but the specificity and accuracy of this candidate are low relative to some of the other candidates.

Fig. 7 Average performance per candidate

4.4 Scenario complexity

After the experiments, the candidates were asked to what degree they understood the reasoning behind the suggestions proposed by the system. We might suspect that candidates who feel they do not understand the system’s reasoning would intervene more frequently. However, Fig. 8 shows no clear effect of this. Neither do we observe any clear connection between the self-reported understanding of the system’s reasoning and the accuracy achieved by the candidates, see Fig. 9.
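A minimal sketch of how such a (lack of) relationship can be quantified is given below, using Spearman’s rank correlation, a reasonable choice for ordinal questionnaire ratings. The values are made up; the study’s raw questionnaire data are not reproduced here, and with \(n=7\) candidates any such estimate is very noisy.

```python
from scipy.stats import spearmanr

understanding = [4, 3, 5, 2, 4, 3, 5]                         # hypothetical 1-5 ratings
accuracy = [0.500, 0.750, 0.625, 0.750, 0.500, 0.625, 0.875]  # hypothetical accuracies

rho, p_value = spearmanr(understanding, accuracy)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.2f}")
```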

Fig. 8 Self-reported understanding and number of interventions

Fig. 9 Self-reported understanding of system’s reasoning versus accuracy

In non-complex scenarios, it is easy to “understand” the system’s reasoning by assessing its plan. For example, in the crossing scenario shown in Fig. 1, with one stand-on vessel, a proposed starboard maneuver will be interpreted by the human operator (of the own-ship) as follows: (1) the system has detected the target ship, (2) the system has categorized this as a give-way situation, and (3) the system will make a starboard turn to give way. Alternatively, the own-ship could have increased its speed in point three. This would have been interpreted in the same way: (1) the system has detected the target ship, (2) the system has categorized this as a give-way situation, and (3) the system will increase its speed to give way.

In more complicated scenarios, the maneuver proposed by the system can be difficult to interpret. In such situations it would be beneficial if the system were able to convey its reasoning, to enable the human operator to intervene when, and only when, needed. Our results indicate that when the decision of the autonomous navigation system is difficult to explain, the candidates tend to intervene and manually control the operation. In two of the cases where a candidate unnecessarily intervened in Scenario 5, the intervention led to a close-quarters situation. Post-experiment interviews revealed that the candidates were unaware of the close-quarters situation and believed their actions and their situational awareness were adequate.

5 Future research

5.1 Identifying the strengths and weaknesses

To determine whether an intervention is appropriate and needed, candidates need to understand both the traffic situation and the system’s planned actions. In scenarios with low complexity, both tasks are generally straightforward. However, as complexity increases, both interpreting the traffic situation and interpreting the system’s proposed solution become difficult. For a collaboration to function effectively, it is crucial to leverage each other’s strengths and be aware of each other’s weaknesses. This holds true for human–machine collaboration as well. Future research should focus on enhancing systems to make fewer errors, but it is also important to identify the machine’s weaknesses so that humans can become better in their role as supervisors. The introduction of new intelligent navigation tools necessitates new competency standards for those who will use them or supervise their use. However, research is needed to identify what these competencies comprise.

5.2 Conveying the system’s plan to the operator

Both current and future systems for autonomous navigation may make use of artificial intelligence methods in order to improve the system’s performance [19]. Unfortunately, however, this can make it difficult to ensure that the proposed solution is easily interpretable by the human operator [20]. In our demonstration, the operator received what we consider minimal information about what the system planned to do: only information about how the ship would deviate from the originally planned sailing pattern. In future experiments, this human–machine interface can be changed to enhance decision transparency, and more information about the reasoning of the autonomous navigation system can be included. Scenarios can, for example, be generated such that the autonomous navigation system is not aware of all relevant targets, mimicking the likely situation where the object detection system fails to detect all relevant targets. This allows us to investigate whether it is important and useful for the human operator to be aware of which targets the system considers. We can, for example, execute two different realizations of each scenario, where we vary the human operator’s access to the reasoning and intentions of the autonomous navigation system. In both realizations, the operator is presented with a chart showing the evasive maneuver proposed by the autonomous navigation system. In the second realization, the chart additionally shows which targets the system takes into consideration. The accuracy achieved with and without target identification can then be compared.

Another possibility is to enable the operator to view a simulation showing how the system “believes” the future will evolve. In an experimental setting, this can be done by fast-forwarding a video recording of the scenario, without human intervention. This gives the human operator more insight into what the system believes will happen if the human operator does not intervene, and can help the operator to better understand the reasoning of the system. A similar approach to the one explained above can be utilized to assess and compare the effect of this additional information on the achieved accuracy.

5.3 Time to action

In our experiments, we used a time window of one to two minutes between the time when the operator is notified and the time when the system performs an action. Experiments with both longer and shorter time windows can be interesting topics for future research. It could, for example, be interesting to investigate how operators react to very short time windows. In our experiments, we noticed that many candidates intervened after a short time; the shortest was 15 s.

6 Conclusion

In this paper, we have explored a simulator-based approach for testing the human–machine collaboration required to enable supervised ship autonomy. The presented method is aimed at assessing human operators’ ability to quickly evaluate a traffic situation and decide, after a brief time period, whether or not to intervene. The results of our experiments indicate that candidates are able to successfully perform this task in low-complexity situations. However, as expected, performance declines in more complex traffic scenarios. When comparing the results of the different scenarios, we observe, not surprisingly, a large spread in accuracy between the scenarios, indicating that the results are highly dependent on the scenario design. We also observe that the seven candidates acted very differently: some preferred to intervene very quickly, perhaps out of mistrust of the system, while others may have placed too much reliance on the tool and intervened rarely.

Following the experiments, we conducted interviews with the candidates. Among other questions, they were asked to rate their perceived understanding of the system’s proposals. Somewhat surprisingly, we found no correlation between self-reported “understanding of the reasoning of the system” and “accuracy.” This suggests that it may be difficult for humans to recognize when to trust such decision support tools and when to override or supervise them. It also indicates that presenting poor solutions to the user might increase the risk of accidents in certain situations. This must be considered in regulatory development and approval processes. User tests may be useful for approval purposes, but should not be sufficient on their own. Another interesting observation is that several candidates who intervened unnecessarily explained that the reason for their intervention was that they wanted to execute the maneuver as quickly as possible. This might be appropriate in many situations (such as when yielding to another vessel), and demonstrates the potential benefit of autonomous navigation systems incorporating user communication capabilities: once a proposed maneuver is evaluated and approved by the user, it can be executed promptly.

It is important to note that this study only considers a small number of scenarios and a small number of candidates. Hence, we encourage further research to explore the above-mentioned relationships. In addition, while the candidates were accustomed to conducting experiments in simulators, collaborating with the system was new to them. We cannot rule out the possibility that the results would change if the candidates gained more experience in cooperating with the system.

Finally, this study clearly demonstrates that it is not a trivial task for a human operator to intervene and override an autonomous system, particularly under complex conditions. Hence, we strongly encourage the future development of new methods and tools for explaining the system’s decisions, as a way to increase the human operators’ capabilities. It is important to remember that this study primarily focused on difficult scenarios; much of a ship voyage is considerably less complex than the situations studied here. As demonstrated in our study, in low-complexity operations the human–machine collaboration seems to work rather well. If we, through this and future studies, can better identify the situations in which humans can intervene adequately, this could form the basis for numerous applications in autonomous maritime transport, including remotely controlled unmanned vessels in limited geographical areas and periodically or conditionally unmanned bridges.