Introduction

Cardiovascular (CV) diseases are the most common cause of death across Europe, accounting for more than 4 million deaths each year, with cerebrovascular disease accounting for 25.4% of CV-related mortality across all ages and genders [1]. Mechanical thrombectomy (MT) has become an established treatment for patients with acute ischaemic stroke due to large vessel occlusion, increasing the likelihood that a patient will be functionally independent after a stroke [2,3,4,5]. During such a procedure, an operator navigates a guidewire and guide catheter from an insertion point (typically the common femoral artery) through the common iliac artery and descending aorta to the aortic arch. Subsequently, the guide catheter is typically advanced over the guidewire through the aortic arch to the distal aspect of the cervical segment of the internal carotid artery (ICA), with or without the help of a slip catheter. Once positioned there, the guide catheter serves as an anchor for the next step: navigation of a microguidewire and microcatheter through the distal aspect of the cervical segment of the ICA to the site of the thrombus, typically in the M1 segment of the middle cerebral artery (MCA). Once the target site has been reached, a mesh-like device called a stent retriever is pushed through the thrombus within a microcatheter, expanded to engage the thrombus, and then pulled backwards, removing the thrombus from the blood vessel. Alternatively, an aspiration catheter—where a vacuum sucks the thrombus from the artery—is used instead of a stent retriever.

In acute ischaemic stroke, time from symptom onset to treatment is crucial, as the benefits of MT become more marked the sooner a thrombus is removed [6]. As a result, in the UK for example, only 3.1% of stroke admissions benefit from MT despite at least 10% of patients being eligible for treatment [7, 8]. Other challenges for MT relate to occasional complications, including perforation, thrombosis and dissection in the parent artery, and distal embolization of thrombus [9]. Moreover, angiography requires intravascular contrast agent administration, which can occasionally lead to nephrotoxicity [10]. For operators and their teams, the high cumulative dose of x-ray radiation from angiography is a risk factor for cancer and cataracts [11]. Although exposure can be minimized with current radiation protection practice, some measures involve operators wearing heavy protective equipment, a risk factor for orthopaedic complications, so alternative exposure reduction methods are beneficial [12, 13].

It is hoped that robotic surgical systems can mitigate or eliminate some of the challenges that MT presents. For example, robotic systems could be set up in hospitals nationwide and tele-operated remotely from a central location, increasing the speed of access to treatments such as MT beyond what is currently possible [14]. Additionally, robotic systems might eliminate any operator physiological tremors or fatigue and allow MT to be performed in an optimum ergonomic position while potentially increasing procedural precision (for example, procedure time), thereby improving overall performance and reducing complication rates [15]. Furthermore, as operators would not be required to stand next to the patient, their radiation exposure would be reduced, and the need to wear heavy protective equipment would be eliminated.

Robotic systems such as the Magellan\(^{\textrm{TM}}\) system (Auris Health, Redwood City, USA) and the CorPath GRX® (Corindus Vascular Robotics, USA) have been used to help alleviate some of the challenges of MT; however, they have limitations. The controller-operator structure imposes a reasonably high cognitive workload and can still result in human error, meaning the procedure remains limited by an individual operator’s skill set in terms of both MT and robot handling [16]. These robotic systems rely on user interfaces such as buttons and joysticks, requiring skills different from those used in current clinical practice [17]. Additionally, a lack of haptic feedback from robotic systems makes it difficult to perceive the tactile response of catheters and guidewires as they interact with vessel walls [14].

One emerging method of mitigating some of these challenges is by applying artificial intelligence (AI) techniques to robotic systems. AI, and in particular, machine learning (ML), has accelerated in recent years in its applications for data analysis and learning [18], with many areas of healthcare already making use of this technology for disease prediction and diagnosis [19, 20]. By using ML in the autonomous navigation of guidewires and catheters in MT, it is plausible that in endovascular specialities facing a shortage of highly trained operators, tele-operated MT may be performed safely and effectively by a few highly trained operators based in centralized neuroscience centres. Alternatively, less experienced operators—for example, endovascular operators (e.g. interventional radiologists who are not used to performing neurointerventional procedures)—may use AI-assisted robots in hospitals that are not neuroscience centres (the majority of hospitals are not neuroscience centres). Potentially, such developments would lead to greater accessibility of MT globally [21].

Several papers have investigated the potential use of ML in the automation of catheters and guidewires for endovascular interventions, with a recent systematic review finding 80% (8/10) of studies (all published after 2018) implemented some form of reinforcement learning (RL) [21]. RL is a subset of ML in which an agent learns by interacting with the environment, receiving feedback as rewards, and aims to maximize cumulative reward over time by optimizing its actions based on the current state [22]. However, this trial-and-error approach means that RL often requires many interactions with the environment to learn the optimal policy [23]. ‘Demonstrator data’ has been utilized previously in a variety of ways: it has been used to establish control policies for RL to optimize [24], enhance training speed [25], and function as high-priority samples in the reinforcement learning replay memory [26]. While a ‘dense reward’, characterized by frequent feedback, has been shown to provide quicker training times than a ‘sparse reward’, which offers feedback less frequently [25], no previous work has used demonstrator data to determine a reward function specifically for autonomous endovascular navigation tasks. Inverse reinforcement learning (IRL) can do this by using an agent to infer the underlying reward function from the observed behaviour of an expert [27]. By leveraging the knowledge of experts, the number of trial-and-error cycles needed can potentially be reduced, leading to faster and more effective learning of complex tasks.

Fig. 1

MT environment a used in simulations, with anatomy labelled and all possible targets, b with insertion point and example navigation path to target in each common carotid artery

This study aimed to leverage expert demonstrators through IRL to train autonomous guidewire and guide catheter navigation in simulated MT environments (i.e. in silico), by navigating from the common iliac artery to the distal aspect of the cervical segment of the ICA (representing the early stages of navigation which is completed by obtaining this stable guide catheter position). A primary objective was to train and test three models, each with a different reward function, using the soft actor-critic (SAC) algorithm [28]. A secondary objective was to conduct an analysis of one- and two-device tracking. This study contributes to the advancement of autonomous MT navigation and demonstrates a novel application of IRL in the simulated MT vasculature environment, filling a significant research gap in the field [21].

Methods

In the following sections, we present the navigation task, the environment required to simulate MT, and the simulation data collection process of expert demonstrator data. Subsequently, we describe the IRL approach for devising a reward function.

Navigation task

In MT, a guidewire is typically used to navigate a guide catheter to the ICA. An ‘access catheter’—one with a distal shape to allow access into vessel branches—is placed within the guide catheter and taken ahead of the guide catheter tip during navigation. Once the access catheter is within the ICA, the guide catheter is advanced to make a stable platform. At this point, the guidewire and access catheter are retracted, and a microguidewire within a microcatheter is together passed through the stable guide catheter and navigated to the target site. The navigation task presented in this paper simulates this complete navigation of guidewire and access catheter. A target location, randomly sampled from 30 centreline points within the right internal carotid artery (RICA) or left internal carotid artery (LICA), is chosen; the devices are then navigated to it from an insertion point in the common iliac artery. Twenty targets in each branch were selected for training, and the remaining 10 hold-out target locations were used for evaluation. The target location changes for each training or testing navigation attempt. Figure 1a shows the simulation environment, with anatomy labelled, and the possible targets. Figure 1b shows the insertion point (standard across all runs) and an example navigation path to a target within each carotid artery.
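The per-branch train/hold-out split described above can be sketched as follows; the function and variable names are illustrative, not from the study's code.

```python
import random

def split_targets(centreline_points, n_train=20, seed=0):
    """Split one branch's 30 centreline targets into train and hold-out sets."""
    rng = random.Random(seed)
    points = list(centreline_points)
    rng.shuffle(points)
    return points[:n_train], points[n_train:]

# Hypothetical (x', z') centreline coordinates for one carotid branch.
rica_targets = [(float(i), 100.0 + float(i)) for i in range(30)]
train_targets, test_targets = split_targets(rica_targets)
assert len(train_targets) == 20 and len(test_targets) == 10
assert not set(train_targets) & set(test_targets)  # hold-out is disjoint
```

Fixing the shuffle seed makes the split reproducible across training runs while keeping the hold-out targets unseen during training.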

Simulation environment

The in silico environment for the navigation task builds upon prior contributions from [17, 29], both of which use the Simulation Open Framework Architecture (SOFA), along with the modified BeamAdapter plugin, to create a comprehensive simulation-based training and evaluation environment [30, 31]. SOFA allowed replication of vascular structures derived from two computed tomography angiography (CTA) scans. The first scan encompassed the abdominal and thoracic regions, including the femoral arteries, descending aorta, and the aortic arch. The second CTA scan encompassed the aortic arch and extended to the cerebral vessels, including the carotid and vertebral arteries in the neck. These scans were subsequently processed into surface meshes, each consisting of approximately 850 vertices, which were then manually merged to create a cohesive and continuous model using Blender [32].

Integrating catheters (characterized by 160 vertices with Young’s modulus of 47 MPa) and guidewires (comprising 120 vertices with Young’s modulus of 43 MPa) within the in silico environment were facilitated by the BeamAdapter plugin developed for SOFA. This plugin was used to model the Penumbra Neuron MAX 088 guide catheter (Penumbra, California, USA) and the Terumo 0.035 guidewire (Terumo, Tokyo, Japan) to generate the initial topology and mechanical properties needed for the simulation.

The simulation assumed rigid vessel walls with an empty lumen. The simulation’s fidelity to real-world guidewire behaviour was achieved through iterative fine-tuning of parameters such as the friction between the guidewire and vessel wall and the guidewire’s stiffness, as demonstrated by [29] and [33], the latter of which has shown promising translation from in silico to an ex vivo porcine liver. A qualitative survey showed that six experienced (greater than five years clinical experience) interventional neuroradiologists (UK consultant grade; US attending equivalent), two novice (less than three years clinical experience) interventional neuroradiologists, and one interventional radiologist (UK specialist trainee grade; US fellow equivalent) agreed that the vessel anatomy was accurate, agreed the guide catheter performed realistically, and were indifferent to the realism of the guidewire [17].

Input parameters for the simulation, including guidewire rotation and translation speed, were applied at the proximal end of the catheter and guidewire. The output of the simulation was the catheter and guidewire position as a series of coordinate points directly extracted from the finite element model. The rotational speed was constrained to a maximum of 180 \(^{\circ }\, \textrm{s}^{-1}\), while the translational speed was capped at 40 \(\textrm{mm s}^{-1}\). Feedback during the navigation was given as two-dimensional \((x',z')\) tracking coordinates of the device, and no feedback about the vessel geometry was given. Experiments were performed using solely guidewire tracking (single tracking) and combined guidewire and catheter tracking (dual tracking).
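A minimal sketch of the speed constraints above; the function name and signature are illustrative:

```python
def clamp_inputs(rotation_deg_per_s, translation_mm_per_s,
                 max_rot=180.0, max_trans=40.0):
    """Clamp proximal-end control inputs to the simulation's speed limits:
    rotation to +/-180 deg/s and translation to +/-40 mm/s."""
    rot = max(-max_rot, min(max_rot, rotation_deg_per_s))
    trans = max(-max_trans, min(max_trans, translation_mm_per_s))
    return rot, trans

# Out-of-range commands are saturated; in-range commands pass through.
assert clamp_inputs(250.0, -55.0) == (180.0, -40.0)
assert clamp_inputs(90.0, 10.0) == (90.0, 10.0)
```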

Demonstrator data collection

The navigation task was performed in silico on ten random targets in each carotid artery for demonstrator data collection. Navigation of the targets from the same insertion point in Fig. 1b was performed using inputs from a keyboard. The action (device rotation and translation), device tracking data, and target location were collected at each simulation step. Data were split based on whether the target branch was on the left or right side. An example navigation path of guidewire and catheter is shown in Fig. 3a.

IRL algorithm

IRL was used to determine a suitable reward function from the collected demonstrator data. The maximum entropy IRL (MaxEnt IRL) algorithm was employed to learn the reward functions based on expert demonstrations, specifically for navigating through different branches of vasculature [34]. The algorithm’s objective is twofold: it seeks to replicate the observed behaviour of experts while ensuring that the learned reward functions capture a distribution of behaviour with maximum entropy. To achieve this, MaxEnt IRL involves an iterative process of updating reward functions to maximize the entropy of the expected policy while maintaining consistency with observed expert behaviour. The reward functions are initialized randomly, with the same random seed across experiments, and updated using gradient ascent to maximize the entropy of the policy.

For each branch, the reward function is updated using the gradient ascent step in Eq. (1), where \(R_i\) is the reward function for branch i, \(\nabla R_i\) is the gradient vector, and \(\eta \) is the learning rate.

$$\begin{aligned} R_i \leftarrow R_i + \eta \nabla R_i \end{aligned}$$
(1)

The MaxEnt IRL objective is expressed as in Eq. (2), for a policy \(\pi \left( a\mid s \right) \), which is the probability distribution over actions given a particular state.

$$\begin{aligned} \text {Maximize} \quad \mathbb {E}_\pi \left[ -\sum _a \pi \left( a\mid s \right) \log \pi \left( a\mid s \right) \right] \end{aligned}$$
(2)

The Boltzmann policy calculation for action a given state s is defined as Eq. (3).

$$\begin{aligned} \pi \left( a\mid s \right) = \frac{e^{R_i \left( s,a \right) }}{\sum _{a'}e^{R_i \left( s,a' \right) }} \end{aligned}$$
(3)

A feedforward neural network with four fully connected layers trained the model using expert trajectories, optimizing the policy through \(1 \times 10^6\) iterations. A separate model was trained for each set of targets in each branch. At the start of a new navigation task, the correct model based on the target branch was obtained, and at each step, a reward value was returned based on the current observation given to the model.
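For a small discrete action set, the Boltzmann policy of Eq. (3) and the gradient-ascent update of Eq. (1) can be sketched as below. The study used a four-layer neural network to represent the reward; this tabular version is illustrative only.

```python
import numpy as np

def boltzmann_policy(action_rewards):
    """Eq. (3): softmax over per-action rewards R_i(s, a)."""
    z = np.exp(action_rewards - action_rewards.max())  # shift for numerical stability
    return z / z.sum()

def irl_update(reward_values, gradient, learning_rate=1e-3):
    """Eq. (1): one gradient-ascent step on the branch reward function R_i."""
    return reward_values + learning_rate * gradient

rewards = np.array([1.0, 2.0, 3.0])
policy = boltzmann_policy(rewards)
assert np.isclose(policy.sum(), 1.0)   # valid probability distribution
assert policy.argmax() == 2            # highest-reward action is most probable
```

In the full algorithm, the gradient is derived from the mismatch between expert and expected feature counts; here it is simply an input to highlight the update rule's shape.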

Controller architecture, reward functions, and training

In this study, all RL models were trained using a SAC controller, adapted from the architecture developed in previous work by Karstensen et al. [29] to facilitate two-device tracking and accommodate various reward functions. The architecture consists of a Long Short-Term Memory (LSTM) layer, which functions as an observation embedder, enabling the learning of a trajectory-dependent state representation. The subsequent feedforward layers are responsible for learning the control of the guidewire. The LSTM-based observation embedder is updated exclusively using the q1-network. In this architecture, the controller receives an observation as input, and the Gaussian policy network generates parameters, specifically mean (\(\mu \)) and standard deviation (\(\sigma \)), of a normal distribution for the following action. During training, actions are sampled from this normal distribution, whereas for evaluation, \(\mu \) is directly employed as the action, rendering the behaviour deterministic. Training was completed on an NVIDIA DGX A100 node with 8 GPUs (NVIDIA, Santa Clara, California, USA).
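The stochastic-versus-deterministic action selection described above can be sketched as follows; the tanh squashing and LSTM embedder used by SAC are omitted, and all names are illustrative.

```python
import numpy as np

def select_action(mu, sigma, deterministic=False, rng=None):
    """Gaussian policy head: sample from N(mu, sigma) during training,
    return mu directly at evaluation time."""
    if deterministic:
        return np.asarray(mu)
    rng = rng if rng is not None else np.random.default_rng()
    return rng.normal(mu, sigma)

mu = np.array([0.2, -0.1])       # e.g. normalized rotation and translation commands
sigma = np.array([0.05, 0.05])
# Evaluation is deterministic: the mean is used as the action.
assert np.allclose(select_action(mu, sigma, deterministic=True), mu)
```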

The action was defined as the output of the Gaussian policy network, representing the catheter’s and the guidewire’s rotation and translation speeds. Three points at each instrument’s distal end described its position, denoted \((x', z')_{i=1,2,3}\), with \((x',z')_{1}\) coinciding with the instrument tip. Additionally, the target position was specified by the current target’s \((x', z')\)-coordinates. The observation comprised the current and previous guidewire and catheter positions, the target position, and the previous action taken from the last to the current position.

The controller training procedure performed the navigation task for \(1\times 10^{7}\) exploration steps, defined as a single control loop cycle during the exploration phase. Each navigation task was considered an episode and was deemed complete once the target was reached within a 5 mm threshold. To ensure computational efficiency, we introduced a timeout after 400 exploration steps (equivalent to approximately 53 s) without reaching the target. The control frequency was set to 7.5 Hz.

Three types of reward functions were used in this study to determine whether deriving a reward function from demonstrator data provides any benefit for autonomous endovascular navigation. Each reward function is calculated in every simulation step from the environment state based on the agent’s actions. Therefore, actions will lead to different reward quantities across reward functions and, hence, a variation in the final model.

The current state-of-the-art algorithm for autonomous navigation in endovascular interventions was devised by [29]. This algorithm uses a dense reward function, \(R_{1}\), which does not use IRL and is shown in Eq. (4). Here, pathlength is defined as the distance between the guidewire tip and the target along the centrelines of the arteries, with \(\Delta \text {pathlength}\) representing the change in pathlength from the previous step at time \(t-1\) to the current step at time \(t\).

$$\begin{aligned} R_{1} = -0.005 - 0.001\cdot \Delta \text {pathlength} + {\left\{ \begin{array}{ll} 1.0 & \text {if target reached} \\ 0 & \text {else}\end{array}\right. } \end{aligned}$$
(4)
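Eq. (4) translates directly into code; a minimal sketch:

```python
import math

def dense_reward(delta_pathlength_mm, target_reached):
    """Eq. (4): per-step penalty, pathlength shaping, and terminal bonus."""
    return (-0.005
            - 0.001 * delta_pathlength_mm
            + (1.0 if target_reached else 0.0))

# Moving 10 mm away from the target without reaching it is penalized.
assert math.isclose(dense_reward(10.0, False), -0.015)
# Reaching the target dominates the small step penalty.
assert math.isclose(dense_reward(0.0, True), 0.995)
```

The constant \(-0.005\) penalizes every step, encouraging short procedures, while the pathlength term rewards progress toward the target at each step.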

The second reward function \(R_{2}\) was derived using IRL, as seen in Eq. (5), where \(R_\textrm{RICA}\) and \(R_\textrm{LICA}\) are equal to \(R_{i}\) in Eq. (1) calculated for the right and left carotid arteries, respectively.

$$\begin{aligned} R_{2} = {\left\{ \begin{array}{ll} R_\textrm{RICA} &{} \text {if target in RICA}\\ R_\textrm{LICA} &{} \text {if target in LICA} \end{array}\right. } \end{aligned}$$
(5)

The third reward function \(R_{3}\) was obtained through reward shaping using a combination of the dense reward function \(R_{1}\) and the IRL-derived reward function \(R_{2}\), as shown in Eq. (6). \(\alpha \) is a scaling factor used to provide the correct ratio of \(R_{1}\) and \(R_{2}\). A scaling factor of 0.001 was used in this study.

$$\begin{aligned} R_{3} = R_{1} + \alpha R_{2} \end{aligned}$$
(6)
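Eqs. (5) and (6) combine into a single shaped reward. In this sketch the per-branch IRL reward values are assumed to come from the trained per-branch networks, and all names are illustrative:

```python
import math

def shaped_reward(dense_r1, irl_r_rica, irl_r_lica, target_in_rica, alpha=0.001):
    """Eq. (6): R3 = R1 + alpha * R2, with R2 selected per branch (Eq. (5))."""
    r2 = irl_r_rica if target_in_rica else irl_r_lica  # Eq. (5)
    return dense_r1 + alpha * r2                        # Eq. (6)

# With alpha = 0.001, the IRL term gently biases the dense reward.
r3 = shaped_reward(dense_r1=-0.005, irl_r_rica=2.0, irl_r_lica=0.5,
                   target_in_rica=True)
assert math.isclose(r3, -0.005 + 0.001 * 2.0)
```

The scaling factor keeps the IRL term small relative to the terminal bonus of \(R_{1}\), so reaching the target remains the dominant incentive.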

Evaluations were conducted every \(2.5 \times 10^5\) exploration steps for 100 episodes. Ten targets in each branch unused during training were utilized for evaluation. Each evaluation step records the success rate, procedure times, and path ratio. Comparison is performed between the highest success rate of each model and the corresponding path ratio and procedure times at that evaluation step. The success rate is the percentage of evaluation episodes in which the controller successfully reaches the target; the path ratio is a measure of the remaining distance to the target point in unsuccessful episodes, calculated by dividing the remaining distance by the initial distance; and the procedure time is the time taken from the start of navigation to reaching the target location in successful episodes. Exploration steps is the number of training steps taken to reach the point at which the results are provided. Comparative statistical analyses were conducted using two-tailed paired Student’s t-tests, with a predetermined significance threshold set at p = 0.05.
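The three metrics can be computed from a list of evaluation episodes as below; the field names are illustrative, and the path ratio follows the definition given above (remaining distance divided by initial distance, over unsuccessful episodes):

```python
def evaluation_metrics(episodes):
    """Return (success rate %, mean procedure time s, mean path ratio %)."""
    successes = [e for e in episodes if e["success"]]
    failures = [e for e in episodes if not e["success"]]
    success_rate = 100.0 * len(successes) / len(episodes)
    mean_time = (sum(e["time_s"] for e in successes) / len(successes)
                 if successes else float("nan"))
    path_ratio = (100.0 * sum(e["remaining_mm"] / e["initial_mm"]
                              for e in failures) / len(failures)
                  if failures else 0.0)
    return success_rate, mean_time, path_ratio

episodes = [
    {"success": True, "time_s": 22.0, "remaining_mm": 0.0, "initial_mm": 300.0},
    {"success": False, "time_s": 53.0, "remaining_mm": 60.0, "initial_mm": 300.0},
]
rate, mean_time, ratio = evaluation_metrics(episodes)
assert rate == 50.0 and mean_time == 22.0
```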

Table 1 Results of device tracking training
Fig. 2

a Success rate (%), b path ratio (%) during training for single- versus dual-device tracking

Fig. 3

Trajectories of catheter and guidewire tip for a demonstrator data, b single device tracking (with final catheter position highlighted), c dual device tracking (with final catheter position highlighted), and d IRL

Results

Single- versus dual-device tracking

Results of an investigation into the difference in success rate, procedure times, and path ratio during training between tracking solely the guidewire and tracking both the guidewire and catheter are shown in Table 1. Figure 2 shows the success rate and path ratio during training. Single tracking and dual tracking reach similar maximum success rates of 95% and 96%, respectively (\(p= 0.741\)), with single tracking having a lower procedure time of 22.5 s compared to 24.9 s for dual tracking (\(p=0.047\)). Figure 3 compares the guidewire and catheter trajectories from insertion to a target in the LICA of single and dual-device tracking. The final tip position of catheters for these two models is highlighted for comparison. For single tracking, the catheter is utilized sparingly and is not navigated out of the common iliac artery. In contrast, during dual tracking navigation, the catheter is used throughout the navigation task and is taken up the aortic arch to provide stability. Dual tracking navigation, therefore, mimics neurointerventional radiology experts, whereas single tracking navigation does not.

Table 2 Results of reward function training
Fig. 4

a Success rate, b path ratio during training for a dense reward function, IRL-derived reward function, and reward shaping function

IRL and reward shaping

Table 2 shows the final success rate, procedure time, and path ratio for a dense reward function alone, an IRL-derived reward function alone, and a reward function obtained through reward shaping (which combines a dense reward function and an IRL-derived reward function). All experiments are performed using dual tracking. Figure 4 shows the success rate and path ratio during training. Reward shaping provides the highest success rate of 100%, followed by 96% for the dense reward function (\(p=0.045\)). Reward shaping also has the lowest procedure time out of the examined reward functions at 22.6 s, compared to 24.9 s for the dense reward function (\(p=0.0001\)). Figure 3d shows an example navigation path for guidewire and catheter of the IRL-derived reward function at \(0.76 \times 10^6\) exploration steps, where it can be seen that the low success rate of 48% can be attributed to catheterization of the left vertebral artery, rather than the right carotid artery.

Discussion

This study was the first to perform a navigation task that simulated the MT intervention by autonomously navigating both guidewire and catheter from an insertion point in the femoral artery to a target in the carotid artery. Furthermore, our study introduced, for the first time, dual-device tracking for autonomous endovascular navigation, providing information about both catheter and guidewire during RL training. This study was also the first to investigate reward functions in autonomous endovascular navigation by using IRL to identify a specific reward function from demonstrator data for RL training.

Prior to this work, there had been no studies applying RL in any form to any task related to MT [21]. Our study demonstrated proof of concept of in silico autonomous navigation of instruments in MT. The implication is that if further validation studies demonstrate benefit, our approach could plausibly increase patient accessibility to MT while increasing intervention safety and speed.

Single versus dual tracking

Both models showed an increase in path ratio and success rate over training. Results showed that single and dual tracking models reach a similarly high success rate after training, but the single tracking method records quicker procedure times. However, investigation into the dual tracking model shows that the catheter is utilized from an early stage during navigation, with movements similar to those of an experienced neurointerventional radiologist. The catheter has a higher stiffness than the guidewire; hence, in the dual tracking model, the catheter is taken up the aortic arch to provide stability as the guidewire is navigated into the branching point of the common carotid arteries. For this reason, dual tracking was used in all reward function experiments, as it better replicates the actions of an experienced operator with no trade-off in success rate.

It is noteworthy that in this experiment, a single static mesh and a relatively simple navigation task were used. It is plausible that in a more challenging navigation task, for example, where the anatomy is more tortuous, as seen in old and hypertensive patients, the dynamic use of two instruments (dual tracking with guidewire and catheter) would enable a higher success rate.

Reward function

Results showed that reward shaping using a combined learned IRL and a dense reward function gives a higher overall success rate and a lower procedure time than the state of the art [29]. The model trained using reward shaping can use demonstrator data through IRL to navigate towards the target, while the dense reward function provides an incentive for reaching and moving towards the target while promoting this in the lowest number of steps possible.

Training using only an IRL-derived reward function did not reach the same success rates as the other models, although the model was able to navigate towards the target. The low success rate stems from wrong-branch catheterization, which could be caused by the IRL-derived reward function not taking the vessel walls into account and therefore returning a high reward signal for a small Euclidean distance in the coordinate system while the pathlength is still significant. For anatomical clarity, the distal vertebral artery and the ipsilateral internal carotid artery are approximately 10 mm from each other, yet both arteries are in continuity with the aortic arch, allowing the guidewire to move seamlessly to either location. These issues were overcome through reward shaping: by providing a negative reward for moving away from the target via the pathlength term, the success rate can be increased dramatically.

A model with a high path ratio is still valid for all experiments performed within this study. Failures where the path ratio was above 80% usually meant that the correct branch was catheterized; however, the guidewire did not reach the target point. Despite this, the guidewire would be in the correct region of the vessel and be helpful for the operator and subsequent steps in MT. After all, in MT the initial aim simulated by our experiments is to navigate to this approximate area within the ICA with a standard guidewire facilitated by an access catheter. Thereafter, a guide catheter is advanced to this approximate area, and then, a microguidewire and microcatheter are used. The exact location of the initial navigation within the ICA is often less important than simply having a guidewire in the ICA that is stable enough to allow the subsequent endovascular steps.

Limitations

The results of this investigation towards an autonomous MT navigation system fall under a technology readiness level (TRL) of 3 [35], and this proof-of-concept work showed that IRL can be leveraged effectively as part of reward shaping to deduce a suitable reward function for RL training; nevertheless, there are limitations to the methodology employed. In vitro (e.g. phantom) and clinical validation steps would be required to progress the TRLs.

First, demonstrator data were obtained using a simple keypad controller; therefore, to improve the reward function’s suitability, data could be collected in a way that more accurately reflects the actions performed by an experienced neurointerventional radiologist. Furthermore, 20 demonstration trajectories were used for IRL, and a larger sample may provide a more generalized reward function. Additionally, the reward function could be altered to give consideration to safety during the intervention: by providing a negative reward when vessel wall contact forces exceed a threshold, perforation of a vessel wall could be prevented.

Second, while different targets were used for training and testing, the same mesh was used throughout the experiments. Future in silico validation work should apply these algorithms to a range of patient anatomy (a dataset with multiple training and hold-out testing unique patient anatomy) to increase generalizability, improving the effectiveness during future in vitro experiments.

Third, simulations in this study were performed using tracking coordinates of the guidewire and catheter tip. An alternative would be to use an image-based tracking system, which may have more clinical relevance, as the autonomous system could use the fluoroscopic imaging that already takes place during MT interventions. For future in vitro experiments, further work will investigate the effectiveness of the techniques demonstrated in this study on image-based tracking inputs as well as those captured by device tracking.

Fourth, recent clinical trials of robotic diagnostic cerebral angiography recommend that procedural times, success rates, fluoroscopy times, and radiation doses be compared with the traditional manual approach [36]. The measurement of fluoroscopy times and radiation doses is outside the scope of this in silico study. However, success rate, procedural times, and path ratio were recorded at each evaluation step. Future in vitro experiments could measure fluoroscopy times and radiation doses.

Our study successfully addresses the task of simulating the complete navigation of guidewire and guide catheter in MT, which takes place from the common iliac artery to either the LICA or RICA where the guide catheter is in a stable position. However, a complete MT procedure typically involves a subsequent navigation step from the LICA or RICA to the MCA. Such a subsequent navigation to the MCA requires different instrumentation (often microcatheters and microwires) beyond the scope of our investigation. Therefore, while this study provides valuable insights into a specific aspect of the MT procedure involving several steps where there is challenging anatomy, it does not represent a comprehensive examination of the entire MT navigation procedure.

Conclusion

We showed the feasibility of autonomous guidewire navigation throughout the MT vasculature using IRL with demonstrator data for the first time. We established a comprehensive simulation-based training environment and, for the first time, compared models with different reward functions for autonomous endovascular navigation by utilizing SOFA and SAC. Reward shaping emerged superior, with higher success rates and lower procedure times than the state of the art. Single- versus dual-device tracking revealed that both methods achieved high success rates, but dual tracking utilized both devices, effectively mimicking experienced neurointerventional radiologists. We have highlighted several opportunities for future research beyond our proof-of-concept experiments, in particular relating to improving performance. Another opportunity is to investigate the final stage of navigation which is typically from the distal aspect of the cervical ICA to the M1 segment of the MCA using microcatheter and microwire. Additionally, researchers may wish to apply our approach to other non-MT endovascular systems. In summary, this work offers promising avenues for improved procedural accessibility and precision of in silico autonomous endovascular navigation tasks. For MT, the work plausibly lays the foundation for potentially transformative patient care.