Introduction

Screening of patients for aneurysm clips and other metallic devices prior to magnetic resonance imaging (MRI) is vital to ensure that the patient and device can be scanned safely. There have been numerous makes and designs of aneurysm clip over the decades [1], many of which have been categorized as MRI conditional. For these particular implants, MRI is not absolutely contraindicated, but the devices need careful prior assessment to ensure that the scan takes place under manufacturer-specified conditions. However, not all historic clips are MRI safe, and even those that are safe under some conditions may not be safe under all conditions [2]. At least one fatality has been caused by the displacement of an aneurysm clip [3]. Safe examination requires review of medical records and coordination of multiple experts [4]. Late detection has the potential to result in last-minute cancellations and wasted scanner time. Failure to perform the required checks can result in device dysfunction with potential harm to the patient.

MRI is the standard imaging modality for many conditions. Appropriate screening policies and procedures are essential before permitting entry to the MRI scanner to prevent injury [5]. Best practice is to use referrer and patient questionnaires to identify patients with devices or other issues that need further investigation. Questionnaires are not fail-safe as referrer responses can be unreliable and patient responses are often not available until the day of the scan.

In the last decade, there have been significant advances in AI-based medical image classification due to increased compute power, the open-sourcing of large labelled datasets, and the development of deep learning [6]. Deep learning describes the subset of machine learning which uses layered neural networks to build representations of complicated concepts out of simpler concepts [7]. This removes the need for the manual feature extraction required by other methods and streamlines the preprocessing pipeline [8]. The success of deep learning methods in image classification tasks is well documented, and for the last decade they have exceeded the performance of many other state-of-the-art classification algorithms [9]. There are now thousands of publications applying deep learning techniques to medical imaging [10].

We describe the design of a deep learning model for the detection of aneurysm clips in computed tomography (CT) head scans. The vast majority of patients with aneurysm clips will previously have had CT head imaging performed as part of their treatment, presenting the potential to screen these prior scans as part of an automatic pre-MRI safety check. This would improve MRI safety, reduce last-minute cancellations, and save time and resources.

Materials and Methods

Ethical approval was granted on 15 October 2019 by HRA and Health and Care Research Wales. Data were obtained from Derriford Hospital, a large teaching hospital with a regional neurosurgery centre serving the South West of the United Kingdom. The study design was retrospective and observational using pre-existing medical image data.

Subject Inclusion

A database of patients with aneurysm clips was used to identify cases for inclusion in the study: a list of all patients undergoing aneurysm clip surgery was compiled from surgical records. The radiology information system (RIS) (Cris, Wellbeing Software) was used to identify all post-surgical CT head examinations for these patients. A custom SQL query was then used to search the RIS for matched controls: for each scan with an aneurysm clip present, a scan with no aneurysm clip present was identified (an illustrative sketch of this matching follows the list below). These control scans were matched according to:

  • Scan type

  • Age at time of scan, within a window of ± 6 months

  • Scan date, within a window of ± 12 months

  • Gender
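
The study used a custom SQL query against the RIS; purely as an illustration, an equivalent matching rule expressed in Python with pandas might look like the following sketch (the file name and all column names are hypothetical, not the actual RIS schema):

```python
import pandas as pd

# Hypothetical RIS extract with one row per CT head examination.
exams = pd.read_csv("ris_ct_head_exams.csv",
                    parse_dates=["scan_date", "date_of_birth"])

cases = exams[exams["has_clip"]]
pool = exams[~exams["has_clip"]].copy()

controls = []
for _, case in cases.iterrows():
    case_age_days = (case["scan_date"] - case["date_of_birth"]).days
    pool_age_days = (pool["scan_date"] - pool["date_of_birth"]).dt.days
    matches = pool[
        (pool["scan_type"] == case["scan_type"])                          # same scan type
        & (pool["gender"] == case["gender"])                              # same gender
        & ((pool_age_days - case_age_days).abs() <= 183)                  # age within ~6 months
        & ((pool["scan_date"] - case["scan_date"]).dt.days.abs() <= 365)  # date within 12 months
    ]
    if not matches.empty:
        control = matches.iloc[0]
        controls.append(control)
        pool = pool.drop(control.name)  # each control used at most once
```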

Image Data Acquisition

Images for the investigations identified on the RIS were downloaded from PACS using dcmtk (OFFIS e.V.) [11]. These studies were anonymized using custom anonymization software based on the Clinical Trials Processor (RSNA MIRC project) [12].
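
As an illustration, a single study might be retrieved from PACS with dcmtk’s movescu along the lines of the sketch below; the AE titles, host, ports and study UID are placeholders, and site-specific options will differ:

```python
import subprocess

# Illustrative C-MOVE retrieval of one study using dcmtk's movescu;
# every network detail and the StudyInstanceUID below is a placeholder.
subprocess.run([
    "movescu", "-S",                     # study-root query/retrieve model
    "-aet", "WORKSTATION",               # our (calling) AE title
    "-aec", "PACS",                      # PACS (called) AE title
    "-aem", "WORKSTATION",               # move destination AE title
    "+P", "11112",                       # port for the incoming C-STORE association
    "-od", "downloads/",                 # write received DICOM files here
    "-k", "QueryRetrieveLevel=STUDY",
    "-k", "StudyInstanceUID=1.2.3.4.5",  # placeholder UID
    "pacs.example.org", "104",           # placeholder PACS host and port
], check=True)
```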

Ground Truth Confirmation

Manual review of images was performed by two board-certified radiologists to ensure correct labelling. In the event of disagreement over the correct label, a third board-certified radiologist reviewed the case to confirm the labelling.

Split

Two sets of images were extracted from the fully curated dataset: a set of localizers and a set of full CT heads. Most CT scan studies begin with one or more localizer scans. These are of lower quality than full CT scans, but aneurysm clips can often still be clearly seen (Fig. 1). Localizer scans acquired in the same plane were identified automatically using the DICOM tags. From the fully curated dataset, 274 scans were identified which contained sagittal localizers: 136 with aneurysm clips and 138 without. These localizers were randomly divided at the scan level: 28 scans (10%) were reserved as a holdout test set (10 with aneurysm clips and 18 without). The remaining 246 (90%) were used for model development (126 with aneurysm clips and 120 without).

Fig. 1
figure 1

Sagittal localizer with aneurysm clip present, circled

To standardize the full CT head dataset, scans reconstructed using the same kernel were identified automatically using the DICOM tags. From the fully curated dataset, 214 scans were identified which had been reconstructed using a bone kernel: 104 with aneurysm clips and 110 without. These were randomly divided at the scan level: 22 scans (10%) were reserved as a holdout test set (11 with aneurysm clips and 11 without). The remaining 192 (90%) were used for model development (93 with aneurysm clips and 99 without).

For both localizers and full CT heads, fivefold cross-validation was used to develop and assess models, with the data divided into 80% training data and 20% validation data in each fold.
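
A scan-level fivefold split of this kind can be produced with scikit-learn; the sketch below assumes lists scan_ids and labels, and uses stratification, which the study does not explicitly state:

```python
from sklearn.model_selection import StratifiedKFold

# Fivefold cross-validation at the scan level: each fold uses ~80% of the
# development scans for training and ~20% for validation.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(scan_ids, labels)):
    train_scans = [scan_ids[i] for i in train_idx]
    val_scans = [scan_ids[i] for i in val_idx]
    # ...train and validate one of the five models on this fold...
```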

For both types of image, the five models developed in cross-validation were then tested on the holdout test set.

Image Preprocessing

The images were preprocessed before model input by a deterministic automatic pipeline developed in Python using tools from OpenCV [13], SciPy [14] and scikit-image [15]. For the two-dimensional localizer scans, black borders were removed. Pixel values were rescaled between zero and one. Images were cropped to contain the head only, and the bottom of each image was removed to exclude the mandible. This optimization was included after the explainability technique revealed that models were being confounded by the presence of dental fillings, resulting in false positive results. Images were resized to 400 \(\times\) 400 pixels.
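
A minimal sketch of the two-dimensional pipeline is shown below; the head crop is simplified to a non-zero bounding box, and the fraction removed to exclude the mandible is a guess, as the exact values are not reported:

```python
import cv2
import numpy as np

def preprocess_localizer(img: np.ndarray) -> np.ndarray:
    # Remove black borders (and roughly crop to the head) by keeping the
    # bounding box of non-background pixels.
    ys, xs = np.nonzero(img > 0)
    img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # Rescale pixel values to the range [0, 1].
    img = img.astype(np.float32)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)
    # Remove the bottom of the image to exclude the mandible
    # (the 0.75 fraction is an assumption).
    img = img[: int(img.shape[0] * 0.75), :]
    # Resize to the 400 x 400 network input size.
    return cv2.resize(img, (400, 400))
```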

For the three-dimensional scans, the Hounsfield values were clipped with a level of 2000 and a window of 500 (i.e. to the range 1750–2250 HU) to optimize the visibility of metal. Voxel values were scaled between zero and one. Images were cropped to contain the head only and resized to 256 \(\times\) 256 \(\times\) 40 voxels.
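
The corresponding three-dimensional steps might look like the sketch below, assuming a volume of Hounsfield units with axes (rows, columns, slices); head cropping is omitted:

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_volume(hu: np.ndarray) -> np.ndarray:
    level, window = 2000, 500
    lo, hi = level - window / 2, level + window / 2   # clip to [1750, 2250] HU
    vol = np.clip(hu.astype(np.float32), lo, hi)
    # Scale voxel values to the range [0, 1].
    vol = (vol - lo) / (hi - lo)
    # Resize to 256 x 256 x 40 voxels with (tri)linear interpolation.
    factors = (256 / vol.shape[0], 256 / vol.shape[1], 40 / vol.shape[2])
    return zoom(vol, factors, order=1)
```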

Neural Network Architecture

Python-based deep neural networks were built with Keras [16] using the TensorFlow backend [17]. Graphics processing unit hardware acceleration on an NVIDIA GeForce RTX 3080 was used for neural network training. Jupyter Lab [18] was used for model development to enable iterative improvements to be made efficiently.

Fig. 2
figure 2

Network architectures

For the classification of the two-dimensional localizer images, a convolutional neural network based on a pre-trained model was selected as a proven choice for computer vision and image classification tasks using transfer learning [10]. Several well-established pre-trained base networks were trialled, including VGG16 [19], Inception V3 [20], Xception [21], DenseNet [22] and MobileNet V2 [23]. Following analysis of each candidate, MobileNet V2 performed best and was chosen for the final models (Fig. 2a).
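
A minimal Keras sketch of this transfer-learning setup is given below; the classification head and the handling of single-channel localizers are assumptions, as the exact architecture is shown in Fig. 2a:

```python
from tensorflow import keras

# Pre-trained MobileNet V2 backbone with a small binary classification head.
base = keras.applications.MobileNetV2(
    input_shape=(400, 400, 3),   # grayscale localizers replicated to 3 channels
    include_top=False,
    weights="imagenet",
)
x = keras.layers.GlobalAveragePooling2D()(base.output)
output = keras.layers.Dense(1, activation="sigmoid")(x)  # P(clip present)
model = keras.Model(base.input, output)
```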

For the classification of the three-dimensional CT images, a three-dimensional convolutional neural network was trained from scratch, due to the lack of readily available pre-trained three-dimensional classification networks [24]. Several different hyperparameter configurations were trialled. Following learning-curve analysis for each configuration, the one which achieved the smallest loss on the validation data was chosen for the final models (Fig. 2b).
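
For comparison, a from-scratch three-dimensional network of the kind described might be sketched as follows; the block count and filter sizes are illustrative, with the tuned configuration shown in Fig. 2b:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Simple 3D convolutional classifier over 256 x 256 x 40 volumes.
inputs = keras.Input(shape=(256, 256, 40, 1))
x = inputs
for filters in (16, 32, 64):                  # illustrative filter counts
    x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling3D(pool_size=2)(x)   # halves each spatial dimension
x = layers.GlobalAveragePooling3D()(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
```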

Model Training

The models were trained for a maximum of 100 epochs using stochastic gradient descent with the Adam optimizer (learning rate 0.001) [25]. The binary cross-entropy loss function was used. The batch size was 64. The images were augmented with a 50% probability of horizontal flip. Other augmentation methods were trialled but did not result in any further increase in performance. The models achieving the lowest loss on the validation sets during training were saved using checkpoints.
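
A training sketch consistent with these settings follows; the dataset variables and checkpoint path are placeholders, and the horizontal flip is shown with a Keras preprocessing layer that flips roughly half of the training images:

```python
from tensorflow import keras

# Augmentation: randomly flips ~50% of images horizontally (would typically
# be inserted as the first layer of the model, or applied to the dataset).
augment = keras.Sequential([keras.layers.RandomFlip("horizontal")])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
checkpoint = keras.callbacks.ModelCheckpoint(
    "best_model.h5",
    monitor="val_loss",
    save_best_only=True,   # keep the weights with the lowest validation loss
)
model.fit(
    x_train, y_train,                 # placeholder training arrays
    validation_data=(x_val, y_val),   # placeholder validation arrays
    epochs=100,
    batch_size=64,
    callbacks=[checkpoint],
)
```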

A classification threshold was then chosen for the models which maximized sensitivity, and therefore minimized the number of false negatives.
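
One reasonable way to choose such a threshold, not necessarily the authors’ exact procedure, is to take the largest probability cut-off that still yields 100% sensitivity on the validation data:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Validation-set probabilities from the trained model (placeholder arrays).
probs = model.predict(x_val).ravel()
fpr, tpr, thresholds = roc_curve(y_val, probs)

# Largest threshold with perfect sensitivity, i.e. the most specific
# operating point among those with no false negatives.
threshold = thresholds[tpr == 1.0].max()
y_pred = (probs >= threshold).astype(int)
```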

Explainability

SHapley Additive exPlanations (SHAP) were used to explain the 2D models’ predictions. SHAP uses the game theory concept of Shapley values to calculate the contribution of a factor to a machine learning model output [26]. In this case, DeepSHAP was used to calculate and visualize the contribution of individual pixels to the deep learning model’s prediction.
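
A minimal DeepSHAP sketch using the shap package is shown below; the choice and size of the background sample are assumptions, and compatibility depends on the shap and TensorFlow versions in use:

```python
import shap

# Background (reference) images drawn from the training data.
background = x_train[:100]

# DeepSHAP explainer for the trained Keras model.
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(x_test)   # per-pixel contributions

# Overlay the attributions on the images: red pixels push the prediction
# towards "clip present", blue pixels towards "no clip present".
shap.image_plot(shap_values, x_test)
```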

Results

Localizer Images

Of the pre-trained base models trialled for the localizer images, MobileNet V2 achieved the greatest mean test Receiver Operating Characteristic (ROC) area under the curve (AUC) and was chosen for the final models. Other base model results are reported in Table 1.

Table 1 Performance of different base models for localizer images

A classification threshold of 0.16 was chosen to maximize sensitivity whilst maintaining a high accuracy and specificity (Fig. 3). The final models achieved a mean test sensitivity of 100%. Other performance metrics are reported in Table 2.

Fig. 3
figure 3

Mean test performance metrics for MobileNet V2 models in training

Table 2 Performance metrics for MobileNet V2 models with classification threshold of 0.16

When tested on the holdout test set of 28 localizer images, the final models achieved a sensitivity of 100%. Other performance metrics are reported in Table 2.

Incorrectly Classified Examples

The incorrectly classified 2D localizer images were analysed using the SHAP explainability method. In the early stages of the research, this demonstrated the need to remove the mandible from the images, as prior to this removal the models were confounded by the presence of dental fillings.

After the images had been cropped and models developed, the SHAP explainability method was used to analyse the incorrectly classified examples in the holdout test set. Three of the 28 images were incorrectly classified by all five models, and five other images were misclassified by at least one of the models. All of these errors were false positives. The average SHAP maps show that bright areas have contributed to the models’ incorrect predictions, including other metal devices (Fig. 4a).

Fig. 4
figure 4

Maps of average SHAP values. Any pixels highlighted in red have contributed to the prediction that an aneurysm clip is present; any pixels highlighted in blue have contributed to the prediction that no aneurysm clip is present. In the case of the true positive, the aneurysm clip has been circled in green for clarity

Correctly Classified Examples

The SHAP explainability method was also used to analyse the localizer images that the models classified correctly. Of the 28 images in the holdout test set, 20 were classified correctly by all five models. The average SHAP maps for the true positives show that the pixels containing aneurysm clips contributed positively to the models’ correct predictions that a clip is present (Fig. 4b). The signal is much stronger than the confounding signals in the false positive predictions, and is much stronger than any signal in the true negative predictions where no clip has been detected (Fig. 4c).

Three-Dimensional CT Images

After models had been trained on three-dimensional CT images, a classification threshold of 0.30 was chosen to maximize sensitivity whilst maintaining a high accuracy and specificity (Fig. 5). The final models achieved a mean test sensitivity of 96%. Other performance metrics are reported in Table 3.

Fig. 5
figure 5

Mean test performance metrics for 3D models in training

Table 3 Performance metrics for 3D models with classification threshold of 0.30

When tested on the holdout test set of 22 three-dimensional CT images, the final models achieved a mean sensitivity of 96%. Other performance metrics are reported in Table 3. Of the 22 images, 19 were correctly classified by all five models. Of the three images that were incorrectly classified by at least one model, two were false positives and one was a false negative.

Discussion

Deep learning has previously been used successfully to detect medical implants. Pre-trained convolutional neural networks have been used to detect pacemakers in chest radiographs with 99.67% accuracy [27] and spinal implants in lumbar spine lateral radiographs with 98.7% precision and 98.2% recall [28]. A convolutional neural network trained from scratch has been used to identify dental implants in X-ray images with 94.0% segmentation accuracy and 71.7% classification accuracy [29]. In another application, a segmentation network has been developed to identify orthopedic implants in hip and knee radiographs with 98.9% accuracy and 100% top-three accuracy, exceeding the performance of five senior orthopedic specialists [30].

This application continues the successful use of deep learning for implant detection and is the first to use deep learning to detect aneurysm clips. The trained models exhibit excellent performance for both localizer images and full CT head scans. Both types of model generalize well to the unseen data in the holdout sets and score particularly highly in terms of sensitivity. The sensitivity of the localizer models is 100% on both the training and the holdout data: there are no dangerous false negatives. The computational resources required to run the models are particularly low in the case of the localizer images.

The use of an explainability method is particularly valuable in this application because it demonstrates that the correct parts of the localizer image are informing the models. In general, the positive (red) signal in the images is strongly localized and more observable than the negative (blue) signal, which is weaker and more distributed. This suggests that the models are being positively informed by the presence of aneurysm clips, and in a more diffuse, low-level way by their absence.

As this application is a potential safety tool, the models have been developed and classification thresholds chosen to maximize sensitivity and minimize false negatives. As a result, they are sometimes confounded by other bright areas in the images, making some false positives likely. This could create additional work for a human operator, but it is a preferable error to dangerous false negatives. The heatmaps also demonstrate that other metal devices such as skull flap fixing plates and skin clips can be responsible for false positives (see Supplementary Fig. 1). These are still valuable to detect for MRI safety. Future work could assess these models on a CT head dataset incorporating a wider range of metallic implants, to analyse whether models trained to detect aneurysm clips specifically generalize to metal implant detection more broadly.

It was anticipated that models developed for full CT heads might perform better than models developed for localizer scans, as the aneurysm clip would be presented in three dimensions and in greater detail. However, the sensitivity of the three-dimensional models was slightly poorer. This may have been due to the presence of too much other confounding detail, or may have been due to the models having been trained from scratch rather than taking advantage of pre-learned patterns. Pre-trained networks were used for the localizer scans due to their ready availability for transfer learning in two-dimensional image data. At this time, there is a notable lack of equivalent pre-trained networks available for transfer learning in three-dimensional image data. If pre-trained three-dimensional networks become available in the future, then they might be successfully leveraged in this application.

Future work could consider using an ensemble model. Ensemble methods are considered the state of the art for many machine learning applications, as they harness the power of weaker learners [31]. An ensemble model for this application could incorporate different learning algorithms, as well as bagging or boosting approaches.
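
As a simple example consistent with the present setup, the five cross-validation models could already be combined by averaging their predicted probabilities, a basic form of ensembling:

```python
import numpy as np

def ensemble_predict(models, x):
    # Average the predicted probabilities of the individual fold models.
    probs = np.stack([m.predict(x).ravel() for m in models])
    return probs.mean(axis=0)

# Example: classify with the previously chosen threshold.
# p = ensemble_predict(fold_models, x_test)
# y_pred = (p >= threshold).astype(int)
```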

Limitations

The size of the dataset is a limitation of this research, a consequence of the rarity of CT scans depicting aneurysm clips. If it were possible to obtain more data, this might enable the development of even more accurate models in training, and a more representative assessment of the models in the holdout set. We have mitigated this limitation to an extent by augmenting the training data with horizontal flips, artificially increasing the effective size of the dataset.

Another limitation of this research is the lack of external validation. External validation sets are difficult to obtain as appropriate publicly available databases do not exist. Our research team is planning, and seeking governance clearance for, such external validation studies. We have mitigated this limitation as far as possible in this study by reserving an unseen holdout test set. However, these data originate from the same source as the training data, and the metrics reported may not be representative of the models’ performance on data from a different distribution. For example, the balance of the data used in this study is not representative of the typical MRI patient population, in which only a small minority would have aneurysm clips present. An external validation set would allow for more accurate assessment of the models’ capability to generalize to other populations.

Conclusion

A pre-trained MobileNet V2 neural network achieved high accuracy and 100% sensitivity for the detection of aneurysm clips in CT localizer scans, and the explainability method demonstrated that the network was focusing on appropriate regions of interest in the images. A trained-from-scratch neural network also achieved high accuracy and sensitivity for the detection of aneurysm clips in full CT head scans. This application could be a useful addition to current processes, enabling automatic safety screening for devices in advance of MRI appointments.