1 Introduction

Prostate cancer is one of the most common neoplasms in men [6]. This indicates the importance of developing systems for its efficient detection, treatment, and monitoring. The gold standard of cancer diagnosis is the study of histopathology; however, due to high variability in the structure of the prostate gland, particularly among older patients, the selection of optimal sites for biopsy remains challenging. This explains the necessity of medical imaging. The most established imaging modality for prostate cancer detection is multimodal magnetic resonance imaging (MRI). However, the interpretation of the multimodal 3D images requires time and expertise from radiologists. The increasing average age of patients and the rising prevalence of cancers place intense pressure on medical organizations to supply enough skilled personnel to meet growing demand. One possible solution for alleviating this problem lies in the design of automated systems for cancer detection. This, in turn, has led to growing demand for high quality datasets and deep learning algorithms. Both solutions are undergoing active development at the National Information Processing Institute.

The selection of architecture is one of the most crucial decisions influencing a model’s performance. Until recently, most research in computer vision was based on convolutional neural networks, while natural language processing witnessed an explosion of transformer-based architectures. However, recent research in computer vision indicates that transformer-based architectures can perform consistently better than convolutional neural networks [8]. One of the main characteristics of convolutional neural networks is that they force models to encode the local co-occurrence of image features, which has proven to be a significant inductive bias. Pure transformers do not share this characteristic; they learn the spatial correlations between image features via attention mechanisms. This grants the models additional degrees of freedom that enable them to learn nonlocal, long-range dependencies in images, at the cost of requiring larger datasets to achieve the same performance. Moreover, the newest research [19] tackles the high memory requirements of unmodified transformer architectures and the technical difficulties of training larger models on graphics processing units. One solution involves fusing convolutional and transformer-based components into a hybrid transformer that takes advantage of both. This can be achieved by inserting a transformer into different layers of a U-shaped architecture, by composing architectures, or by applying attention mechanisms to features computed by convolutional neural networks [8]. The authors of this article concentrated on the first type of hybrid architecture, as it has already proven efficient in multimodal MRI settings [7] and specifically in prostate cancer detection [17]. At the time of writing, no consensus exists on the best available transformer-based architecture for prostate cancer detection and segmentation.
This points to the necessity of further research and experimentation, the preliminary results of which are presented below.

2 Material and Methods

The data used to train and validate the model was accessed from Artificial Intelligence and Radiologists at Prostate Cancer Detection in MRI: The PI-CAI Challenge [1]. The data encompasses 1,500 partially labelled cases of prostate biparametric MRI (bpMRI). The labels, when present, indicate the locations of prostate cancer. The algorithm described below utilized T2-weighted imaging (T2W), axially acquired high b-value (\(\ge \) 1400 s/mm\(^2\)) diffusion-weighted imaging (DWI), and axial apparent diffusion coefficient maps (ADCs). The labels were annotated manually by human experts; changes of International Society of Urological Pathology (ISUP) grade group 2 or higher were considered significant. The main library used in the work was Monai [4], which is a PyTorch-based [14] framework for deep learning in the medical imaging domain. To improve the code structure and training time, the code was refactored for use with PyTorch Lightning [5]. Image preprocessing followed the algorithm proposed in the PI-CAI Challenge [1], which is based on the nnUnet [9] architecture. All preprocessing steps were implemented as Monai transforms. Image augmentations were performed using the batchgenerators library [10]. To improve the reproducibility of the algorithm, training and inference were conducted using Docker containers [13]. All experiments were performed on a Google Cloud cluster using a server with an NVIDIA A100 40 GB GPU.

2.1 Preprocessing

The MRI data was normalized per channel using z-score normalization. The image shape was set to (256, 256, 32) so that it is a multiple of sixteen along each axis, as the chosen architecture requires. The spacing of the dataset was highly inhomogeneous; for this reason, all images were resampled to a voxel size of (0.5, 0.5, 3.0) mm. Image augmentations were performed using the batchgenerators library [10] and encompassed Gaussian noise, elastic deformations, Gaussian blur, brightness modifications, contrast augmentations, simulations of low resolution, and mirroring. All labels were converted to binary masks and included in any augmentations that led to spatial deformations of the original images.
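The per-channel z-score step can be sketched in plain NumPy. This is an illustration of the operation only, not the actual Monai transform used in the pipeline; the channel layout and toy dimensions are assumptions:

```python
import numpy as np

def zscore_per_channel(image: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Z-score normalization applied independently to each channel of a
    (C, H, W, D) volume; each channel ends up with zero mean and unit std."""
    mean = image.mean(axis=(1, 2, 3), keepdims=True)
    std = image.std(axis=(1, 2, 3), keepdims=True)
    return (image - mean) / (std + eps)

# Toy volume standing in for a three-channel bpMRI case (T2W, DWI, ADC).
rng = np.random.default_rng(0)
volume = rng.normal(loc=2.0, scale=5.0, size=(3, 16, 16, 8))
normalized = zscore_per_channel(volume)
```

In the actual pipeline, the equivalent effect is obtained with a channel-wise intensity-normalization transform applied before resampling and augmentation.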

2.2 Deep Learning Architecture

We selected Swin UNETR [7] as the architecture because it demonstrates characteristics that are crucial for the further development and fine-tuning of the algorithm on the new dataset in development. The neural network architecture is based on transformers, which has multiple advantages over traditional, convolution-based architectures. Primarily, it increases the receptive field, which enables the learning of long-range image dependencies. It also partially avoids the translation invariance of convolutions, which, in the context of medical imaging, can lead to the loss of relevant location-based information. Transformer-based architectures also generally have higher expressive power due to their less pronounced inductive bias. However, such architectures also cause difficulties due to their high memory footprint and relatively poor performance on small datasets (a consequence of that reduced inductive bias). The architecture is summarised in Fig. 1.

Fig. 1.

A simplified schematic diagram of Swin UNETR, on the basis of Fig. 1 from Hatamizadeh et al. [7]. The input comprises four channels (whole-gland segmentation, ADC, HBV, and T2W values); the output is the cancer segmentation.

For the current work and the dataset in development, the Swin UNETR architecture has additional crucial characteristics that are well suited to modelling multimodal images. As a transformer architecture, it is possible to extend Swin UNETR to incorporate clinical data in tabular form.

2.3 Optimization

The model’s optimization was implemented using the PyTorch AdamW [12] optimizer. Cosine annealing with warm restarts [11] was used as the learning-rate scheduler, and the initial learning rate was established by the learning-rate finder [18] implemented in PyTorch Lightning.
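The schedule from [11] can be written out explicitly. The function below reproduces the cosine-annealing-with-warm-restarts formula in plain Python; the learning-rate bounds and cycle lengths are illustrative defaults, not the values used in the study (in practice, PyTorch's built-in CosineAnnealingWarmRestarts scheduler is used instead):

```python
import math

def cosine_warm_restart_lr(step, eta_max=1e-3, eta_min=0.0, t0=10, t_mult=2):
    """Learning rate under cosine annealing with warm restarts (SGDR).
    t0 is the length of the first cycle; each subsequent cycle is
    t_mult times longer. At the start of every cycle the rate resets
    to eta_max, then decays along a cosine toward eta_min."""
    t_i, t_cur = t0, step
    while t_cur >= t_i:       # locate the current cycle
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

For example, with the defaults above the rate starts at eta_max, falls to the midpoint halfway through the first cycle, and jumps back to eta_max at step 10 when the first restart occurs.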

2.4 Hyperparameter Selection

Hyperparameter tuning was performed using a genetic algorithm implemented in the Optuna [2] library. It covered the selection of the optimizer, architecture-related parameters, and optimizer-related settings such as the learning-rate scheduler.

Fig. 2.

A transverse T2W image of the prostate. In green, the gold standard label indicates prostate cancer; in blue, the changes detected by the neural network (and on the left, after being filtered by post-processing).

2.5 Postprocessing

The training was conducted as a five-fold cross-validation using the splits provided by the contest organizers, and the outputs of the folds were combined by a mean ensemble algorithm. The model’s output was passed through a sigmoid activation function before lesion candidates were extracted using the report-guided-annotation library [3]. The proposed lesions were analyzed further by assessing simple radiomic characteristics relevant to the task at hand; this can help increase the model’s precision by filtering out some false positive results. Proposed lesions were assessed for their:

  • size, where excessively large and excessively small lesions were filtered out;

  • elongation and roundness, where highly elongated changes were filtered out, as they typically represented the obturator internus muscle or some of the large vessels in the pelvis;

  • hypointensity on the ADC map and hyperintensity on the high b-value DW image, quantified for each modality as the difference between a lesion’s mean value and that of its neighborhood. Because the presence of hyperintense lesions on a high b-value DW image with a corresponding hypointense signal on the ADC map is typical of prostate cancer, lesions that failed to meet this criterion were filtered out.
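The filtering rules above can be sketched as a simple predicate over precomputed lesion features. The feature keys and all threshold values below are illustrative placeholders, not those used in the study:

```python
def keep_lesion(lesion: dict,
                min_size=50.0, max_size=50_000.0,       # mm^3, illustrative
                max_elongation=4.0, min_roundness=0.4,  # illustrative thresholds
                min_contrast=0.0) -> bool:
    """Return True if a candidate lesion passes the radiomic filters:
    plausible size, non-elongated shape, and the expected DWI/ADC
    intensity pattern relative to its neighborhood."""
    if not (min_size <= lesion["size_mm3"] <= max_size):
        return False  # too small or too large
    if lesion["elongation"] > max_elongation or lesion["roundness"] < min_roundness:
        return False  # elongated shapes: vessels, obturator internus muscle
    # Cancer is typically DWI-hyperintense and ADC-hypointense vs. its neighborhood.
    contrast = lesion["dwi_vs_neighborhood"] - lesion["adc_vs_neighborhood"]
    return contrast > min_contrast

# A candidate with a typical cancer-like feature profile.
typical = {"size_mm3": 500.0, "elongation": 1.5, "roundness": 0.8,
           "dwi_vs_neighborhood": 10.0, "adc_vs_neighborhood": -10.0}
```

A candidate failing any single criterion (for instance, a roundness below the threshold) is discarded before evaluation.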

Figure 2 presents an example of the algorithm output, before and after the changes are filtered out by their radiomic features.
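The fold-combination step described at the start of this subsection (a sigmoid per fold, then a mean across folds) can be sketched in NumPy; the toy logits are for illustration only:

```python
import numpy as np

def mean_ensemble(fold_logits):
    """Combine per-fold raw outputs: apply a sigmoid to each fold's
    logits, then average the resulting probability maps across folds."""
    probs = [1.0 / (1.0 + np.exp(-logits)) for logits in fold_logits]
    return np.mean(probs, axis=0)

# Five folds of toy logits for a tiny 2x2 "image"; by symmetry of the
# sigmoid, these average to a probability of exactly 0.5 everywhere.
folds = [np.full((2, 2), v) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)]
probability_map = mean_ensemble(folds)
```

The resulting probability map is then thresholded and split into lesion candidates before the radiomic filtering described above.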

Fig. 3.

A histogram of the distribution of roundness values: in blue, false positives; in red, true positives. The number of samples per histogram bin is scaled logarithmically.

Table 1. A summary of the simple shape statistics of segmented instances

3 Results and Discussion

Validation of the algorithm was performed using the PI-CAI evaluation library [15] on the validation dataset provided by the contest organizers. Preliminary results for the model give a Ranking Score of 0.531, an Area Under the Receiver Operating Characteristic curve of 0.686, and an Average Precision of 0.376. An analysis of simple radiomic characteristics was performed and is summarized in Table 1. For each measured quantity—elongation, physical size, and roundness—the incorrectly segmented instances presented approximately twice the standard deviation, which indicates far higher variability. This also suggests a far wider distribution of the aforementioned quantities and the possibility of identifying thresholds that flag some segmented instances as false positives with high probability. As an example, in Fig. 3, one can observe that all segmented instances in the dataset with roundness lower than 0.4 were false positives. A similar analysis can be performed for the other quantities. However, final conclusions regarding increases in model specificity through radiomic-based postprocessing require further study.

The results suggest that the model performs comparably to the state-of-the-art non-transformer-based baseline architectures provided by the contest organizers. However, a significant number of the top-ranking results that are presented on the contest leaderboard are based on transformer architectures. This demonstrates their impressive ability to learn the presented task and the presence of further opportunities for optimization.

4 Conclusions

This study indicates the usefulness of new transformer-based architectures in multimodal three-dimensional medical imaging. An additional feature considered necessary for analyzing the dataset in development is the proven ability of transformer-based architectures to incorporate data from different sources [16]. This provides a strong base for incorporating clinical data directly into the neural network architecture. The radiomic analysis performed in the postprocessing step proved helpful in the study by increasing the model’s specificity; work on more advanced radiomic analysis is therefore justified. The use of modern PyTorch-based libraries enabled efficient training, which further demonstrates their utility. Such tools can serve as the basis for additional work on the algorithm’s development.