A convolutional neural network with self-attention for fully automated metabolic tumor volume delineation of head and neck cancer in [18F]FDG PET/CT

Purpose PET-derived metabolic tumor volume (MTV) and total lesion glycolysis of the primary tumor are known to be prognostic of clinical outcome in head and neck cancer (HNC). Including evaluation of lymph node metastases can further increase the prognostic value of PET, but accurate manual delineation and classification of all lesions are time-consuming and prone to interobserver variability. Our goal, therefore, was the development and evaluation of an automated tool for MTV delineation/classification of the primary tumor and lymph node metastases in PET/CT investigations of HNC patients. Methods Automated lesion delineation was performed with a residual 3D U-Net convolutional neural network (CNN) incorporating a multi-head self-attention block. 698 [18F]FDG PET/CT scans from 3 different sites and 5 public databases were used for network training and testing. An external dataset of 181 [18F]FDG PET/CT scans from 2 additional sites was employed to assess the generalizability of the network. In these data, primary tumor and lymph node (LN) metastases were interactively delineated and labeled by two experienced physicians. Performance of the trained network models was assessed by 5-fold cross-validation in the main dataset and by pooling results from the 5 developed models in the external dataset. The Dice similarity coefficient (DSC) for individual delineation tasks and the primary tumor/metastasis classification accuracy were used as evaluation metrics.
Additionally, a survival analysis using univariate Cox regression was performed comparing achieved group separation for manual and automated delineation, respectively. Results In the cross-validation experiment, delineation of all malignant lesions with the trained U-Net models achieves DSC of 0.885, 0.805, and 0.870 for primary tumor, LN metastases, and the union of both, respectively. In external testing, the DSC reaches 0.850, 0.724, and 0.823 for primary tumor, LN metastases, and the union of both, respectively. The voxel classification accuracy was 98.0% and 97.9% in cross-validation and external data, respectively. Univariate Cox analysis in the cross-validation and the external testing reveals that manually and automatically derived total MTVs are both highly prognostic with respect to overall survival, yielding essentially identical hazard ratios (HR): HR_man = 1.9, p < 0.001 vs. HR_cnn = 1.8, p < 0.001 in cross-validation and HR_man = 1.8, p = 0.011 vs. HR_cnn = 1.9, p = 0.004 in external testing. Conclusion To the best of our knowledge, this work presents the first CNN model for successful MTV delineation and lesion classification in HNC.
In the vast majority of patients, the network delineates and classifies primary tumor and lymph node metastases satisfactorily and only rarely requires more than minimal manual correction. It can thus greatly facilitate study data evaluation in large patient groups and also has clear potential for supervised clinical application. Supplementary Information The online version contains supplementary material available at 10.1007/s00259-023-06197-1.


Neural network architecture
Automated lesion delineation was performed with a modified residual 3D U-Net CNN, see Fig. 1. The network consists of encoder and decoder paths, each of which is an alternating sequence of residual blocks and, respectively, downsampling (2 × 2 × 2 max-pooling) or upsampling (3 × 3 × 3 deconvolution) steps. The residual block combines 3 × 3 × 3 convolution, batch normalization, leaky ReLU, and dropout (rate = 0.1) layers and applies them to the input twice in a row, while also providing a bypass skip-connection to improve the gradient flow and accelerate training. Since the numbers of input and output features can differ, a 1 × 1 × 1 convolution is used in the skip-connection for equalization.
Features identified by the encoder are transferred to the decoder via skip-connections and concatenated with the corresponding decoder features at all image scales except the lowest one. At the bottom of the U-Net the encoder and decoder are connected through a Multi-Head Self-Attention (MHSA) block to improve awareness of the global context in classification of FDG-avid regions. MHSA was first introduced in [1] for natural language processing and was further adapted for image processing. Our implementation of MHSA mostly follows [2] but includes a few modifications. We refer to [2] for further details on, and motivation behind, MHSA while here we only provide a short overview of the method.
In our implementation, the MHSA block comprises two Self-Attention (SA) heads which share the same input. Each SA head establishes a rule according to which information is transferred across the whole image. The rule is defined via projection operators (implemented as 1 × 1 × 1 convolutions) of the input tensor into query, key, and value matrices Q, K, V ∈ R^(n×k). Here, n = 16 × 16 × 4 is the total number of voxels in the input tensor and k = 64 is the dimension of the projection space. This way, each voxel i is represented by vectors qᵢ, kᵢ, and vᵢ of length k which are stored as rows of the respective matrices. The value vector vⱼ holds the relevant information about voxel j which is going to be transferred, while the vectors qᵢ and kⱼ determine the information transfer rate from voxel j to voxel i based on their similarity. In our implementation, we define the similarity between qᵢ and kⱼ as the cosine of the angle between them, calculated as the dot product of the normalized vectors. The similarity coefficients of voxel i with every other voxel j are softmaxed and written in the form of an attention matrix A ∈ R^(n×n); each element Aᵢⱼ denotes the relative importance of voxel j for the classification of voxel i. The output of the SA head for each voxel i is the sum of the values vⱼ weighted with Aᵢⱼ:

SA(X) = softmax(Q̃K̃ᵀ) V = A V,    (1)

where the tilde denotes row-wise L₂-normalization of the matrix and the softmax is applied row-wise. The outputs of all SA heads are concatenated and combined via another 1 × 1 × 1 convolution (without bias).
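The computation of a single SA head can be sketched in NumPy as follows. This is a minimal illustration with toy dimensions; the matrix sizes and the random projection weights are placeholders, not the trained network's parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize_rows(m):
    # Row-wise L2 normalization, so dot products become cosine similarities
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def sa_head(x, wq, wk, wv):
    """One self-attention head with cosine similarity.

    x          : (n, d) flattened input, one row of d features per voxel
    wq, wk, wv : (d, k) projections (standing in for the 1 x 1 x 1 convolutions)
    """
    q, k, v = x @ wq, x @ wk, x @ wv        # queries, keys, values: (n, k)
    a = softmax(l2_normalize_rows(q) @ l2_normalize_rows(k).T, axis=1)
    return a @ v                            # output_i = sum_j A_ij * v_j

# Toy dimensions: n = 8 voxels, d = 6 input features, k = 4 projection dims
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6))
out = sa_head(x, *(rng.normal(size=(6, 4)) for _ in range(3)))
print(out.shape)  # (8, 4)
```

Each row of the attention matrix sums to one after the softmax, so every output voxel is a convex combination of the value vectors of all voxels, which is what gives the head its global receptive field.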
Importantly, the SA mechanism has no means to account for the relative position of the voxels in the image. To give the SA heads access to this spatial information, the input tensor is concatenated with a positional encoding tensor [1]. Moreover, we added a residual connection around the MHSA block to improve the gradient flow.
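The exact encoding used here is not spelled out in the text; as a sketch, the sinusoidal positional encoding of [1], applied to the flattened voxel index and concatenated with the features, could look like this (the dimensions below are illustrative assumptions):

```python
import numpy as np

def sinusoidal_encoding(n_pos, d_enc):
    """Sinusoidal positional encoding in the style of [1]:
    alternating sin/cos of the position at geometrically spaced frequencies."""
    pos = np.arange(n_pos)[:, None]                  # (n_pos, 1)
    i = np.arange(d_enc // 2)[None, :]               # (1, d_enc / 2)
    angle = pos / (10000.0 ** (2 * i / d_enc))
    enc = np.empty((n_pos, d_enc))
    enc[:, 0::2] = np.sin(angle)
    enc[:, 1::2] = np.cos(angle)
    return enc

# Concatenate the encoding with the flattened feature tensor so the
# attention heads can distinguish voxel positions (d and d_enc are
# illustrative; n matches the 16 x 16 x 4 bottleneck stated above).
n, d, d_enc = 16 * 16 * 4, 32, 8
features = np.zeros((n, d))
x_in = np.concatenate([features, sinusoidal_encoding(n, d_enc)], axis=1)
print(x_in.shape)  # (1024, 40)
```

After this concatenation, two voxels with identical features but different positions produce different queries and keys, which is the property the plain SA mechanism lacks.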
The described architecture was implemented using the Apache MXNet (version 1.9.0) package for the R language and environment for statistical computing (version 4.2.0) [3].

Figure 1: Architecture of the utilized CNN. Numbers above and beside each block designate the number of feature channels and the matrix size at the given state, respectively. The images on the left are exemplary CT (top) and PET (bottom) images (input) and the image on the right is the corresponding output image (probability maps of the background, primary tumor, and lymph node metastases shown in black, red, and green, respectively)

Loss function and evaluation metrics
The loss function used for the network training is the sum of the soft Dice and Cross-Entropy (CE) losses, L = L_Dice + L_CE, with

L_Dice = −(1/C) Σⱼ (2 Σᵢ yᵢⱼ pᵢⱼ) / (Σᵢ yᵢⱼ + Σᵢ pᵢⱼ + ε)    (2)

L_CE = −(1/N) Σᵢ Σⱼ yᵢⱼ log pᵢⱼ,

where yᵢⱼ and pᵢⱼ are, respectively, the ground truth (binary) and predicted probability of voxel i belonging to class j, N is the number of voxels in the training batch, and C = 3 is the number of classes. The smoothing constant ε = 1 is added to the denominator in (2) for numerical stability. The evaluation metric used for monitoring the training process in the validation data was the multiclass soft Dice function. The formula for the soft Dice metric is similar to (2), differing in the sign of the expression and in the fact that the loss function is calculated separately for each (relatively small) image batch while the evaluation metric is computed for the whole validation dataset at once. The variance of the evaluation metric was reduced at runtime using exponential smoothing (smoothing factor α = 0.6).

All PET data were reconstructed including the necessary attenuation, randoms, and scatter corrections. The corresponding CTs also served as input for training and testing. The matrix size of all CTs was 512 × 512. The in-plane voxel size was 0.97–1.37 mm and the through-plane voxel size was 2–5 mm. No postprocessing filter was applied to the data reconstructed with the BLOB-OS-TF method. All other data for which this information was available were postprocessed with a Gaussian filter of 2–5 mm FWHM. In 176 cases this information was not available.
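A minimal NumPy sketch of such a combined loss, assuming the per-class Dice terms are averaged over the C classes (the function name and the exact class weighting are our own illustrative choices, not necessarily those of the trained model):

```python
import numpy as np

def dice_ce_loss(y, p, eps=1.0):
    """Sum of soft Dice and cross-entropy losses for one batch.

    y : (N, C) one-hot ground truth
    p : (N, C) predicted class probabilities
    """
    inter = (y * p).sum(axis=0)                              # per-class overlap
    dice = 2 * inter / (y.sum(axis=0) + p.sum(axis=0) + eps)  # soft Dice per class
    dice_loss = -dice.mean()                                 # negative mean soft Dice
    ce_loss = -(y * np.log(p + 1e-12)).sum(axis=1).mean()    # cross-entropy
    return dice_loss + ce_loss

# Toy check with N = 4 voxels and C = 3 classes: confident correct
# predictions must score lower than uniform (uninformative) ones
y = np.eye(3)[[0, 1, 2, 0]]                                  # one-hot labels
p_good = np.where(y == 1, 0.98, 0.01)
p_good /= p_good.sum(axis=1, keepdims=True)
p_uniform = np.full((4, 3), 1 / 3)
print(dice_ce_loss(y, p_good) < dice_ce_loss(y, p_uniform))  # True
```

The small constant added inside the logarithm guards against log(0) for hard-zero probabilities; it plays the same numerical-stability role for the CE term that ε plays in the Dice denominator.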

Figure 2: Logs of the training process as represented by the evaluation metric (multiclass soft Dice) for training and validation subsets. Each plot corresponds to the respective split of the main dataset into training+validation and test subsets (fold). The dashed line marks the epoch for which the maximum of the evaluation metric was achieved. Note that the training DSC stays generally lower than the validation DSC due to dropout and data augmentations applied during the training.