TRandAugment: temporal random augmentation strategy for surgical activity recognition from videos

Purpose Automatic recognition of surgical activities from intraoperative surgical videos is crucial for developing intelligent support systems for computer-assisted interventions. Current state-of-the-art recognition methods are based on deep learning, where data augmentation has shown the potential to improve generalization. This has spurred work on automated and simplified augmentation strategies for image classification and object detection on datasets of still images. Extending such augmentation methods to videos is not straightforward, as the temporal dimension needs to be considered. Furthermore, surgical videos pose additional challenges as they are composed of multiple, interconnected, and long-duration activities. Methods This work proposes a new simplified augmentation method, called TRandAugment, specifically designed for long surgical videos, that treats each video as an assembly of temporal segments and applies consistent but random transformations to each segment. The proposed augmentation method is used to train an end-to-end spatiotemporal model consisting of a CNN (ResNet50) followed by a TCN. Results The effectiveness of the proposed method is demonstrated on two surgical video datasets, namely Bypass40 and CATARACTS, and two tasks, surgical phase and step recognition. TRandAugment adds a performance boost of 1–6% over previous state-of-the-art methods that use manually designed augmentations. Conclusion This work presents a simplified and automated augmentation method for long surgical videos. The proposed method has been validated on different datasets and tasks, indicating the importance of devising temporal augmentation methods for long surgical videos.


Introduction
In the context of computer-assisted interventions, reliable recognition of surgical activities is a fundamental component that could allow automatic analysis of the surgical workflow by providing the valuable semantic information required to support clinical decisions, generate reports, and annotate data [1,2]. These support systems could reduce surgical errors, increase patient safety, and help establish effective and efficient communication protocols [1][2][3]. Following the classification proposed in [4,5], surgical procedures can be divided into surgical activities at different levels of granularity: phases, steps, actions, and motions. Surgical phases are described as a set of surgical aims to be executed for successfully completing the surgical procedure, while steps are defined as a set of surgical actions that need to be carried out to complete a surgical phase. These different activities are annotated as temporal segments of the procedure. Moreover, when performing minimally invasive surgeries, a change in the viewpoint on the anatomy may be required for executing each individual activity.
Previous research studies have tackled the problem of surgical activity recognition by capitalizing on videos recorded during surgery [6][7][8][9][10][11][12]. These works have proposed deep learning models to extract spatial and temporal information from videos. All these methods employ a convolutional neural network (CNN) for visual feature learning followed by hidden Markov models (HMMs) [6], recurrent neural networks (RNNs) [7], long short-term memory networks (LSTMs) [8], temporal convolutional networks (TCNs) [9,10], or transformers [11,12] for temporal feature learning. Although deep learning models have been successfully used for tackling activity recognition in surgeries, training these models requires large volumes of data and an arduous effort for selecting hyperparameters.
One of the most essential components to be considered while training these models is data augmentation. Data augmentation is a commonly used method to generate additional data for improving the training of data-intensive deep learning models for image classification [13][14][15], object detection [16,17], instance segmentation [16,18], etc. Additionally, augmentation has been shown to have an impact on model robustness [19] and on the performance of semi-supervised and self-supervised learning methods [20][21][22][23]. However, specific augmentation policies need to be designed to capture prior knowledge for each domain, which requires expertise and manual work, making data augmentation methods difficult to extend to other domains and applications [14,15,24]. To tackle the challenge of manually designing augmentation policies, recent research papers have proposed reinforcement learning to learn optimal policies [14,15]. These automated methods introduce difficulties of their own, e.g., defining a proxy task and training on it, or searching over 30 parameters. A simplified and more practical method, called RandAugment [24], was recently proposed to address these difficulties. Although the advances in automated data augmentation methods have been significant, these methods have been specifically developed for still images. Recently, a few augmentation methods specifically designed for video have been proposed [17][25][26][27]. These methods apply temporal perturbations successively to the video frames [25] or insert objects (obtained through instance segmentation) from one video onto another [17]. A learning-based method has been proposed in [26] that finds a pair of similar videos and then places objects from one video onto another video's background. In [27], augmentation is applied to video frames ensuring smooth changes in its magnitude based on Fourier sampling. However, automated augmentation methods for videos remain unexplored.
In training video-based surgical activity recognition methods, previous works have used manually selected augmentations: horizontal flip [8,10], rotations [9,10], random cropping [8], translation [9], scale [9], and color jitter [10,12]. These specific augmentation policies have been applied at the image level to train backbone CNNs. On the other hand, no effort has been made to propose augmentation approaches for surgical videos. The temporal dimension in videos assumes particular importance in activity recognition as intraoperative surgical videos are of longer duration compared to videos examined in the computer vision community, and they capture the complete surgical procedure composed of multiple complex activities. This temporality present in both surgical videos and activities needs to be considered and exploited, while designing augmentation policies for training spatiotemporal models.
To this end, the paper introduces a new simplified and automated data augmentation method, called TRandAugment, that aims to incorporate the essential temporal dimension. Inspired by work [24], the TRandAugment method proposes a compact and simple parameterization consisting of only 3 parameters, where one parameter is dedicated to the temporal dimension. TRandAugment is extensively evaluated on the task of surgical activity recognition at two levels of granularity, i.e., phase and step [4], using two large surgical video datasets: Bypass40 [10] and CATARACTS [28].

Methodology
Automated activity recognition methods aim to segment endoscopic videos into surgical activities, i.e., phase or step. To improve the generalizability of activity recognition methods based on deep learning, this section introduces the proposed augmentation method, called TRandAugment, and the spatiotemporal model used to evaluate the method.

TRandAugment
The goal of TRandAugment is to incorporate the temporal dimension present in surgical videos into the data augmentation methods for improving the generalization of activity recognition models. In pursuing this goal, we also want to propose a simplified and automated data augmentation method. Given that a recent method [24] operates only on a two-parameter space (M, N) compared to learned augmentation methods with over 30 parameters [14,15], TRandAugment is designed to require only 3 parameters, where the first two adopt the same parameterization used in [24], while the third additional parameter T is used to characterize the temporal dimension. Similar to previous works [14,24], a set τ of 10 transformations is utilized and applied with uniform probability 1/|τ|. The |τ| = 10 transformations are selected based on the domain knowledge of possible transformations that occur in endoscopic videos. Thus, we have excluded all the augmentations that, when applied, result in drastically different looking images that are highly unlikely to arise in surgical videos, such as posterize, solarize and equalize used in [24] and other novel augmentations proposed in the literature: YOCO [29], MixUp [30], CutMix [31] or AugMix [32].
Fig. 1 Pictographical representation of TRandAugment. A video is segmented into T clips, and a random augmentation t_i, sampled from a list of transforms τ, is applied to clip i. The augmented clips are merged back to form a new video, which is passed as input while training an end-to-end CNN+TCN network that predicts phases or steps
As schematically represented in Fig. 1, the idea of TRandAugment is to apply different transformations to different temporal video segments. Thus, parameter T is introduced to control the number of temporal segments. Each video is split into a random number of segments between 1 and T, and for each segment i, a random set of N transformations t_{i,1}, ..., t_{i,N}, with t_{i,j} ∼ τ, is applied uniformly to all the frames of that segment. The strength of each transformation is represented by the magnitude M, which is linearly scaled between the transformation's minimum and maximum values, mapped to an arbitrarily chosen integer scale from 0 to 30.
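The linear magnitude scaling described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `scale_magnitude` and the example rotation range of 0 to 30 degrees are assumptions, and only the integer scale from 0 to 30 comes from the text.

```python
MAX_LEVEL = 30  # integer magnitude scale described in the text (0..30)


def scale_magnitude(m, min_value, max_value, max_level=MAX_LEVEL):
    """Linearly map an integer magnitude m in [0, max_level] to [min_value, max_value]."""
    if not 0 <= m <= max_level:
        raise ValueError(f"magnitude must be in [0, {max_level}], got {m}")
    return min_value + (max_value - min_value) * m / max_level


# Hypothetical range: a rotation whose strength spans 0 to 30 degrees.
print(scale_magnitude(15, 0.0, 30.0))  # 15.0
print(scale_magnitude(30, 0.0, 30.0))  # 30.0
```

With such a mapping, a single integer M can drive transformations whose natural units differ (degrees, pixels, contrast factors), provided each one declares its own minimum and maximum.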
To maintain a notation consistent with previous methods, in particular [24], the proposed method is parameterized as (M, N , T ), where M and N are defined as the magnitude and number of transformations to apply per segment, and T is the maximum number of temporal segments.
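The (M, N, T) parameterization can be illustrated with a short policy sampler. This is a sketch under stated assumptions, not the authors' code: the text does not specify how segment boundaries are chosen, so contiguous segments of near-equal length are assumed here, the transform names `op_0` ... `op_9` are placeholders for the actual set τ, and `sample_tra_policy` is a hypothetical helper.

```python
import random

# Placeholder names: the actual list of 10 transforms is not reproduced here.
TRANSFORMS = [f"op_{k}" for k in range(10)]


def sample_tra_policy(num_frames, M, N, T, transforms=TRANSFORMS, rng=random):
    """Return, for each frame, the list of (transform, magnitude) pairs to apply.

    Assumption: segments are contiguous and of near-equal length.
    """
    # The number of segments is drawn uniformly from [1, T], capped by the video length.
    num_segments = rng.randint(1, min(T, num_frames))
    # Segment s covers frames bounds[s] .. bounds[s + 1] - 1.
    bounds = [s * num_frames // num_segments for s in range(num_segments + 1)]
    per_frame = []
    for s in range(num_segments):
        # N transforms drawn uniformly from the set, shared by every frame of the segment.
        ops = [(rng.choice(transforms), M) for _ in range(N)]
        per_frame.extend([ops] * (bounds[s + 1] - bounds[s]))
    return per_frame


policy = sample_tra_policy(num_frames=12, M=30, N=1, T=5, rng=random.Random(0))
for frame_idx, ops in enumerate(policy):
    print(frame_idx, ops)
```

The sampler only decides which transforms each frame receives; applying them to the images is left to whatever image library is in use.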

Spatiotemporal model
The spatiotemporal model comprises a ResNet-50 backbone, for visual feature learning, followed by a single-stage TCN (SS-TCN), for temporal modeling. The presented model is a powerful architecture comparable to other recent state-of-the-art methods [9][10][11][12]. Furthermore, it is modular and can easily accommodate new spatial and temporal models that could be proposed for activity recognition. This model is used in all the experiments and is trained end to end for the task of surgical activity recognition considering both phases and steps.
ResNet-50 [33] has been a popular model of choice in many recent works on phase/step recognition [8][9][10][34]. The model is also employed in this work for visual feature learning. For long temporal modeling, TCNs have been shown to outperform RNNs [9,10]. A single-stage model is employed rather than a multi-stage one. This is motivated by the work of [10], where the multi-stage model did not show improvements over the single-stage model for both phase and step recognition. SS-TCN consists of only temporal convolutional layers that perform causal convolutions, which depend only on the current and n previous frames and are therefore suited to online recognition.
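The causality property mentioned above can be made concrete with a small example. The sketch below is a simplified, pure-Python illustration and not the SS-TCN itself: it works on scalar features, omits dilation and non-linearities, and the function name `causal_conv1d` is an assumption. A kernel of length n + 1 looks at the current frame and the n previous ones.

```python
def causal_conv1d(x, kernel):
    """Causal 1D convolution over a sequence x.

    out[t] = sum_j kernel[j] * x[t - j], where frames before the start of the
    video are treated as zeros, so out[t] never depends on frames after t.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            if t - j >= 0:
                acc += w * x[t - j]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0]
y = causal_conv1d(x, kernel=[0.5, 0.5])  # average of the current and previous frame
print(y)  # [0.5, 1.5, 2.5, 3.5]
```

Because each output uses only past and present inputs, predictions for frame t can be produced as soon as frame t is available, which is what online recognition requires.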
The spatiotemporal model takes as input a video containing ϒ frames x_{1:ϒ}. ResNet-50 extracts visual features of size f = 2048 from 224 × 224 × 3 RGB images. The frame-wise features are stacked over time for the TCN model, which outputs predictions ŷ_{1:ϒ}, where ŷ_i is the class label for the current timestamp i, i ∈ [1, ϒ]. Since both the tasks at hand (phase and step) are multi-class classification problems with imbalance in class distribution, following [9,10], a class-weighted cross-entropy loss is used.
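A class-weighted cross-entropy loss can be sketched in a few lines. The text does not state how the class weights are derived, so inverse class frequency is assumed here purely for illustration; the helper names `inverse_frequency_weights` and `weighted_cross_entropy` are hypothetical. The loss is normalized by the sum of the weights of the target classes, which is also how PyTorch's `torch.nn.CrossEntropyLoss` behaves with a `weight` argument and mean reduction.

```python
import math


def inverse_frequency_weights(labels, num_classes):
    """Assumed weighting: w_c = total / (num_classes * count_c), 0 for absent classes."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    total = len(labels)
    return [total / (num_classes * c) if c else 0.0 for c in counts]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p[label], normalized by the sum of the target weights."""
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den


labels = [0, 0, 0, 1]                    # class 1 is under-represented
weights = inverse_frequency_weights(labels, num_classes=2)
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]  # per-frame softmax outputs
print(weights)
print(weighted_cross_entropy(probs, labels, weights))
```

In this toy example the rare class receives three times the weight of the frequent one, so the misclassified frame of class 1 dominates the loss.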

Bypass40 (BY40)
The Bypass40 dataset [10], courteously shared by the authors of [10], comprises 40 Laparoscopic Roux-en-Y gastric bypass (LRYGB) procedures with average video duration of different levels of granularity. A full list of all the phases and steps is presented in [10]. All the videos have a resolution of 854 × 480 or 1920 × 1080 pixels and are recorded at 25 fps. Following the same data split as [10], the dataset has been segregated into 24, 6, and 10 videos for training, validation, and test sets, respectively. The frames have been extracted at 1 fps and resized to ResNet-50's input size of 224 × 224.

CATARACTS (CA50)
The CATARACTS dataset [28,35] consists of 50 videos of cataract surgical procedures. The dataset is annotated per frame with only steps as part of the CATARACTS2020 challenge. A complete list of all 19 steps is tabulated on the challenge website. The 50 videos are split into 25, 5, and 20 videos for the training, validation, and test sets, respectively. Frames are extracted at 1 fps and resized from 1920 × 1080 to 224 × 224.

Training and evaluation
Baselines
TRandAugment, or TRA, is compared against different baselines. RandAugment [24], referred to as RA, is the first comparison, where the augmentations are applied independently for each image in a video. Next, RandAugment is extended to UniformRandAugment, called URA, where augmentation is applied uniformly on all the frames in a video. TRA is a more generalized method encapsulating both RA and URA, where setting T = 1 reduces TRA to URA, while T = ϒ (ϒ: number of frames in a video) transforms TRA to RA. Finally, all the methods are compared against the state-of-the-art MTMS-TCN [10] that used a manually designed 'Custom' set of augmentations (flip, saturation, rotation) for surgical activity recognition.
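The relationship between TRA, URA, and RA can be visualized by looking at how frames are grouped into segments. The sketch below is illustrative only: it fixes the number of segments instead of sampling it, assumes contiguous near-equal segments, and `segment_ids` is a hypothetical helper. Frames that share a segment id would receive the same augmentation.

```python
def segment_ids(num_frames, num_segments):
    """Assign each frame to one of num_segments contiguous, near-equal segments."""
    return [t * num_segments // num_frames for t in range(num_frames)]


print(segment_ids(6, 1))  # [0, 0, 0, 0, 0, 0]  one segment: URA-like, whole video shares a transform
print(segment_ids(6, 2))  # [0, 0, 0, 1, 1, 1]  clip-level augmentation
print(segment_ids(6, 6))  # [0, 1, 2, 3, 4, 5]  one segment per frame: RA-like
```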

Training
In all the experiments, the ResNet-50 backbone model is initialized with ImageNet pretrained weights. Then, the complete ResNet-50 + SS-TCN model is trained in an end-to-end fashion for the task of phase/step recognition. To train the TCN, which requires temporal information, features from all the past frames in the video are cached by utilizing a feature buffer. This feature buffer is reset at the end of the video. The spatiotemporal model is trained for 50 epochs with a learning rate of 1e-5 and a batch size of 64. The proposed method and model have been implemented in PyTorch, and the experiments (∼3500 GPU hours) were run on NVIDIA RTX 6000 and V100 GPUs.
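The feature buffer described above can be pictured as a per-video cache. The following is a minimal sketch, not the authors' PyTorch code: the class name `FeatureBuffer`, its methods, and the use of plain Python lists in place of feature tensors are all assumptions made for illustration.

```python
class FeatureBuffer:
    """Caches the features of all frames seen so far in the current video."""

    def __init__(self):
        self._features = []

    def append(self, feature):
        """Store the newest frame feature and return the full history, oldest first."""
        self._features.append(feature)
        return list(self._features)

    def reset(self):
        """Clear the cache; called at the end of each video."""
        self._features.clear()


buffer = FeatureBuffer()
videos = {"video_a": ["a0", "a1", "a2"], "video_b": ["b0", "b1"]}  # stand-in features
for name, frame_features in videos.items():
    for feature in frame_features:
        history = buffer.append(feature)  # what a temporal model would consume
    print(name, history)
    buffer.reset()
```

Resetting between videos keeps features from one procedure from leaking into the temporal context of the next.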

Evaluation
The effectiveness of the method is measured using accuracy (ACC), precision (PR), recall (RE), and F1-score (F1) metrics. The metrics are computed per video (averaged across classes) and are averaged across all the videos in the given set, following the same evaluation protocol as [9][10][11][23].
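The two-level averaging in this protocol can be sketched as follows. Several details are assumptions, since the text does not spell them out: only classes present in a video's ground truth are averaged, F1 is taken as the mean of per-class F1 scores, and the function names are hypothetical.

```python
def video_metrics(y_true, y_pred):
    """Accuracy plus class-averaged precision, recall and F1 for one video."""
    pairs = list(zip(y_true, y_pred))
    acc = sum(t == p for t, p in pairs) / len(pairs)
    pr, re, f1 = [], [], []
    for c in sorted(set(y_true)):  # assumption: only classes present in this video
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn)  # tp + fn > 0 because c occurs in y_true
        pr.append(p_c)
        re.append(r_c)
        f1.append(2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0)
    n = len(pr)
    return {"ACC": acc, "PR": sum(pr) / n, "RE": sum(re) / n, "F1": sum(f1) / n}


def dataset_metrics(videos):
    """Average the per-video metrics over all (y_true, y_pred) pairs in a set."""
    per_video = [video_metrics(t, p) for t, p in videos]
    return {k: sum(m[k] for m in per_video) / len(per_video) for k in per_video[0]}


videos = [
    ([0, 0, 1, 1], [0, 0, 1, 1]),  # perfectly recognized video
    ([0, 0, 1, 1], [0, 1, 1, 1]),  # one frame of class 0 mislabeled as class 1
]
print(dataset_metrics(videos))
```

Averaging per video first gives every procedure equal weight regardless of its duration, which matters when video lengths vary widely.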

Results and discussion
In this section, we analyze the different components that influence the design of TRandAugment. Initially, we study the importance of temporally consistent augmentations in Sect. "Do temporally consistent augmentations matter?", then we analyze the impact of parameter M in Sect. "Effect of magnitude (M)", the number of transformations in Sect. "Do all augmentations help?" and impact of the parameter T in Sect. "Impact of parameter T on TRA". Finally, we present the performance of the proposed method considering the optimal parameters on both datasets (Sect. "TRandAugment").

Do temporally consistent augmentations matter?
One of the key differences between videos and images is the additional temporal dimension. A natural question is how important temporally consistent augmentations are when training models on videos. To study the effect of temporal consistency, [...] than RA in all the settings. Furthermore, the mean of RA, when averaged across |τ| at both settings of M = {15, 30}, is ∼3-7% below the best-performing model, compared to ∼1% for URA. This indicates the instability of RA due to its policy of independent frame-wise augmentation, which breaks temporal visual consistency. Interestingly, the best RA model is obtained by utilizing a smaller set of augmentations, |τ| = 3, which indicates that the model can learn significantly better when there is less variance in image appearance temporally. All the observations confirm that temporally consistent augmentations are important when training spatiotemporal models.

Effect of magnitude (M)
To study the effect of augmentation magnitude, Table 2 [...] Irrespective of the augmentation method used, higher magnitudes seem to have a direct effect on the performance of the model for different tasks and datasets. However, we notice that TRA performance is below URA at M = 30. This is not a valid comparison, as the other parameters |τ|, N, and T are fixed and sub-optimal. These experiments are performed solely to study the effect of magnitude on URA and TRA independently. The full comparison of TRA against other methods is discussed in Sect. "TRandAugment".

Do all augmentations help?
To study the importance of using all the augmentations, Table 3 lists different experiments in terms of F1-score on the validation set, with N = 1 and T = 5, where subsets of transforms (τ′ ⊂ τ : |τ′| = {3, 5, 9}) are randomly sampled from τ. For the task of step recognition on both datasets, the best model performances are obtained when all transforms are utilized. On the other hand, the model performs best at an [...]

Impact of parameter T on TRA
The key component of the proposed TRA method is the parameter T that captures the variance in the appearance of the frames across a video. TRA is inspected with different settings of parameter T = {1, 3, 5, 8} at two different magnitudes M = {15, 30} while fixing N = 1 and |τ | = 10.
The results in Table 4 show that at T = 5, M = 30 the model achieves the best performance on all the different tasks and across the two datasets. This indicates that augmenting at the clip level benefits the training of activity recognition models and that the proposed TRA parameterization (M, N, T) allows us to easily find optimal parameters.

TRandAugment
Table 5 compares different augmentation methods on the test set with optimal parameters. As noticed earlier, temporally consistent augmentations are beneficial, and hence, both URA and TRA, which enforce this consistency, outperform the image-level augmentation method RA by 1-2% in F1 and ∼3% in accuracy for the task of step recognition on CATARACTS. Additionally, URA and TRA both show an improvement of 1-5% across all the metrics for phase recognition on Bypass40 over the state-of-the-art MTMS-TCN model, which utilized a 'Custom' set of augmentations. We can further notice a significant improvement of 5-11% across all the metrics for recognizing steps on Bypass40. This improvement could be attributed to the larger set of transforms |τ| = 10. TRA, in turn, outperforms URA on both the phase and step recognition tasks and both datasets. TRA achieves a 1-3% improvement in phase and step recognition on Bypass40 and CATARACTS, respectively. Moreover, for step recognition on Bypass40, TRA achieves a +2% and +1.5% improvement in precision and F1-score over URA. The performance improvement of the proposed TRA method over URA could be attributed to the temporally consistent augmentations applied at the clip level. TRA enables the extension of video datasets with videos composed of different segments augmented differently, which, when used in training, improves the generalization of deep learning models. Besides, the parameterization of TRA is independent of the underlying recognition task or dataset, which enables the proposed method to be applicable to other surgical procedures and tasks.

Limitations
The (M, N, T) parameterization of TRandAugment simplifies the process of selecting a good augmentation policy for training that induces both spatial and temporal variations in the input videos. Yet, it does not completely eliminate the search for optimal parameters, which adds computational expense. Further studies are required to better understand if or when datasets or tasks may require a separate search to achieve optimal performance. Another drawback of TRandAugment is that it works only in the input space. A few works in the literature have proposed adding variations in the model's feature space to improve generalizability [36,37]. Unlike input space augmentations, designing feature space augmentations is extremely challenging because the domain and the noise characteristics of the feature space are not well studied. Nevertheless, this could be an interesting extension to our proposed method, especially for training the temporal component of spatiotemporal models.

Conclusion
This paper introduced a new augmentation method called TRandAugment that simplifies data augmentation pipelines. Given a video, it creates pseudo-videos with different clips augmented differently. The method is parameterized with the magnitude (M), the number of augmentations per segment (N), and the maximum number of temporal segments (T). This parameterization provides a simple framework to search for an optimal configuration and operates with a significantly reduced search space, in line with current research in data augmentation. The proposed method has been validated on two large surgical video datasets, considering both the phase and step recognition tasks, and obtains a boost in performance, demonstrating the impact of TRandAugment. New open questions arise on how this method may improve model robustness [19], federated learning or semi-/self-supervised learning [20][21][22][23][34]. Furthermore, the proposed method could be applicable to other tasks, such as tool localization and tracking [38], action triplets [39], and video semantic segmentation [40]. Future work will study the value of TRandAugment in these different settings and tasks.
Acknowledgements The authors would like to thank Deepak Alapatt, Vinkle Srivastav, and Adit Murali for their helpful comments. This work was supported by the ATLAS project. The ATLAS project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 813782. This work was also supported by French state funds managed within the Investissements d'Avenir program by BPI France (project CONDOR) and by the ANR (ANR-16-CE33-0009, ANR-10-IAHU-02). This work was performed using HPC resources from GENCI-IDRIS (Grant 2021-AD011012832).
Funding Open access funding provided by Universitá degli Studi di Verona within the CRUI-CARE Agreement.

Declarations
Ethical Approval The surgical videos were recorded and anonymized following the informed consent of patients in compliance with the local institutional review board (IRB) requirements.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.