1 Introduction

In recent years, endoscopic surgery procedures as well as imaging technology have advanced rapidly. These advances enable physicians to perform minimally invasive surgeries. As a side effect, the recorded surgery videos benefit the surgeons' work, as they provide an excellent basis for documentation, training of young surgeons, and medical research. Prior work supporting these aims has been conducted by our research group in the area of endoscopic video analysis, such as a subjective quality assessment of the impact of compression on perceived semantic quality [13], instrument classification in laparoscopic videos [17], and extraction and linking of endoscopic key-frames to videos [3, 23]. In this work, we restrict ourselves to a very specific field of minimally invasive surgery in the context of gynecology. In particular, we base our work on videos showing surgical treatment of myoma resection and endometriosis. Our aim is to establish a baseline for (semi-)automatic documentation of the aforementioned surgical interventions. To this end, we want to achieve semantic classification of video shots displaying surgical tasks and various anatomical structures relevant to gynecological surgery. Standard hand-crafted features lack the expressive power for high-level classification tasks in this domain [2]. In contrast, CNNs have been used successfully for such problems in general image and video domains [7, 25]. Multiple models have been proposed for semantic classification of video shots, i.e., single frame, early fusion, late fusion, and slow fusion [6]. The importance of deep learning in medical image analysis, and of content-based processing and analysis of endoscopic images and video, is also apparent from the work of Litjens et al. [9] and Muenzer et al. [12], respectively.

As stated above, we aim at creating a baseline for semi-automatic documentation and therefore restrict ourselves to a single-frame model. Hence, the driving question behind our research is:

  • How well do CNN-based single-frame models for semantic shot classification in the field of gynecological surgery, a special domain of laparoscopic surgery, perform?

In order to answer the aforementioned question, we identify frequent surgical tasks and anatomical classes in cooperation with medical experts from the regional hospital (LKH) Villach in Austria. Based on this expert knowledge and over 100 video recordings of surgical treatments, we generate a dataset of scenes showing surgical actions and anatomical structures in gynecological surgery. The dataset comprises 13 different semantic classes (five anatomy and eight action classes) and consists of about 9 h of annotated video material. Furthermore, we base our work on two well-known CNN architectures: AlexNet [7] and GoogLeNet [25]. For both subsets, surgical action and anatomy, we adapt the classification layer of the aforementioned networks, train the networks from scratch, and evaluate the predictive performance of the resulting networks. The division into action and anatomical structure classes is reasonable, as we employ a single-label prediction model and surgical actions almost always show anatomical structures as well. We also evaluate the use of high-level CNN features (from the AlexNet classification layer as well as the fully connected layers fc6 and fc7) for a multi-class SVM classifier in the domain of endoscopic surgery videos in gynecology.

This work is novel, as there is no prior comparison of different CNN models and SVM classifiers using CNN-extracted features for the use case of shot classification in gynecologic surgery. We expect that advances in the general domain transfer to our specialized use case; in particular, we expect GoogLeNet to achieve better predictive performance than AlexNet. Furthermore, we expect that off-the-shelf CNN features do not work as well for classification as the end-to-end trained CNN models. Another contribution of this work is a detailed discussion of important semantic content classes in the expert domain of minimally invasive gynecologic surgery, which is relevant to colleagues working in the field of medical video analysis. The remainder of this paper is structured as follows. First, we discuss related work in medical imaging on the topics of computer-aided diagnosis, transfer learning, and semantic video classification. In Section 3, we describe the data annotation process as well as the data used for training and testing the CNN models and SVMs. Details on learning are presented in Section 4. We evaluate the results in Section 5 and draw conclusions and outline possible future work in Section 6.

2 Related work

For the use case of classifying interstitial lung diseases, Li et al. [8] provide a simple CNN model containing a single convolutional layer. They achieve per-class precision and recall between 0.8 and 0.9 for classification into five classes (normal, emphysema, ground glass, fibrosis, and micro-nodules), outperforming SIFT features as well as Restricted Boltzmann Machines. Anthimopoulos et al. [2] propose a deep CNN model containing five convolutional layers for the classification of CT images into seven classes of interstitial lung diseases (healthy, ground glass opacity, micronodules, consolidation, reticulation, and honeycombing). Their results imply that, for this use case, their CNN approach outperforms other CNNs as well as state-of-the-art methods using handcrafted features. In the work of Yan et al. [29], a multi-stage deep learning framework is presented, with which the authors address the problem of body-part recognition in MRI images. Overall, they achieve the best performance in terms of recall, precision, and f-score compared with logistic regression, SVMs, and CNNs. The importance of CNNs in medical applications is also apparent from their use in other applications such as nucleus segmentation [28], polyp detection in colonoscopy videos [15], microcalcification detection in digital breast tomosynthesis [22], mitosis detection in breast cancer histology [1], and short-term breast cancer risk prediction [19]. Our work differs from the aforementioned research in that, in contrast to the classification of a state (e.g., healthy or consolidation, type of tissue), we aim at classifying both anatomical structures and surgical actions. Furthermore, no efforts have yet been made regarding the classification of images extracted from laparoscopic surgery videos. Fine-tuning and transfer learning effects of CNNs are covered in recent literature by Shin et al. [24] as well as Tajbakhsh et al. [26]. These works are based on the use cases of lymph node detection, interstitial lung disease classification, polyp detection and image quality assessment in colonoscopy, pulmonary embolism detection in computed tomography images, and intima-media boundary segmentation in ultrasonographic images. Their results imply that CNNs are suitable for computer-aided diagnosis problems and that transfer learning from large-scale annotated natural image datasets is beneficial for performance (which, according to our preliminary studies, does not apply to the problem of scene classification). For colonic polyp classification, Ribeiro et al. [21] propose transfer learning using off-the-shelf CNN features. Based on high-level CNN features (from CNNs trained for object recognition), Ng et al. [4] use semantic Fisher vectors for semantic classification of natural video scenes. Their results reach state-of-the-art performance on the MIT Indoor and SUN datasets. For a large-scale YouTube video dataset, Karpathy et al. [6] give an overview of scene classification models based on CNNs, i.e., single frame, late fusion, early fusion, and slow fusion. Their results imply that the naive single-frame model (which is agnostic to temporal information), despite its simplicity, already provides strong performance. Ng et al. [30] compare single-frame models for scene classification with slow fusion and LSTM-based models. In the domain of cataract surgery videos, Quellec et al. [20] propose a temporal segmentation and recognition of tasks.
The temporal segmentation is based on the detection of idle phases, which is achieved by nearest-neighbor search in a reference dataset. Primus et al. [11] provide a video segmentation for endoscopic surgeries based on the analysis of spatial and temporal motion changes. For the use case of cholecystectomy, a special form of laparoscopic surgery, Primus et al. [18] provide a rule-based method to temporally segment a surgery into different phases. The recognition of the number and kind of instruments used (which is the topic of their previous work [17]) acts as the main indication of a surgery phase. Shot boundary detection in cholecystectomy surgery videos using Gaussian Mixture Models and a Variational Bayesian algorithm is investigated by Loukas et al. [10]. The work of Twinanda et al. [27] also focuses on the use case of cholecystectomy. They successfully apply CNNs, SVMs, and HHMMs for the detection of surgical phases. Our envisioned classification differs from the use cases mentioned above, as in cholecystectomy there are predefined surgical phases, whereas in other fields of laparoscopic surgery (such as gynecology) there is no general consensus on such surgical phases. Moreover, we do not aim at defining shot boundaries. The work most closely related to this paper is our own previous work [16], in which we performed an exploratory investigation of shot classification in the laparoscopic surgery domain. However, we made no distinction between surgical actions and anatomical structures there, which resulted in poor performance for the anatomical structure classes.

3 Laparoscopic gynecology video database

For this work, we analyze 111 different gynecological surgery videos. These videos contain scenes of laparoscopic endometriosis treatment and laparoscopic myoma resection and have durations ranging from 20 min to 6 h. Analysis and discussion with medical experts in gynecology at the regional hospital (LKH) Villach (Austria) resulted in the identification of two main aspects of the individual scenes: action and anatomy.

Anatomy

This type of video scene features little to no surgical action apart from moving tissue and organs. The purpose of diagnosis scenes is the assessment of pathologies of specific organs, such as the ovaries, uterus, or liver. Hence, diagnosis scenes are relevant for documenting the disease as well as its treatment. These scenes are important for medical research and teaching purposes. A second aspect of diagnosis scenes is to document the treatment outcome, i.e., which actions were performed and what the tissue looks like after treatment. In addition to disease treatment documentation, diagnosis scenes can be valuable whenever postoperative complications occur. According to our use case of myoma resection and endometriosis treatment, we identify the following (sub-)classes as diagnosis scenes of interest: Uterus, Ovaries, Oviduct, Liver, and Colon. Please note that this is not a comprehensive list of anatomical structures visible in the surgery videos, but it covers the most important organs encountered during surgical treatment. For an overview of the anatomical structure classes, please refer to Fig. 1.

Fig. 1

An overview of the anatomical structure classes uterus, ovary, oviduct, liver, and colon. The frames are extracted from the annotated dataset

Action

The class of surgical action video scenes features significant interaction with the patient's tissue and organs using a variety of surgical instruments. These scenes represent the main physical work for the surgeon. Their automatic classification is relevant for documentation and, even more so, for teaching certain operation techniques. The main aspect of these scenes is the use of medical instruments, e.g., the suction and irrigation device, graspers, monopolar needles, needle holders, or scissors. We identify the (sub-)classes Suction & Irrigation, Suture, Dissection (blunt), Cutting, Cutting (cold), Sling, Coagulation, and Injection as the most common surgical actions during laparoscopic endometriosis treatment and myoma resection in our dataset (see Fig. 2). Of course, there are several other actions performed, such as tissue extraction or stapling, but as mentioned before, we are interested in the most common and most important actions.

Fig. 2

An overview of the surgical actions coagulation, sling, injection, irrigation, suture, cutting (cold), cutting, and blunt dissection. The frames are extracted from the annotated dataset

3.1 Annotation process

We derive the best matching class for a single shot implicitly from the camera position and the current action, e.g., the action in the center of the image or the organ being inspected by the surgeon is the action or object of interest. For the surgical action classes, there is the issue that a shot is likely to also contain frames that could be classified as a diagnosis class. For example, suturing the ovary may contain images showing the ovary without a surgical needle, or in which the suture is not clearly visible. On the one hand, such a frame does not look like it belongs to a suturing shot; on the other hand, it does belong to the suturing shot, as the image has been recorded in its context. For the annotation of our dataset, we choose the latter interpretation and annotate such frames as the surgical task by defining the beginning and end of the surgical action. Each frame from the beginning to the end of a shot is labeled with the corresponding shot label. As a consequence, the dataset may also contain blurry frames or frames in which instruments cover large parts of the camera's field of view. We argue that these frames are nonetheless part of the corresponding shot and thus correctly labeled. Prior to the annotation process, our annotators were trained by medical experts. The annotations are cross-checked by a single annotator and trimmed in length or corrected when necessary. We do not filter blurry or irrelevant frames, as we are interested in a baseline evaluation without any preprocessing (except for resizing and center cropping) of the raw video frames. Thus, we leave the temporal dependencies within the annotated scenes intact.
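As an illustration of this shot-to-frame labeling, the following sketch expands shot-level annotations into per-frame labels. The field names are hypothetical and only serve to illustrate the idea; our actual annotation tooling differs in detail.

    from collections import namedtuple

    # Hypothetical representation of one annotated shot: every frame between
    # start_frame and end_frame (inclusive) inherits the shot's label, including
    # blurry frames or frames in which an instrument covers the view.
    Shot = namedtuple("Shot", ["video_id", "start_frame", "end_frame", "label"])

    def frame_labels(shots):
        """Expand shot-level annotations into per-frame labels."""
        labels = {}  # (video_id, frame_index) -> class label
        for shot in shots:
            for frame in range(shot.start_frame, shot.end_frame + 1):
                labels[(shot.video_id, frame)] = shot.label
        return labels

    # Example: a suturing shot spanning frames 1200 to 1650 of a video "v042"
    example = [Shot("v042", 1200, 1650, "Suture")]
    print(len(frame_labels(example)))  # 451 labeled frames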

3.2 Semantic content classes

Due to legal restrictions, we are not able to publish the dataset used. To allow for partial repeatability, we give a detailed explanation of the individual classes in the following.

Suction & Irrigation:

These scenes feature the use of the suction and irrigation tube. The purpose of irrigation is to clean tissue in order to provide a clear field of view for the surgeon. The main visual feature is a jet of liquid. Suction is the opposite of irrigation: it is used to remove liquids. Classification problems in this class arise whenever the suction and irrigation tube is used for positioning tissue or for palpation.

Suture:

The main characteristic of suturing scenes is the visible surgical needle and suture. In general, the surgical needle can be of round or straight shape. During the process of suturing, the surgical needle is often only partially visible, if at all. The suture can vary in type, thickness, and color. An additional characteristic of these scenes is the use of the knot pusher, which is preceded by a scene in which the suture is visible and little motion occurs.

Cutting (cold):

Scenes of cold cutting, as annotated in this dataset, feature the separation of tissue with a sharp instrument, such as a scalpel or scissors. Characteristic of this type of scene is the use of multiple instruments: the instrument used for the dissection itself (e.g., scissors) and a grasper for fixating the tissue. This characterization applies to cutting and blunt dissection as well.

Cutting:

Cutting scenes show surgical separation of tissue using electro-surgical technology such as monopolar needles. Occasionally, a bright dot can be seen at the tip of the instrument. A low to medium emission of smoke emerges from the coagulated and separated tissue.

Dissection (blunt):

Blunt dissection scenes feature the use of blunt instruments for the dissection of tissue. In our dataset, no specific tools can be tied to this action; the surgeon uses two or more blunt tools.

Sling:

This class contains scenes of separation of the uterus for extraction. The electrical sling itself has an insulation which may look just like a special kind of suture. The coarse procedure of this surgical action is (i) introduction of the sling, (ii) positioning around the cervix, and eventually (iii) thermal dissection. The thermal dissection features a significant amount of smoke. After this dissection, coagulation and suturing are generally required.

Coagulation:

These scenes show coagulation by electro-surgical methods and feature medium to high emission of smoke. The instruments used for this action vary; for example, surgeons can use graspers or scissors, which implies an additional difficulty for the classification of such scenes.

Injection:

These scenes feature the injection of liquid into the patient's tissue in order to minimize trauma. The injection needle is visible as a thin, straight piece of shiny rounded metal. The tissue around the tip of the needle typically swells after the injection.

Uterus:

The uterus is the main organ of interest during myoma resection. In endometriosis treatment, the uterus can also be of interest in the adenomyosis disease pattern. The video sequences of the class Uterus feature an inspection of the uterus.

Ovary and Oviduct:

These classes are again of diagnostic nature. They feature image frames of clearly visible ovaries and oviducts. They are especially important for endometriosis and the diagnosis of adhesions.

Liver and Colon:

These two organs are also inspected during endometriosis diagnosis and treatment.

Out of 111 raw gynecological surgery videos, we manually annotated 1,105 shots consisting of 822,918 video frames, resulting in about 9 h of annotated video scenes. As mentioned above, the annotators were trained by medical experts, and the annotated scenes were partly checked by the experts. Tables 1 and 2 give an overview of the annotated medical video database, including class ID, class name, and a short semantic description for each action and anatomy class. Moreover, they contain information about the distribution of annotations on a per-class basis, i.e., the number of annotated shots, number of annotated frames, average scene duration, and standard deviation. The most frequent actions observed in this dataset are Suction & Irrigation, Coagulation, and Cutting (Cold). Suture is the leading class in terms of annotated video duration. On average, suturing scenes have the longest duration, while scenes of Cutting (Cold) are the shortest. The variance within the individual classes arises from surgical circumstances, such as intervention complications or patient anatomy. Due to the high variance of video sequence length (class-wise compared to the average duration), no statistically significant conclusions can be drawn from the individual scene lengths.

Table 1 An overview of the annotated dataset of surgical actions: class ID, class name, number of shots, number of frames, average duration in seconds, standard deviation of duration in seconds, and class description
Table 2 An overview of the annotated dataset of anatomical structures: class ID, class name, number of shots, number of frames, average duration in seconds, standard deviation of duration in seconds, and class description
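For illustration, the per-class figures reported in Tables 1 and 2 can be derived from the shot annotations roughly as follows. This is a sketch; the frame rate (assumed to be 25 fps here) and the field names are assumptions for illustration, not values taken from our tooling.

    import statistics
    from collections import defaultdict

    FPS = 25.0  # assumed frame rate for converting frame counts to durations

    def class_statistics(shots):
        """shots: iterable of (class_label, n_frames) pairs, one per annotated shot."""
        frames_per_class = defaultdict(list)
        for label, n_frames in shots:
            frames_per_class[label].append(n_frames)
        stats = {}
        for label, frame_counts in frames_per_class.items():
            durations = [n / FPS for n in frame_counts]  # shot durations in seconds
            stats[label] = {
                "shots": len(frame_counts),
                "frames": sum(frame_counts),
                "avg_duration_s": statistics.mean(durations),
                "std_duration_s": statistics.pstdev(durations),  # population std. dev.
            }
        return stats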

4 Frame-based shot classification

In this work, we focus on the feasibility of endoscopic shot classification of laparoscopic surgery videos in gynecology with CNNs. Moreover, we investigate how end-to-end trained CNNs with a problem-specific classification output layer perform compared with off-the-shelf CNN features.

Therefore, we use a single-frame scene classification model, allowing us to investigate the influence of different network architectures and the quality of extracted high-level CNN features used as input for SVMs. We base our shot classification on two different network architectures: AlexNet [7] and GoogLeNet [25], which are designed for general-purpose image classification and trained for the 1,000 classes of the ILSVRC dataset. AlexNet expects input image patches of 227 × 227 pixels. It consists of five convolutional layers, max pooling, local response normalization, dropout, and three fully connected layers. The last fully connected layer is task-specific; thus, for our experiments, the number of output neurons is set to 5 and 8 for the anatomy and action models, respectively. Apart from this, the network structure remains unaltered. The GoogLeNet architecture features inception modules with dimensionality reduction. In total, there are 22 parameterized layers and five pooling layers. Below the stacked inception modules (each reducing the image resolution), there is a convolutional low-level feature extraction stage expecting input patches of 224 × 224 pixels. The end of the network features a fully connected layer. Analogous to the procedure with AlexNet, the network architecture remains unchanged except for the adaptation of the classification layer.
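The only architectural change can be sketched with Caffe's Python NetSpec interface; this is a minimal, hedged illustration in which the layer and blob names are placeholders, and the reference AlexNet/GoogLeNet definitions are otherwise used unchanged.

    from caffe import layers as L

    def classification_head(features, label, num_classes):
        """Task-specific head replacing the original 1,000-way ILSVRC layer.

        num_classes is 5 for the anatomy models and 8 for the action models;
        'features' is the last unchanged blob (e.g. fc7 in AlexNet)."""
        fc_out = L.InnerProduct(features, num_output=num_classes,
                                weight_filler=dict(type='gaussian', std=0.01),
                                bias_filler=dict(type='constant', value=0))
        loss = L.SoftmaxWithLoss(fc_out, label)
        return fc_out, loss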

We prepare the video database for training and evaluation by extracting a square center crop of each video frame and resizing it to 256 × 256 pixels. Thus, we save computational resources for resizing and cropping at training time. We furthermore split the endoscopic video dataset into a test and a training set for each of the anatomy and action subsets. For the split, we chose the test set to contain approximately 10% of the annotations. To ensure a diverse test set, we set a minimum number of images per class: for the anatomy subset, we include at least 500 unique frames per class in the test set, and for the action subset, at least 5,000 unique frames. The anatomy test set thus comprises 6,874 unique frames, and the action test set comprises 57,205 unique frames. The remaining video frames are used to generate the training set. Please note that (as apparent from Tables 1 and 2), for both the action and anatomy subsets, the distribution of the number of scenes and frames is highly imbalanced. For example, Suture is a frequent action and features long scene durations; we therefore have a high number of suturing frames in the database. On the other hand, there are actions such as Blunt Dissection featuring a very small number of unique frames. For the test set, this imbalanced distribution perfectly models our use case, as the frequently occurring classes are tested more thoroughly. For the training set, we remove this imbalance by a combination of undersampling (dropping frames randomly from the training set) and naive oversampling (duplicating frames randomly). To create the training set, we set the number of training examples per class to 100,000 images for the action subset and 10,000 images for the anatomy subset. We define classes containing more unique images than the per-class training set size as overrepresented; otherwise, a class is underrepresented. For overrepresented classes, we choose the corresponding number of images uniformly at random without replacement. The data loss is negligible, as we are dropping many near-duplicate images. For underrepresented classes, we choose images uniformly at random with replacement. We ensure that each annotated image is included in this process by pre-filling the training set with one copy of each unique image of the underrepresented classes. This process resulted in 50,000 training images (generated from 33,732 unique images) for the anatomy model and 800,000 training images (generated from 486,771 unique images) for the action model.
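The construction of the balanced training set can be sketched as follows. This is a simplified illustration of the sampling logic described above; the frame identifiers and bookkeeping are placeholders.

    import random

    def balanced_training_set(frames_by_class, target_per_class, seed=0):
        """frames_by_class: dict mapping class label -> list of unique frame identifiers.

        Overrepresented classes are undersampled (drawn without replacement),
        underrepresented classes are naively oversampled (drawn with replacement)
        after keeping one copy of every unique frame, so that no annotated image
        of an underrepresented class is lost."""
        rng = random.Random(seed)
        training_set = []
        for label, frames in frames_by_class.items():
            if len(frames) >= target_per_class:
                # undersampling: choose without replacement
                chosen = rng.sample(frames, target_per_class)
            else:
                # oversampling: pre-fill with every unique frame, then draw the
                # remainder uniformly at random with replacement
                chosen = list(frames)
                chosen += [rng.choice(frames)
                           for _ in range(target_per_class - len(frames))]
            training_set += [(frame, label) for frame in chosen]
        rng.shuffle(training_set)
        return training_set

    # target_per_class = 100000 for the action subset, 10000 for the anatomy subset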

For the implementation of the machine learning approaches (CNN and SVM), we use Caffe [5] and OpenCV [14]. At training time, we feed the network image patches of its expected size (224-pixel squares for GoogLeNet, 227-pixel squares for AlexNet). These image patches are crops chosen at random from the 256 × 256 pixel training images. As additional data augmentation, we also use Caffe's mirror feature at training time. For optimization, we use the Adam method with an initial learning rate of 0.001 and momentum parameters 0.9 and 0.999. Other hyperparameters such as weight decay are not altered from the respective values shipped with the AlexNet and GoogLeNet models. The training is performed on a machine featuring an Intel(R) Core(TM) i7-5960X CPU at 3.00 GHz, 64 GB of DDR4 RAM, a Samsung SSD 850 Pro, and an NVIDIA GeForce GTX TITAN X graphics card. For AlexNet, we use a batch size of 100 images; for GoogLeNet, the batch size is set to 50 images. For both AlexNet and GoogLeNet, we train action and anatomy models from scratch. This system takes approximately ten days to train all models and SVMs. The training loss and validation performance of the CNNs are depicted in Fig. 3 for the anatomy models and in Fig. 4 for the action models. The x-axis shows the training epoch; the y-axis shows loss and accuracy, respectively. At each epoch, we measure the average loss of the epoch and the validation performance. For the anatomy models, the loss and accuracy curves bottom out after approximately 10 epochs. In the surgical action models, the training loss for the GoogLeNet network rises after 2 epochs; longer training of AlexNet has the same effect. The accuracy of the models also drops with a higher number of epochs. We attribute this behavior to overfitting. For anatomical structures, this is less of a problem, as the individual classes are less diverse. We select the models for evaluation with respect to the lowest training loss and highest training accuracy.
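For illustration, the offline preprocessing and the training-time augmentation correspond roughly to the following steps. This is a sketch using OpenCV; in our setup, the random cropping and mirroring are performed internally by Caffe's data layer.

    import random
    import cv2

    def preprocess_offline(frame):
        """Offline step: square center crop of the raw video frame, resized to 256 x 256."""
        h, w = frame.shape[:2]
        side = min(h, w)
        y0, x0 = (h - side) // 2, (w - side) // 2
        crop = frame[y0:y0 + side, x0:x0 + side]
        return cv2.resize(crop, (256, 256))

    def augment_training_patch(image, patch_size, mirror=True):
        """Training-time step: random crop (227 for AlexNet, 224 for GoogLeNet)
        plus optional horizontal mirroring, as performed by Caffe's data layer."""
        y0 = random.randint(0, 256 - patch_size)
        x0 = random.randint(0, 256 - patch_size)
        patch = image[y0:y0 + patch_size, x0:x0 + patch_size]
        if mirror and random.random() < 0.5:
            patch = patch[:, ::-1]  # flip horizontally
        return patch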

Fig. 3

Loss and accuracy for anatomy models based on AlexNet and GoogLeNet CNN architectures for 50 epochs

Fig. 4

Loss and accuracy for action models based on AlexNet and GoogLeNet CNN architectures for the first 15 epochs

For the SVM learning process, we classify the training set with the AlexNet model using our own weights as well as with off-the-shelf weights pre-trained for ImageNet classification. We extract feature vectors from three different locations in the network, namely the vector of class probabilities, the layer fc7, and the layer fc6, and use them as input for SVM training and testing. For simplicity, we refer to these vectors as class, fc7, and fc6, respectively. We use OpenCV's C_SVC, which enables n-class classification with a penalty multiplier for outliers. We do not set specific weights per class, thus treating misclassification of each class equally. This approach is reasonable, as we use a balanced training set. We use a linear SVM kernel, as this kernel worked best in preliminary studies. As termination criterion, we set the maximum number of iterations to 1,000 and the tolerance to 10^-6.
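A condensed sketch of this feature-extraction and SVM-training pipeline in Python is given below. The blob names fc6, fc7, and prob follow the standard Caffe AlexNet deploy definition; the file paths are placeholders, and the cv2.ml calls are the Python counterparts of OpenCV's C_SVC interface.

    import caffe
    import cv2
    import numpy as np

    # 'deploy.prototxt' / 'weights.caffemodel' are placeholders for the AlexNet
    # definition and either our own trained weights or the off-the-shelf ILSVRC weights.
    net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

    def extract_feature(patch, blob='fc6'):      # blob in {'fc6', 'fc7', 'prob'}
        """Forward one preprocessed patch (float32, shape (3, 227, 227)) and
        return the chosen high-level feature vector."""
        net.blobs['data'].data[0] = patch
        net.forward()
        return net.blobs[blob].data[0].copy()

    def train_svm(features, labels):
        """Train a linear multi-class SVM (C_SVC) on the extracted feature vectors."""
        svm = cv2.ml.SVM_create()
        svm.setType(cv2.ml.SVM_C_SVC)
        svm.setKernel(cv2.ml.SVM_LINEAR)
        svm.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS,
                             1000, 1e-6))
        svm.train(np.float32(features), cv2.ml.ROW_SAMPLE, np.int32(labels))
        return svm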

5 Evaluation

For evaluation, we use the trained AlexNet and GoogLeNet models for action and anatomy classification as well as SVM classifiers trained on the high-level CNN feature vectors fc6, fc7, and class extracted from the AlexNet architecture. As weights, we use off-the-shelf weights that were trained for ImageNet classification. In order to compare the predictive performance of the networks and the SVM approach, we use class-based precision and recall as well as average precision and average recall values over all classes. Evaluating precision and recall in a class-based manner has the advantage that the imbalance of the classes in the test set is taken into account. For the calculation of precision, recall, and f-value of class i, we determine TP_i (the number of true positive classifications for class i), FP_i (the number of false positive predictions for class i), and FN_i (the number of false negative predictions for class i). We also calculate the probability that the true class is among the top three predictions. We refer to this probability as Recall@3, which we cannot evaluate for the SVM approach, as the OpenCV interface does not allow for it.
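The evaluation measures can be computed as sketched below; a minimal illustration assuming per-class softmax scores are available for the Recall@3 computation.

    import numpy as np

    def per_class_metrics(y_true, y_pred, num_classes):
        """Per-class precision, recall, and f-value from predicted labels."""
        metrics = []
        for c in range(num_classes):
            tp = np.sum((y_pred == c) & (y_true == c))
            fp = np.sum((y_pred == c) & (y_true != c))
            fn = np.sum((y_pred != c) & (y_true == c))
            precision = tp / (tp + fp) if tp + fp > 0 else 0.0
            recall = tp / (tp + fn) if tp + fn > 0 else 0.0
            f_value = (2 * precision * recall / (precision + recall)
                       if precision + recall > 0 else 0.0)
            metrics.append((precision, recall, f_value))
        return metrics

    def recall_at_3(y_true, scores):
        """Fraction of samples whose true class is among the top-3 predictions."""
        top3 = np.argsort(scores, axis=1)[:, -3:]
        return np.mean([t in row for t, row in zip(y_true, top3)])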

For the evaluation, we additionally create a separate validation set consisting of approximately 70,000 frames by choosing five representative scenes per class. Please note that these scenes are neither in the training nor in the test set. Thus, this additional set validates the generalization capabilities of the approaches. The validation set sizes for action and anatomy are 50,988 and 21,568 images, respectively. For the class distribution within the validation set, please refer to Table 3.

Table 3 Overview of the validation dataset

For a detailed and class-based performance overview, please consult Table 4 for the surgical action classification and Table 5 for anatomical structure classification.

Table 4 Detailed evaluation results for the action subset
Table 5 Detailed evaluation results for the anatomy subset

On average, GoogLeNet achieves the best results for surgical action classification in terms of recall, precision, f-value, and Recall@3. However, there are classes where other approaches work better. For example, AlexNet is better at classifying Coagulation. We attribute this to the fact that tissue after coagulation and after cutting with a monopolar needle device looks very similar and can be distinguished only by the instruments used (which are not visible in every frame of the scenes and also appear frequently in other scenes). GoogLeNet is more likely than AlexNet to interpret these instruments as belonging to other scenes. The SVM approach using layer fc6 is better at the classes Injection as well as Suction & Irrigation. These two classes are special, as they feature the most reflections. We assume that features from AlexNet trained on the ILSVRC dataset capture reflections better than models trained on a database in which reflections occur constantly.

For anatomical structure classification, GoogLeNet also dominates the average performance in terms of recall, precision, f-value, and Recall@3. Interestingly, if we look at Recall@3, AlexNet slightly surpasses GoogLeNet for the Colon, Ovaries, and Uterus classes. The other two classes, Oviduct and Liver, are dominated by GoogLeNet. Considering the small number of anatomical structure classes, Recall@3 is not very expressive for the anatomy subset when the differences are as small as in the cases where GoogLeNet performs worse than AlexNet. In terms of f-value, the combination of precision and recall, GoogLeNet dominates in all but the Liver class, where the SVM approach using fc7 features leads with a value of 0.909 compared to 0.879. The same approach yields good recall for the class Uterus. With a value of 0.874, the features of the fc6 layer also provide good precision for Oviduct classification.

Our results further imply that introducing an additional SVM classifier does not improve prediction results on average compared to using a more sophisticated neural network. The off-the-shelf feature approach loses performance in terms of per-class recall and mean precision compared to the GoogLeNet CNN. Interestingly, for actions, the lower-level layer fc6 works better than the more abstract features fc7 and class, which achieve very poor performance. For anatomical structures, the layer fc7 works best out of the three evaluated features used as SVM input. Overall, we observe that the GoogLeNet architecture is superior to the AlexNet architecture and the SVM classifiers.

Fig. 5

Confusion matrices for action models based on AlexNet and GoogLeNet CNN architectures

Hence, this gives a strong indication that improvements of CNN methods in the general domain of image classification lead to improvements in the specialized domain of laparoscopic surgery image classification. Also, off-the-shelf features from AlexNet combined with linear SVMs slightly outperform AlexNet trained from scratch when the right layer is chosen. We attribute this to the training set: it is correctly annotated, but not fully noise-free when considering individual images. Comparing surgical action to anatomical structure classification performance, it is obvious that anatomical structures are classified much better overall. We attribute this to the very complex nature of surgical action scenes compared to the more static scenes featuring anatomical structures, and to the models being agnostic to the temporal dimension.

We visualize the per-class performance of the individual approaches for the surgical action and anatomical structure classes using confusion matrices, depicted in Fig. 5 for the CNN action models and Fig. 6 for the CNN anatomy models. SVM confusion matrices are given in Fig. 7 for action classification and Fig. 8 for anatomy classification. Columns denote the predicted class, while rows indicate the true class. Cell shades illustrate the prediction percentage relative to the number of examples of the true class.
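The row-normalized confusion matrices can be computed as follows; a minimal sketch in which each cell holds the fraction of examples of the true class (row) predicted as the column's class.

    import numpy as np

    def confusion_matrix(y_true, y_pred, num_classes):
        """Row-normalized confusion matrix: rows = true class, columns = prediction."""
        cm = np.zeros((num_classes, num_classes), dtype=np.float64)
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1
        row_sums = cm.sum(axis=1, keepdims=True)
        return cm / np.maximum(row_sums, 1)  # avoid division by zero for empty rows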

Fig. 6

Confusion matrices for anatomy models based on AlexNet and GoogLeNet CNN architectures

Fig. 7

Confusion matrices for SVM action classification using AlexNet features

Fig. 8

Confusion matrices for SVM anatomy classification using AlexNet features

The CNN and SVM action models perform poorest for the classes Coagulation, Cutting (cold), and Suction & Irrigation. We attribute this to the fact that the single-frame CNN models have limited means to model the way the instruments are used. For the CNN and SVM anatomy models, there is a bias towards confusing the classes Ovaries, Uterus, and Oviduct. We attribute this to the fact that these organs are spatially very close to each other, so when one of them is in the image, it is likely that parts of the other classes are visible as well.

6 Conclusion

In this paper, we investigate CNN-based single-frame classification models for video shots in gynecological surgery. Together with medical experts, we provide a first taxonomy of important anatomical structures and surgical actions of interest for the domain of laparoscopic videos in gynecology. For this domain, we build a dataset of 9 h of video data manually extracted from 111 different medical interventions. In particular, we train two different CNN architectures, AlexNet and GoogLeNet, from scratch for both surgical action and anatomical structure classification. Furthermore, we investigate an SVM approach using off-the-shelf neural network features from AlexNet: class, fc7, and fc6. The best results of the SVM approach, using features extracted from AlexNet with off-the-shelf weights, outperform the full AlexNet CNN trained from scratch for both anatomical structure and action classification, which might originate in the choice to label the database scene-wise rather than on a per-frame basis. Moreover, GoogLeNet, the better-performing architecture on general images, is also the best-performing approach in this domain. These results imply that advances in general image classification can lead to advances in difficult expert domains, such as our use case of gynecological surgery video classification.

Although this domain is rather narrow, there is plenty of future work to do. We think a per-pixel classification approach for anatomical structures could yield more accurate results for structures that are spatially close to each other. Further examples of future work include the evaluation of more sophisticated approaches for video classification, such as frame fusion models or LSTM-based models. Also, the question of whether we can surpass human performance by adding more network depth remains open. However, we think that the classification of surgical actions provides the most benefit for surgeons and therefore focus on the following point. We assume that the capabilities of the single-frame CNN models AlexNet and GoogLeNet are not fully utilized. Hence, we aim to improve surgical action classification by using early fusion of raw image data with multiple (domain-specific) modalities, of which at least one represents a temporal dimension, such as motion vectors.