Learning laparoscopic video shot classification for gynecological surgery
Videos of endoscopic surgery are used for education of medical experts, analysis in medical research, and documentation in everyday clinical life. Hand-crafted image descriptors lack the capability to semantically classify surgical actions and video shots of anatomical structures. In this work, we investigate how well single-frame convolutional neural networks (CNNs) perform at semantic shot classification in gynecologic surgery. Together with medical experts, we manually annotate hours of raw endoscopic gynecologic surgery videos showing endometriosis treatment and myoma resection of over 100 patients. The cleaned ground truth dataset comprises 9 h of annotated video material (from 111 different recordings). We use the well-known CNN architectures AlexNet and GoogLeNet and train them from scratch for both surgical actions and anatomy. Furthermore, we extract high-level features from AlexNet with weights from a pre-trained model from the Caffe model zoo and feed them to an SVM classifier. Our evaluation shows that we reach an average recall of .697 and .515 for classification of anatomical structures and surgical actions, respectively, using off-the-shelf CNN features. Using GoogLeNet, we achieve a mean recall of .782 and .617 for classification of anatomical structures and surgical actions, respectively. With AlexNet, the achieved recall is .615 for anatomical structures and .469 for surgical action classification. The main conclusion of our work is that advances in general image classification methods transfer to the domain of endoscopic surgery videos in gynecology. This is relevant because this domain differs from natural images, e.g. it is characterized by smoke, reflections, and a limited range of colors.
Keywords: Video classification, Deep learning, Convolutional neural network
In recent years, endoscopic surgery procedures as well as imaging technology have advanced rapidly. These advances enable physicians to perform minimally invasive surgeries. As a side effect, the recorded surgery videos benefit the surgeons' work, as they provide a solid basis for documentation, training of young surgeons, and medical research. Prior work supporting these aims has been conducted by our research group in the area of endoscopic video analysis, such as a subjective quality assessment of the impact of compression on the perceived semantic quality, instrument classification in laparoscopic videos, and extraction and linking of endoscopic key-frames to videos [3, 23]. In this work, we restrict ourselves to a very specific field of minimally invasive surgery in the context of gynecology. In particular, we base our work on videos showing surgical treatment of endometriosis and myoma resection. Our aim is to lay a baseline for (semi-)automatic documentation of these surgical interventions. To this end, we want to achieve semantic classification of video shots displaying surgical tasks and various anatomical structures relevant to gynecological surgery. Standard hand-crafted features lack the expressive power for high-level classification in this domain. In contrast, CNNs have been used successfully for such problems in general image and video domains [7, 25]. Multiple models have been proposed for semantic classification of video shots, i.e. single frame, early fusion, late fusion, and slow fusion. The importance of deep learning in medical image analysis, and of content-based processing and analysis of endoscopic images and video, is also apparent from the work of Litjens et al. and Muenzer et al., respectively.
How well do CNN-based single-frame models for semantic shot classification in the field of gynecological surgery, a special domain of laparoscopic surgery, perform?
This work is novel, as there is no prior comparison of different CNN models and SVM classifiers using CNN-extracted features for the use case of shot classification in gynecologic surgery. We expect that advances in the general domain transfer to our specialized use case. In particular, we expect GoogLeNet to achieve a better predictive performance than AlexNet. Furthermore, we expect that the off-the-shelf CNN features do not work as well for classification as the CNN models do. Another contribution of this work is a detailed discussion of important semantic content classes in the expert domain of minimally invasive gynecologic surgery. This is relevant to colleagues working in the field of medical video analysis. The remainder of this paper is structured as follows. First, we discuss related work in medical imaging on the topics of computer-aided diagnosis, transfer learning, and semantic video classification. In Section 3, we describe the data annotation process as well as the data used for training and testing the CNN models and SVM. Details of the learning process are presented in Section 4. We evaluate the results in Section 5, and we draw conclusions and outline possible future work in Section 6.
2 Related work
For the use case of classifying interstitial lung diseases, Li et al. provide a simple CNN model containing a single convolutional layer. They achieve per-class precision and recall between 0.8 and 0.9 for classification into five classes (normal, emphysema, ground glass, fibrosis, and micro-nodules), outperforming the SIFT feature as well as Restricted Boltzmann Machines. Anthimopoulos et al. propose a deep CNN model containing five convolutional layers for the classification of CT images into seven classes of interstitial lung diseases (healthy, ground glass opacity, micronodules, consolidation, reticulation, and honeycombing). Their results imply that, for this use case, their CNN approach outperforms other CNNs as well as state-of-the-art methods using handcrafted features. In the work of Yan et al., a multi-stage deep learning framework is presented, with which the authors address the problem of body-part recognition in MRI images. Overall, they achieve the best performance regarding recall, precision, and f-score compared against logistic regression, SVMs, and CNNs. The importance of CNNs in medical applications is also apparent from their use in other applications such as nucleus segmentation, polyp detection in colonoscopy videos, microcalcification detection in digital breast tomosynthesis, mitosis detection in breast cancer histology, and short-term breast cancer risk prediction. Our work differs from the aforementioned research: rather than classifying a state (e.g., healthy or consolidation, type of tissue), we aim at classifying both anatomical structures and surgical actions. Furthermore, no efforts have been made regarding the classification of images extracted from laparoscopic surgery videos. Fine-tuning and transfer learning effects of CNNs are covered in recent literature by Shin et al. as well as Tajbakhsh et al.
These works cover the use cases of lymph node detection, interstitial lung disease classification, polyp detection and image quality assessment in colonoscopy, pulmonary embolism detection in computed tomography images, and intima-media boundary segmentation in ultrasonographic images. Their results imply that CNNs are suitable for computer-aided diagnosis problems, and that transfer learning from large-scale annotated natural image datasets is beneficial for performance (which, according to our preliminary studies, does not apply to the problem of scene classification). For colonic polyp classification, Ribeiro et al. proposed transfer learning using off-the-shelf CNN features. Based on high-level CNN features (from CNNs trained for object recognition), Ng et al. use semantic Fisher vectors for semantic classification of natural video scenes. Their results reach state-of-the-art performance on the MIT Indoor and SUN datasets. For a large-scale YouTube video dataset, Karpathy et al. give an overview of scene classification models based on CNNs, i.e. single frame, late fusion, early fusion, and slow fusion. Their results imply that the naive single-frame model (which is agnostic to temporal information) already provides strong performance despite its simplicity. Ng et al. compare single-frame models for scene classification with slow fusion and LSTM-based models. In the domain of cataract surgery videos, Quellec et al. propose a temporal segmentation and recognition of tasks. The temporal segmentation is based on the detection of idle phases, which is achieved by nearest neighbor search in a reference dataset. Primus et al. provide a video segmentation for endoscopic surgeries based on the analysis of spatial and temporal motion changes. For the use case of cholecystectomy, a special form of laparoscopic surgery, Primus et al. provide a rule-based method to temporally segment a surgery into different phases.
The number and kind of instruments in use (the recognition of which is the topic of their previous work) serve as the main indicator of a surgery phase. Shot boundary detection in cholecystectomy surgery videos using Gaussian Mixture Models and a Variational Bayesian Algorithm is investigated by Loukas et al. The work of Twinanda et al. also focuses on the use case of cholecystectomy. They successfully apply CNNs, SVMs, and HHMMs for the detection of surgical phases. The classification we envision differs from the use cases mentioned above: in cholecystectomy there are predefined surgical phases, whereas in other fields of laparoscopic surgery (such as gynecology) there is no general consensus on such surgical phases. Moreover, we do not aim at defining shot boundaries. The work most closely related to this one is our own, in which we performed an exploratory investigation of shot classification in the laparoscopic surgery domain. However, we made no distinction between surgical actions and anatomical structures, which resulted in poor performance in the anatomical structure classes.
3 Laparoscopic gynecology video database
For this work, we analyze 111 different gynecological surgery videos. These videos contain scenes of laparoscopic endometriosis treatment and laparoscopic myoma resection and have a duration in the range of 20 min to 6 h. Analysis and discussion with medical experts for gynecology at the regional hospital (LKH) Villach (Austria) have resulted in the identification of two main aspects for the individual scenes: action and anatomy.
3.1 Annotation process
We derive the best matching class for a single shot implicitly from the camera position and the current action, e.g., the action in the center of the image or the organ inspected by the surgeon is the action or object of interest. With the surgical action classes, there is the issue that a shot is likely to contain frames that could be classified as a diagnosis class as well. For example, a shot of suturing the ovary may contain images showing the ovary without a surgical needle, or images in which the suture is not clearly visible. On the one hand, such a frame does not look like it belongs to a suturing shot; on the other hand, it does belong to the suturing shot, as the image has been recorded in that context. For the annotation of our dataset, we choose the latter view and annotate such frames as the surgical task by defining the beginning and end of the surgical action. Each frame from the beginning until the end of a shot is labeled with the corresponding shot label for the class it belongs to. As a consequence, the dataset may also contain blurry frames or frames in which instruments cover large parts of the camera's view. We argue that these frames are nonetheless part of the corresponding shot and thus correctly labeled. Prior to the annotation process, our annotators have been trained by medical experts. The annotations are cross-validated by a single annotator and trimmed in length or corrected when necessary. We do not filter blurry or irrelevant frames, as we are interested in a baseline evaluation without any preprocessing (except for resizing and center cropping) of the raw video frames. Thus, we leave the temporal dependencies within the annotated scenes intact.
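The shot-to-frame labeling rule described above can be illustrated with a minimal sketch. The `Shot` record, the `label_frames` helper, and the toy frame indices are hypothetical names and values chosen for illustration; the annotation tooling actually used is not described in the text.

```python
from typing import List, NamedTuple, Optional


class Shot(NamedTuple):
    """Hypothetical annotation record: one labeled shot of a video."""
    begin: int   # index of the first frame of the shot (inclusive)
    end: int     # index of the last frame of the shot (inclusive)
    label: str   # class label assigned to the whole shot


def label_frames(shots: List[Shot], num_frames: int) -> List[Optional[str]]:
    """Give every frame inside a shot the label of that shot.

    Frames outside any annotated shot stay unlabeled (None). No frame is
    filtered out, so blurry or occluded frames inside a shot keep its label.
    """
    labels: List[Optional[str]] = [None] * num_frames
    for shot in shots:
        if not 0 <= shot.begin <= shot.end < num_frames:
            raise ValueError(f"shot {shot} lies outside the video")
        for frame in range(shot.begin, shot.end + 1):
            labels[frame] = shot.label
    return labels


# Toy example: a 10-frame video with two annotated shots.
shots = [Shot(2, 4, "Suturing"), Shot(7, 8, "Coagulation")]
frame_labels = label_frames(shots, num_frames=10)
```

In this sketch, a frame in the middle of the suturing shot receives the label "Suturing" even if no needle is visible in it, which mirrors the annotation choice made above.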
3.2 Semantic content classes
- Suction & Irrigation.
These scenes feature the use of the suction and irrigation tube. Irrigation serves to clean tissue in order to provide a clear field of view for the surgeon. Its main visual feature is a jet of liquid. Suction is the opposite of irrigation: it is used to remove liquids. Classification problems in this class arise whenever the suction and irrigation tube is used for positioning tissue or for palpation.
- Suturing.
The main characteristic of suturing scenes is the visible surgical needle and the suture. In general, the surgical needle can be of round or straight shape. During the process of suturing, the surgical needle is often only partially visible, if at all. The suture can vary in type, thickness, and color. An additional characteristic of these scenes is the use of the knot pusher, which is preceded by a scene in which the suture and low motion are visible.
- Cutting (cold).
Scenes of cold cutting, as annotated in this dataset, feature the separation of tissue with a sharp instrument, such as a scalpel or scissors. Characteristic of this type of scene is the use of multiple instruments: the instrument used for the dissection itself (e.g. scissors) and a grasper for fixation of tissue. This characterization applies to cutting and blunt dissection as well.
- Cutting (thermal).
These cutting scenes show surgical separation of tissue by means of electro-surgery technology such as mono-polar needles. Occasionally, a bright dot can be seen at the top of the instrument. A low to medium emission of smoke emerges from coagulated and separated tissue.
- Dissection (blunt).
Blunt dissection scenes feature the use of blunt instruments for the dissection of tissue. In our dataset, no specific tools can be bound to this action – the surgeon uses two or more blunt tools.
- Sling.
This class contains scenes of separation of the uterus for extraction by means of an electrical sling. The electrical sling itself has an insulation which may look just like a special kind of suture. The rough procedure of this surgical action is (i) introduction of the sling, (ii) positioning around the cervix, and finally (iii) thermal dissection. The thermal dissection features a significant amount of smoke. After this dissection, coagulation and suturing are generally required.
- Coagulation.
These scenes show coagulation by electro-surgical methods. They feature medium to high emission of smoke. The instruments used for this action vary. For example, surgeons can use graspers or scissors, which poses an additional difficulty for the classification of such scenes.
- Injection.
These scenes feature the injection of liquid into the patient's tissue in order to minimize trauma. The injection needle is visible as a thin, straight piece of shiny rounded metal. The tissue around the tip of the needle typically inflates after the injection.
- Uterus.
The uterus is the main organ of interest during myoma resection. In endometriosis treatment, the uterus can also be of interest in the adenomyosis disease pattern. The video sequences of the class uterus feature an inspection of the uterus.
- Ovary and Oviduct.
These classes are again of a diagnostic nature. They feature image frames in which the respective structure is clearly visible. They are especially important for endometriosis and for the diagnosis of adhesions.
- Liver and Colon.
These two organs are also inspected during endometriosis diagnosis and treatment.
Table 1 An overview of the annotated dataset with surgical actions: class id, class name, number of shots, number of frames, average duration in seconds, standard deviation of duration in seconds, and class description
- Blunt dissection of tissue (e.g. by tearing it apart)
- Application of coagulation in order to close a wound
- Dissect tissue with a sharp instrument (e.g. scissors)
- Thermally dissect tissue (e.g. with monopolar electrodes)
- Dissection of large parts of tissue with an electrical sling
- Injection with a needle
- Suction & Irrigation: application of the suction and irrigation tube
- Process of suturing
Table 2 An overview of the annotated dataset with anatomical structures: class id, class name, number of shots, number of frames, average duration in seconds, standard deviation of duration in seconds, and class description
- Clearly visible colon
- Clearly visible liver
- Clearly visible ovary
- Clearly visible oviduct
- Clearly visible uterus
4 Frame-based shot classification
In this work, we focus on the feasibility of CNN-based shot classification of laparoscopic surgery videos in gynecology. Moreover, we investigate how end-to-end trained CNNs with a problem-specific classification output layer perform against off-the-shelf CNN features.
Therefore, we use a single-frame scene classification model, which allows us to investigate the influence of different network architectures and the quality of extracted high-level CNN features for the application of SVMs. We base our shot classification on two different network architectures, AlexNet and GoogLeNet, which are designed for general-purpose image classification and trained for the 1,000 classes of the ILSVRC dataset. AlexNet takes input image patches of 227 × 227 pixels. It consists of five convolutional layers, max pooling, local response normalization, dropout, and three fully connected layers. The last fully connected layer is task-specific. Thus, for our experiments, the number of output neurons is changed to 5 for the anatomy model and 8 for the action model. Apart from this, the network structure remains unaltered. The GoogLeNet architecture features inception modules with dimensionality reduction. In total, there are 22 parametrized layers and five pooling layers. Below the stacked inception modules (each reducing the image resolution), there is a convolutional low-level feature extraction stage expecting input patches of 224 × 224 pixels. The end of the network features a fully connected layer. Analogous to the procedure with AlexNet, the network architecture remains unchanged except for the adaptation of the classification layer.
We prepare the video database for training and evaluation, which simply means that we extract a square center crop of each video frame and then resize it to 256 × 256 pixels. Thus, we save the computational resources needed for resizing and cropping at training time. We furthermore split the endoscopic video dataset into a test set and a training set, separately for the anatomy and the action images. For the split, we require the test set to contain approximately 10% of the annotations. To ensure a diverse test set, we set a minimum number of images per class. For the anatomy subset, this means that we include at least 500 unique frames per class in the test set; for the action subset, we include at least 5,000 unique frames per class. The anatomy test set thus comprises 6,874 unique frames, and the action test set comprises 57,205 unique frames. The remaining video frames are used to generate the training set. Please note that (as apparent from Tables 1 and 2) the distribution of the number of scenes and frames is highly imbalanced for both the action and anatomy subsets. For example, Suture is a frequent action with long scene durations, so the database contains a high number of suturing frames. On the other hand, there are actions such as Blunt Dissection with a very small number of unique frames. For the test set, this imbalanced distribution models our use case well, as the frequently occurring classes are tested more thoroughly. For the training set, we eliminate this imbalance by a combination of undersampling (dropping frames randomly from the training set) and naive oversampling (duplicating frames randomly). To create the training set, we set the number of training examples per class to 100,000 images for the action subset and 10,000 images for the anatomy subset. We call a class overrepresented if it contains more unique images than this per-class training set size; otherwise, the class is underrepresented.
For overrepresented classes, we choose the corresponding number of images uniformly at random from the remaining images without replacement. The data loss is negligible, as we are dropping many near-duplicate images. For underrepresented classes, we choose images uniformly at random with replacement. We ensure that each annotated image is included in this process by pre-filling the training set with one copy of every image of each underrepresented class. This process results in 50,000 training images (generated from 33,732 unique images) for the anatomy model and 800,000 training images (generated from 486,771 unique images) for the action model.
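The balancing procedure can be sketched as follows. This is a minimal illustration, not the original tooling: the function names, the seed, and the toy file names are hypothetical, and the pre-fill step is read here as adding one copy of every unique image of an underrepresented class before sampling with replacement.

```python
import random
from typing import Dict, List


def balance_class(images: List[str], target: int, rng: random.Random) -> List[str]:
    """Return exactly `target` training samples for one class."""
    if not images:
        raise ValueError("a class needs at least one image")
    if len(images) > target:
        # Overrepresented class: undersample without replacement.
        return rng.sample(images, target)
    # Underrepresented class: pre-fill with every unique image once
    # (assumed reading of the pre-fill step), then duplicate randomly
    # chosen images (sampling with replacement) up to the target size.
    duplicates = rng.choices(images, k=target - len(images))
    return list(images) + duplicates


def balance_dataset(images_by_class: Dict[str, List[str]],
                    target: int, seed: int = 0) -> Dict[str, List[str]]:
    """Balance all classes to the same number of training samples."""
    rng = random.Random(seed)
    return {label: balance_class(images, target, rng)
            for label, images in images_by_class.items()}


# Toy example with one frequent and one rare class (hypothetical names).
toy = {
    "Suture": [f"suture_{i}.png" for i in range(50)],
    "Blunt Dissection": [f"blunt_{i}.png" for i in range(4)],
}
balanced = balance_dataset(toy, target=10)
```

With the per-class sizes stated above, this corresponds to `target=100_000` for the action subset and `target=10_000` for the anatomy subset.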
For the SVM learning process, we classify the training set with the AlexNet model, both with our weights and with off-the-shelf weights which have been pre-trained for ImageNet classification. We extract feature vectors from three different locations of the network as input for SVM training and testing: the vector of class probabilities, the layer fc7, and the layer fc6. For simplicity, we refer to these vectors as class, fc7, and fc6, respectively. We use OpenCV's C_SVC, which enables n-class classification with a penalty multiplier for outliers. We do not set specific weights per class, so we treat misclassification of each class equally. This approach is reasonable, as we use a balanced training set. We use a linear SVM kernel, as this kernel worked best in preliminary studies. As termination criterion, we set the maximum number of iterations to 1,000 and the tolerance to 10^-6.
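For reference, the role of the penalty multiplier can be seen in the standard textbook form of the two-class soft-margin SVM with a linear kernel. The notation is an assumption introduced here and does not come from the text: x_i is a feature vector (for instance fc6), y_i ∈ {−1, +1} its label, w and b the parameters of the separating hyperplane, ξ_i the slack of sample i, and C the penalty multiplier. How the library combines such two-class problems into an n-class decision is not covered by this sketch.

```latex
\[
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,
\qquad i = 1, \dots, n
\]
```

A larger C penalizes samples that violate the margin more strongly, while a smaller C tolerates more outliers in exchange for a wider margin.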
For evaluation, we use the trained models of the AlexNet and GoogLeNet architectures for action and anatomy classification as well as SVM classifiers trained on the high-level CNN feature vectors fc6, fc7, and class from the AlexNet architecture. As weights, we use off-the-shelf weights that are trained for ImageNet classification. In order to compare the predictive performance of the networks and the SVM approach, we use class-based precision and recall as well as average precision and average recall values over all classes. Evaluating precision and recall in a class-based manner has the advantage that the imbalance of the classes in the test set is taken into account. For the calculation of precision, recall, and f-value of class i, we determine TPi (number of true positive predictions for class i), FPi (number of false positive predictions for class i), and FNi (number of false negative predictions for class i). We also calculate the probability that the true class is among the top three predictions. We refer to this probability as Recall@3, which we cannot evaluate for the SVM approach, as the OpenCV interface does not allow for that.
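The class-based measures can be sketched in a few lines. This is a minimal illustration under stated assumptions: the f-value is taken to be the harmonic mean of precision and recall, undefined ratios are set to 0, Recall@k is computed over all test samples, and the function names and toy scores are hypothetical.

```python
from typing import Dict, List, Tuple


def per_class_metrics(y_true: List[str], y_pred: List[str],
                      classes: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return (precision, recall, f-value) for every class i from TPi, FPi, FNi."""
    metrics = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Assumption: f-value is the harmonic mean of precision and recall.
        f_value = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        metrics[c] = (precision, recall, f_value)
    return metrics


def macro_average(metrics: Dict[str, Tuple[float, float, float]]) -> Tuple[float, ...]:
    """Average precision, recall, and f-value over classes (each class counts equally)."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


def recall_at_k(y_true: List[str], scores: List[Dict[str, float]], k: int = 3) -> float:
    """Fraction of samples whose true class is among the k highest-scored classes."""
    hits = 0
    for truth, sample_scores in zip(y_true, scores):
        top_k = sorted(sample_scores, key=sample_scores.get, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(y_true)


# Toy example with three classes and four frames (hypothetical scores).
classes = ["Uterus", "Liver", "Colon"]
y_true = ["Uterus", "Uterus", "Liver", "Colon"]
scores = [
    {"Uterus": 0.6, "Liver": 0.3, "Colon": 0.1},
    {"Uterus": 0.3, "Liver": 0.6, "Colon": 0.1},
    {"Uterus": 0.1, "Liver": 0.7, "Colon": 0.2},
    {"Uterus": 0.5, "Liver": 0.4, "Colon": 0.1},
]
y_pred = [max(s, key=s.get) for s in scores]
metrics = per_class_metrics(y_true, y_pred, classes)
avg_precision, avg_recall, avg_f = macro_average(metrics)
top2 = recall_at_k(y_true, scores, k=2)
```

Averaging the per-class values gives every class the same weight regardless of how many test frames it has, which is why these averages are less dominated by frequent classes such as Suture than plain accuracy would be.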
Overview of the validation data set
Detailed evaluation results for the action subset
Detailed evaluation results for the anatomy subset
On average, GoogLeNet achieves the best results for surgical action classification in terms of Recall, Precision, f-value, and Precision@3. However, there are classes where other approaches work better. For example, AlexNet is better at the classification of Coagulation. We think this originates in the fact that tissue after coagulation and after cutting with a monopolar needle device looks very similar and is distinguished only by the instruments used (which are not visible in every frame of the scenes and also appear frequently in other scenes). GoogLeNet is more likely than AlexNet to interpret these instruments as belonging to other scenes. The SVM approach using layer fc6 is better at the classes Injection and Suction & Irrigation. These two classes are special, as they feature the most reflections. We think that features from AlexNet trained on the ILSVRC dataset map reflections better than the models trained on a database where reflections occur constantly.
For anatomical structure classification, GoogLeNet also dominates the average performance in terms of Recall, Precision, f-value, and Precision@3. Interestingly, if we look at Recall@3, AlexNet slightly surpasses GoogLeNet in the Colon, Ovaries, and Uterus classes. The other two classes, Oviduct and Liver, are dominated by GoogLeNet. Given the small number of anatomical structure classes, Recall@3 is not very expressive for the anatomy subset, especially when the differences are as small as those we observe in the cases where GoogLeNet performs worse than AlexNet. In terms of f-value, the combination of precision and recall, GoogLeNet dominates in all but the Liver class, where the SVM approach using fc7 features dominates with a value of .909 compared to .879. The same approach yields good performance regarding recall for the class Uterus. With a value of .874, the features of the fc6 layer also provide good precision for Oviduct classification.
Our results further imply that introducing an additional SVM classifier does not, on average, improve prediction results over more sophisticated neural networks. The off-the-shelf feature approach loses performance in terms of recall per class and mean precision compared to the GoogLeNet CNN. Interestingly, for actions, the more basic layer fc6 works better than the more abstract features fc7 and class, which achieve very poor performance. For anatomical structures, the layer fc7 works best out of the three evaluated features used as SVM input. We observe that the GoogLeNet architecture is superior to the AlexNet architecture and the SVM classifiers.
Hence, this gives a strong indication that improvements of CNN methods in the general domain of image classification lead to improvements in the specialized domain of laparoscopic surgery image classification. Also, off-the-shelf features from AlexNet combined with linear SVMs slightly outperform AlexNet trained from scratch when the right layer is chosen. We think this originates in the training set: it is correctly annotated at the scene level, but not fully noise-free when individual images are considered. Comparing surgical action to anatomical structure classification, anatomical structure classification clearly performs much better overall. We think this originates in the very complex nature of surgical action scenes compared to the more static scenes featuring anatomical structures, and in the fact that the models are agnostic to the temporal dimension.
CNN and SVM action models perform poorest in the classes Coagulation, Cutting Cold, and Suction & Irrigation. We think this originates in the fact that the single-frame CNN models have limited means to model the way the instruments are used. For CNN and SVM anatomy models, there is a bias to confuse the classes Ovaries, Uterus, and Oviduct. We think this originates in the fact that these organs are spatially very close to each other, so when one of them appears in an image, parts of the others are likely to be visible as well.
In this paper, we investigate CNN-based single-frame classification models for video shots in gynecological surgery. Together with medical experts, we provide a first taxonomy of important anatomical structures and surgical actions of interest for the domain of laparoscopy videos in gynecology. For this domain, we build a dataset of 9 h of video data manually extracted from 111 different medical interventions. In particular, we train two different CNN architectures, AlexNet and GoogLeNet, from scratch for both surgical action and anatomical structure classification. Furthermore, we investigate an SVM approach using off-the-shelf neural network features from AlexNet: class, fc7, and fc6. The best results of the SVM approach, which uses features extracted from AlexNet with off-the-shelf weights, outperform the full AlexNet CNN trained from scratch in both anatomical structure and action classification. This might originate in the choice to label the database scene-wise and not on a per-frame basis. Moreover, GoogLeNet, the best-performing approach on general images, is also the best-performing approach in this domain. These results imply that advances in general image classification domains can lead to advances in difficult expert domains, such as our use case of gynecological surgery video classification.
Although this domain is quite narrow, there is plenty of future work to do. We think a per-pixel classification approach for anatomical structures could yield more accurate results for structures which are spatially close to each other. Further examples of future work include the evaluation of more sophisticated approaches for video classification, such as frame fusion models or LSTM-based models. Also, the question of whether we can surpass human performance by adding more network depth remains open. However, we think that classification of surgical actions provides the most benefit for surgeons and therefore focus on the following point. We assume that the capabilities of the single-frame CNN models used, AlexNet and GoogLeNet, are not fully utilized. Hence, we aim at an improvement of surgical action classification by using early fusion of raw image data with multiple (domain-specific) modalities, of which at least one represents a temporal dimension, such as motion vectors.
Open access funding provided by University of Klagenfurt. This work was supported by Universität Klagenfurt and Lakeside Labs GmbH, Klagenfurt, Austria and funding from the European Regional Development Fund and the Carinthian Economic Promotion Fund (KWF) under grant KWF 20214 u. 3520/ 26336/38165.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.