1 Introduction

In recent decades, medical endoscopy has emerged as a key technology for minimally invasive examinations in numerous body regions and for minimally invasive surgery in the abdomen, joints and other body regions. The term “endoscopy” derives from Greek and refers to methods to “look inside” the human body in a minimally invasive way. This is accomplished by inserting a medical device called an endoscope into the interior of a hollow organ or body cavity. Depending on the body region, the insertion is performed through a natural body orifice (e.g., for examination of the esophagus or the bowel) or through a small incision that serves as an artificial entrance. For surgical procedures (e.g., removal of the gallbladder), additional incisions are required to insert various surgical instruments. Compared to open surgery, this still causes much less trauma, which is one of the main advantages of minimally invasive surgery (also known as buttonhole or keyhole surgery). Many medical procedures were revolutionized by the introduction of endoscopy, and some were even enabled by this technology in the first place. For a more detailed insight into the history of endoscopy, please refer to [113] and [96].

Endoscopy is an umbrella term for a variety of very diverse medical methods. Its many sub-types have very different characteristics and can be classified according to several criteria, e.g.,

  • body region (e.g., abdomen, joints, gastrointestinal tract, lungs, chest, nose)

  • medical speciality (e.g., general surgery, gastroenterology, orthopedic surgery)

  • diagnostic vs. therapeutic focus

  • flexible vs. rigid construction form

Unfortunately, the term “endoscopy” is used inconsistently, both in common parlance and in the literature. It is often used as a synonym for gastrointestinal endoscopy (of the digestive system), which mainly includes colonoscopy and gastroscopy (examination of the colon and stomach, respectively). A further special type is Wireless Capsule Endoscopy (WCE). The patient has to swallow a small capsule that contains a tiny camera and transmits a large number of images to an external receiver while it travels through the digestive tract for several hours. The images are then assessed by the physician after the end of this process. WCE is especially important for examinations of the small intestine because neither gastroscopy nor colonoscopy can access this part of the gastrointestinal tract.

The major therapeutic sub-types are laparoscopy (procedures in the abdominal cavity) and arthroscopy (orthopedic procedures on joints, mainly knee and shoulder). They are often subsumed under the term “minimally invasive surgery”. Laparoscopic operations span different medical specialities, particularly general surgery, pediatric surgery, gynecology and urology. Examples of common laparoscopic operations are cholecystectomy (removal of the gallbladder), nephrectomy (removal of the kidney), prostatectomy (removal of the prostate gland) and the diagnosis and treatment of endometriosis. Further important endoscopy types are thoracoscopy (thorax/chest), bronchoscopy (airways), cystoscopy (bladder), hysteroscopy (uterus) and further special procedures in the fields of ENT (ear, nose, throat) and neurosurgery (brain).

In the course of an endoscopic procedure, a video signal is produced by the endoscopic camera and visualized to the surgical team to guide their actions. This inherently available video signal is predestined for automatic content analysis in order to assist the physician. Hence, numerous research communities have proposed methods to process and analyze it, either in real-time or for post-procedural usage. In both cases, image processing techniques are often used to pre-process individual video frames, be it to improve the performance of subsequent processing steps or simply to improve their visual quality. Pattern recognition and machine learning methods are used to detect lesions, polyps, tumors etc. in order to aid physicians in the diagnostic analysis. The robotics community applies Computer Vision algorithms for 3D reconstruction of the inner anatomical structure in combination with detection and tracking of operation instruments to enable robot-assisted surgery. In the context of Augmented Reality, endoscopic images are registered to pre-operative CT or MRI scans to provide guidance and additional context-specific information to the surgeon during the operation. For readers who want to learn more about the workflow in this field, we recommend the following tutorial papers [151, 152].

In recent years, we can observe a growing trend to record and store videos of endoscopic procedures, mainly for medical documentation and research. This new paradigm of video documentation has many advantages: it makes it possible to revisit the procedure at any time, it facilitates detailed discussions with colleagues as well as explanations to patients, it allows for better planning of follow-up operations, and it is a great source of information for research, training, education and quality assurance. The benefits of video documentation have been confirmed in numerous studies, e.g., [61, 100, 101, 137]. However, physicians can only benefit from endoscopic videos if they are easily accessible. This is where research in the Multimedia field comes in. Well-researched methods like content-based video retrieval, video segmentation, summarization, efficient storage and archiving concepts as well as efficient video interaction and browsing interfaces can be used to organize an endoscopic video archive and make it accessible for physicians. Because of their post-processing nature, these techniques are not constrained by immediate OR requirements and can therefore be applied in real-world scenarios much more easily than real-time assistance features. Nevertheless, they have to be adapted to the peculiarities of this very specific domain. Their practical relevance grows steadily as video documentation becomes more widespread. Once comprehensive video documentation is established as best practice and perhaps even becomes mandatory, these techniques will be essential cornerstones of Endoscopic Multimedia Information Systems.

As we can see, there are very diverse goals and perspectives on the domain of endoscopic video processing. This survey is intended to provide a broad overview of related research in this broad and heterogeneous field, whose parts are currently not perceived as belonging together. It also tries to point out common problems that might be easier to solve when considering findings of other fields. In an extensive literature search, more than 600 publications were identified. Based on titles and abstracts, we classified them into the following three main categories, which are described in the subsequent sections:

  1. pre-processing methods

  2. real-time support at procedure time

  3. post-procedural applications

Figure 1 illustrates the resulting categorization of research topics in the field of endoscopic image/video processing and analysis, representing the structure of the following sections as well. This classification should not be understood as the ultimate truth because many of the presented techniques and concepts have significant overlaps and cannot be sharply delimited. For example, the traditionally post-procedural application of surgical quality assessment is currently being ported to real-time systems and in this context could just as well be regarded as an application of Augmented Reality. Nevertheless, this categorization enables a structured and clear overview of the many topics that are covered in this review.

Fig. 1 Categorisation of publications in the field of endoscopic video analysis

2 Pre-processing methods

Endoscopic videos have various domain-specific characteristics that need to be addressed when dealing with this special kind of video. This section describes the most distinctive aspects and gives an overview of corresponding methods that are applied as a preparatory step prior to other analysis techniques and/or enhance the image quality for the surgeon.

2.1 Image enhancement

A number of publications deal with the enhancement of frames from endoscopic videos in order to improve the visual quality of the video. This means that the underlying data, i.e., the pixels of the individual frames, are not only analyzed but also modified, whereas the analysis approaches described in the upcoming sections only extract information without changing the content. In this context a number of well-established general-purpose image processing techniques can be applied, but this section will focus on techniques and research findings that specifically address the domain of endoscopy. Another aspect that is particularly important in this context is real-time capability because the optimized result should be instantly visible on the screen during a procedure. However, image enhancement and pre-processing is not only interesting for real-time applications but can also be of great importance as a preparation step for any kind of further automatic processing. Early work in this area includes:

  • Automatic adjustment of contrast with the help of clustering and histogram modification [207].

  • Removal of temporal noise, i.e., small flying particles or fast moving smoke only appearing for a short moment at one position, by using a temporal median filter of color values [251] (a minimal sketch of such a filter follows this list).

  • Color normalization using an affine transformation in order to get rid of a reddish tinge caused by blood during therapeutic interventions and to obtain a more natural color [251].

  • Correction of color misalignment: Most endoscopes do not use a color chipset camera but a monochrome chipset that only captures luminance information. To get a color video, red, green and blue color filters have to be applied sequentially. In case of rapid movements - which occur frequently in endoscopic procedures - the color channels become misaligned. This is not only annoying when watching the video but also hinders further automatic analysis. Dahyot et al. [47] propose to use color channel equalization, camera motion estimation and motion compensation to correct the misalignments.
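
To make the temporal median filtering idea concrete, the following is a minimal sketch in Python/NumPy (not the exact method of [251]; the window radius is an arbitrary assumption):

```python
import numpy as np

def temporal_median_filter(frames, radius=2):
    """Suppress short-lived artifacts (floating particles, smoke wisps)
    by replacing each frame with the per-pixel median over a small
    temporal window centered on it."""
    frames = np.asarray(frames)  # shape (T, H, W, 3), dtype uint8
    out = np.empty_like(frames)
    for t in range(len(frames)):
        lo, hi = max(0, t - radius), min(len(frames), t + radius + 1)
        out[t] = np.median(frames[lo:hi], axis=0).astype(frames.dtype)
    return out
```

Note that a median over a five-frame window only removes artifacts present for at most two frames at a given position; larger windows remove longer-lived noise at the cost of motion blur.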

2.1.1 Camera calibration and distortion correction

Typical endoscopes have a fish-eye lens to provide a wide-angle field of view. This characteristic is useful because the endoscopist can see a larger area. However, the drawback is a non-linear geometric distortion (barrel distortion). Objects located in the center of the image appear larger and lines appear bent, as illustrated in Fig. 2a. This distortion has to be corrected prior to advanced methods that rely on correct geometric information, e.g., 3D reconstruction or image registration. The basic problem is to find the distortion center and the parameters that describe the extent of the distortion, which is not constant but depends on the respective endoscope. This process is also known as camera calibration and includes the determination of intrinsic and extrinsic camera parameters. Vijayan et al. [247] proposed to use a calibration image showing a rectangular grid of dots. This image is captured by the endoscope, resulting in a distorted version of the calibration image. Then the transformation parameters from this distorted image to the original calibration image are calculated using polynomial mapping and least squares estimation. These parameters are used to build a model that can then be used to correct the actual frames from the endoscopic video. This approach was further improved in [277] and [77]. A further approach in [273] is applicable not only to forward-viewing endoscopes but also to oblique-viewing endoscopes. Their camera model is able to compensate for the rotation but has a higher complexity and more parameters. For calibration, they use a chess pattern image instead of a grid of dots. Further publications using this calibration pattern are [11, 12, 223]. In [72], the authors investigate whether distortion correction also affects the accuracy of CAD (Computer Aided Diagnosis). The surprising result was that for many feature extraction techniques the performance did not improve but was even worse than without distortion correction. A modest improvement was observed only for shape-based features that rely on geometric properties. Further research results in this field can be found in [90, 123, 264].
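
As a rough illustration of chessboard-based calibration and distortion correction, here is a minimal sketch using OpenCV's standard pinhole camera model (the image file names and the 9x6 pattern size are assumptions; this is generic calibration, not the specific polynomial method of [247], and strong fish-eye distortion may require OpenCV's dedicated fisheye model instead):

```python
import cv2
import numpy as np

# Ideal 3D coordinates of the inner chessboard corners (z = 0 plane).
pattern = (9, 6)  # inner corners per row/column; adapt to the actual board
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_pts, img_pts = [], []
for path in ["calib_01.png", "calib_02.png", "calib_03.png"]:  # hypothetical
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

# Estimate intrinsics (K) and distortion coefficients, then undistort a frame.
_, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts,
                                       gray.shape[::-1], None, None)
frame = cv2.imread("endoscopic_frame.png")  # hypothetical
corrected = cv2.undistort(frame, K, dist)
```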

Fig. 2 Illustration of image enhancement methods for endoscopy

2.1.2 Specular reflection removal

Endoscopic images often contain specular light reflections, also called highlights, on the wet tissue surface. They are caused by the inherent frontal illumination and are very distracting for the observer. A study conducted in [252] shows that physicians prefer images in which they are corrected. Even worse, reflections severely impair analysis algorithms because they introduce wrong pixel values and spurious edges. This also impairs image feature extraction, which is an essential technique for reconstruction, tracking etc. Hence, a number of approaches for correction have been proposed as a supporting component for other analysis methods, e.g., detection of non-informative frames [169], segmentation and detection of surgical instruments [34, 201], tracking of natural landmarks for cardiac motion estimation [70], reconstruction of 3D structures [226] or correction of color channel misalignment [8].

Most approaches consist of two phases. First, the highlights are detected in each frame. This is rather straightforward and in most cases uses basic histogram analysis, thresholding and morphological operations. Pixels with an intensity above a threshold are regarded as highlights. Some authors additionally propose to check for low saturation as a further strong indication for specular highlights ([169, 274]). In this context, the usage of various color spaces has been proposed, e.g., RGB [8], YUV [222], HSV [169], HSI [274], CIE-xyY [143]. In a second phase, the pixels identified as reflections are “corrected”, i.e., modified in a way that the resulting image looks as realistic as possible. An example of a corrected image can be seen in Fig. 2b. An important aspect is that the user should be informed about this image enhancement, because one cannot rule out the possibility that wrong information is introduced, e.g., a modified pit pattern on a polyp that can adversely affect the diagnosis. For this second phase, the following two different approaches can be distinguished:

  • Spatial interpolation: Only the current frame is considered and the pixels that correspond to specular highlights are replaced by pixels interpolated from the surrounding area. This technique is also called inpainting and has its origins in image restoration (for an overview of general inpainting techniques refer to [208]). For the interpolation, different methods have been proposed, e.g., spectral deconvolution [222], anisotropic confidence-based filling-in [70, 71] or a two-level inpainting where first the highlight pixels are replaced with the centroid of their neighborhood and finally a Gaussian blur is applied to smooth the contour of the interpolated region [8] (a minimal sketch of detection plus inpainting follows this list).

  • Temporal interpolation: With inpainting techniques, a truly correct reconstruction is not possible because the real color information for highlight pixels cannot be determined from the current frame. Hence, several approaches have been proposed that consider the temporal context [226, 253], i.e., try to find the corresponding position in preceding and subsequent frames and reuse the information from this position. This approach has a higher complexity than inpainting, but it can be used to reconstruct the real information. However, this is not always possible, especially if there is too little or very abrupt motion or if the lighting conditions change too much. Moreover, it is not applicable to single frames or WCE (Wireless Capsule Endoscopy) frames.
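
A minimal sketch of the two-phase idea, combining brightness/saturation thresholding with off-the-shelf inpainting from OpenCV (the thresholds are arbitrary assumptions and would need tuning per device):

```python
import cv2
import numpy as np

def remove_highlights(frame_bgr, v_thresh=230, s_thresh=40):
    """Phase 1: flag bright, weakly saturated pixels as specular
    highlights. Phase 2: replace them by spatial inpainting."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    _, s, v = cv2.split(hsv)
    mask = ((v > v_thresh) & (s < s_thresh)).astype(np.uint8) * 255
    # Dilate so the bright halo around each highlight is replaced as well.
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8))
    return cv2.inpaint(frame_bgr, mask, inpaintRadius=5,
                       flags=cv2.INPAINT_TELEA)
```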

2.1.3 Image rectification

In surgical practice, a commonly used type of endoscope is the oblique-viewing endoscope (e.g., 30°). The advantage of this design is the possibility to easily change the viewing direction by rotating the endoscope around its axis. This enables a larger field of view. The problem is that the image rotates as well, resulting in a non-intuitive orientation of the body anatomy. The surgeon has to mentally rotate the image back in order not to lose orientation. The missing information about the image orientation is especially a problem in Natural Orifice Translumenal Endoscopic Surgery (NOTES), where a flexible endoscope is used (as opposed to rigid endoscopes as in laparoscopy). Some approaches have been proposed that use modified equipment to tackle this problem, e.g., an inertial sensor mounted on the endoscope tip [80], but hardware modifications always limit the practical applicability. Koppel et al. [102] propose an early vision-based solution. They track 2D image features to estimate the camera motion. Based on this estimation, the image is rectified, i.e., rotated such that the natural “up” direction is restored. Moll et al. [149] improve this approach by using the SURF descriptor (Speeded Up Robust Features), RANSAC (Random Sample Consensus) and a bag-of-visual-words approach based on Integrated Region Matching (IRM). A different approach [60] exploits the fact that endoscopic images often feature a “wedge mark”, a small spike outside the image circle that visually indicates the rotation. By detecting the position of this mark, the rotation angle can easily be computed.
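
The vision-based variant can be sketched as follows: estimate the inter-frame rotation from matched features, accumulate it, and rotate the image back (a simplified stand-in for the cited approaches; ORB features and RANSAC are used here merely because they ship with OpenCV):

```python
import cv2
import numpy as np

def rotation_between(prev_gray, curr_gray):
    """Estimate the in-plane rotation (degrees) between two frames from
    matched ORB features, with RANSAC to reject bad correspondences."""
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])
    M, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    return np.degrees(np.arctan2(M[1, 0], M[0, 0]))

def rectify(frame, accumulated_angle):
    """Rotate the frame back so the accumulated rotation is undone
    (the sign depends on how the angle is accumulated)."""
    h, w = frame.shape[:2]
    R = cv2.getRotationMatrix2D((w / 2, h / 2), accumulated_angle, 1.0)
    return cv2.warpAffine(frame, R, (w, h))
```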

2.1.4 Super resolution and comb structure removal

In diagnostic endoscopic procedures like colonoscopy, it is important to visualize very fine details - e.g., patterns on the colonic mucosa surface - to make the right diagnosis. Super-resolution [237] has been proposed as a means to increase the level of detail of HD videos and enable a better diagnosis - both manual and automatic [75, 76]. The idea of super-resolution is to combine the high-frequency information of several successive low-resolution images into an image with higher resolution and more details. However, the authors come to the conclusion that their approach has a significant impact neither on the visual quality nor on the classification accuracy. Duda et al. [55] propose to apply super-resolution to WCE images. Their approach is very fast because it simply computes a weighted average of the upsampled and registered frames, and it is shown to perform better than bilinear interpolation.
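
A crude sketch of this multi-frame fusion idea (an unweighted average as a stand-in for the weighted average in [55]; grayscale frames and purely translational motion are simplifying assumptions):

```python
import cv2
import numpy as np

def fuse_superres(gray_frames, scale=2):
    """Upsample each frame, register it to the first one via phase
    correlation, and average the aligned stack."""
    up = [cv2.resize(f, None, fx=scale, fy=scale,
                     interpolation=cv2.INTER_CUBIC).astype(np.float32)
          for f in gray_frames]
    acc = up[0].copy()
    h, w = up[0].shape
    for f in up[1:]:
        # Estimated translation of f relative to the reference frame.
        (dx, dy), _ = cv2.phaseCorrelate(up[0], f)
        M = np.float32([[1, 0, -dx], [0, 1, -dy]])  # shift f back onto ref
        acc += cv2.warpAffine(f, M, (w, h))
    return (acc / len(up)).astype(np.uint8)
```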

Rupp et al. [200] use super-resolution for a different task, namely to improve the calibration accuracy of flexible endoscopes (also called fiberscopes). This type of endoscope uses a bundle of coated glass fibers for light transmission, which produces image artifacts in the form of a comb-like pattern (see Fig. 2c). This comb structure hampers an exact calibration, but also many other analysis tasks like feature detection. Several methods for comb structure removal have been proposed, e.g., low-pass filtering [50], adaptive reduction via spectral masks [266], or spatial barycentric or nearest-neighbor interpolation between pixels containing fiberscopic content [57]. These methods typically involve some kind of low-pass filtering, meaning that edges and contours are blurred. The lost high-frequency components can be restored by applying super-resolution algorithms.
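
For illustration, a spectral-mask removal of the comb pattern can be sketched as follows (an idealized low-pass mask, not the adaptive masks of [266]; the cut-off radius is an arbitrary assumption):

```python
import numpy as np

def remove_comb_pattern(gray, keep_radius=60):
    """Suppress the regular comb pattern of a fiberscope by zeroing
    high-frequency components in the centered Fourier spectrum."""
    f = np.fft.fftshift(np.fft.fft2(gray.astype(float)))
    h, w = gray.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    f[dist > keep_radius] = 0  # ideal low-pass; a smooth (e.g., Gaussian)
                               # mask would reduce ringing artifacts
    return np.abs(np.fft.ifft2(np.fft.ifftshift(f)))
```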

2.2 Information filtering

Endoscopic videos typically contain a considerable number of frames that do not carry any relevant information and are therefore useless for content-based analysis. Hence, it is desirable to automatically detect such frames and sort them out, i.e., perform a temporal filtering. This can be regarded as a different kind of pre-processing, with the difference that the pixels of individual frames are not modified; instead, the video itself is modified by removing frames. This idea is closely related to video summarization (see Section 4.2.4), which can be seen as a more aggressive form of frame filtering: the goal there is to select especially informative frames or sequences and reduce the video even further. Moreover, it is often the case that only parts of an image are non-informative, while other regions are indeed relevant for the analysis. To concentrate analysis on such selected regions, several image segmentation techniques have been proposed to perform a spatial filtering.

2.2.1 Frame filtering

In the literature, different criteria can be found for a frame to be considered informative or non-informative. The most important criterion is blurriness. According to [9], about 25 % of the frames of a typical colonoscopy video are blurry. Oh et al. [169] propose to use edge detection and compute the ratio of isolated pixels to connected pixels in the edge image to determine the blurriness. As this method depends very much on the selection of thresholds and further parameters, they propose a second approach using the discrete Fourier transform (DFT). Seven texture features are extracted from the gray-level co-occurrence matrix (GLCM) of the resulting frequency spectrum image and used for k-means clustering to differentiate between blurry and clear images. A similar approach by [9] uses the 2D discrete wavelet transform with a Haar wavelet kernel to obtain a set of approximation and detail coefficients. The L2 norm of the detail coefficients of the wavelet decomposition is used as feature vector for a Bayesian classifier. This method is nearly ten times faster than the DFT-based method and also has a higher accuracy. Rungseekajee and Phongsuphap [189, 199] took up the edge-based approach for the domain of thoracoscopy and added adaptive thresholding as a pre-processing step to reduce the effect of lighting conditions. Besides, they claim that the Sobel edge detector is more appropriate for this task than the Canny edge detector because it detects fewer edges stemming from irrelevant details caused by noise. Another approach [10] uses inter-frame similarities and the concept of manifold learning for dimensionality reduction to cluster indistinct frames. Grega et al. [68] compared the different approaches for the domain of bronchoscopy and reported F-measure, sensitivity, specificity and accuracy of at least 87 %. According to them, the best-scoring alternative is a transformation-based approach using the discrete cosine transform (DCT).
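
The wavelet-based blur feature can be sketched in a few lines with PyWavelets (a minimal sketch of the feature extraction only; training a classifier on labeled frames is assumed to happen elsewhere):

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_blur_features(gray):
    """One-level 2D Haar decomposition; the L2 norms of the detail
    sub-bands are small for blurry frames and large for sharp ones."""
    _, (cH, cV, cD) = pywt.dwt2(gray.astype(float), 'haar')
    return np.array([np.linalg.norm(c) for c in (cH, cV, cD)])

# These features could then feed a Bayesian classifier, e.g.:
#   from sklearn.naive_bayes import GaussianNB
#   clf = GaussianNB().fit(train_features, train_labels)
```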

Especially in the context of WCE (Wireless Capsule Endoscopy), the presence of intestinal juices is another criterion for non-informative images. Such images are characterized by bubbles that occlude the visualization field. Vilarino et al. [248] use Gabor filters to detect them. According to their studies, 23 % of all images can be discarded, meaning that the visualization time for the manual diagnostic assessment as well as the processing time for automatic diagnostic support can be considerably reduced. In [13], a similar approach is proposed that uses a Gauss Laguerre transform (GLT)-based multiresolution texture feature and introduces a second step that uses spatial segmentation of the bubble region to classify ambiguous frames.

A further type of non-informative frames are out-of-patient frames, i.e., frames from scenes that are recorded outside the patient's body. They often occur at the beginning or end of a procedure because it is not always possible to start and stop the recording at exactly the right time. The need to trigger the recording manually deters many endoscopists from recording videos at all. To address this issue, the authors of [218] propose a system that automatically detects when a colonoscopic examination begins and ends. Every time a new procedure is detected, the system starts recording and writes a video file to disk until the end of the procedure is detected. The proposed approach uses simple color features that work well for the domain of colonoscopy. In [217], the authors extend their approach by various temporal features that take into account the amount of motion to avoid false positives.
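
A simple color-based heuristic of this kind could look as follows (the hue ranges and threshold are arbitrary assumptions, not the features of [218]):

```python
import cv2

def looks_in_patient(frame_bgr, min_reddish_fraction=0.5):
    """Heuristic: in-patient colonoscopy frames are dominated by
    saturated reddish hues; out-of-patient frames usually are not."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h, s, _ = cv2.split(hsv)
    reddish = ((h < 15) | (h > 165)) & (s > 50)  # hue wraps around 0/180
    return reddish.mean() > min_reddish_fraction
```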

2.2.2 Image segmentation

Instead of discarding complete frames, some authors try to identify non-informative regions in endoscopic images. In further processing steps, only the informative regions have to be considered, which speeds up processing and improves accuracy. Such spatial filtering can also be used as a basis for temporal filtering by defining a threshold ratio between the size of informative and non-informative regions. A typical irrelevant region is the border area outside the characteristic circular content area of endoscopic images. It contains no useful information but only noise that impairs analysis as well as compression. In [159], an efficient domain-specific algorithm is proposed to detect the exact circle parameters. Bernal et al. [20] propose a model of appearance of non-informative lumen regions that can be discarded in a subsequent CAD (Computer Aided Diagnosis) component. Prasath et al. [181] also use image segmentation to differentiate between lumen and mucosa, but they use the result as a basis for 3D reconstruction. For WCE images, the authors of [135] combine morphological operations, fuzzy k-means clustering, sigmoid functions, statistical features, Gabor filters, the Fisher test and a neural network in the HSV color space to differentiate between informative and non-informative regions. In the context of CAD, image segmentation is also used as a basis for shape-based features. Here, the goal is to determine the boundaries of polyps, tumors and lesions [21, 83] or other anomalies like bleeding or ulceration in WCE images [232].

In the case of surgical procedures, the most frequently addressed targets of image segmentation are surgical instruments. They can be tracked in order to understand the surgical workflow or assess the skills of the surgeon. For more details on instrument detection and tracking please refer to Section 3.2.6. Few approaches have been proposed for the segmentation of anatomical structures. Chhatkuli et al. [38] show how segmentation of the uterus in gynecological laparoscopy using color and texture features improves the performance of Shape-from-Shading 3D reconstruction and feature matching. Bilodeau et al. [24] combine graph-based segmentation and multistage region merging to determine the boundary of the operated disc cavity in thoracic discectomy, which is a useful depth cue for 3D reconstruction and very important in this surgery type to correctly estimate the distance to the spinal cord.

3 Real-time support at procedure time

The use case of endoscopic video analysis that has been studied most extensively in the literature is to directly support the physician during the procedure in various ways. The application scenarios can be categorized into (1) Diagnostic Decision Support and (2) Computer Integrated Surgery, which includes Augmented Reality as well as Robot-Assisted Surgery.

3.1 Diagnostic decision support

In the case of diagnostic procedures like colonoscopies or gastroscopies, the main goal is to assist physicians in their diagnosis by deciding whether the anatomy is normal or abnormal. This is done by detecting and classifying suspicious patterns in images that correspond to abnormalities like polyps, lesions, inflammations and tumors. Figure 3 illustrates examples of normal (first row) and abnormal (second row) images. Such decision support systems are often called CAD (Computer Aided Diagnosis) systems and are already used to some extent in clinical practice. In general, CAD systems strive to be real-time capable to provide immediate feedback during the examination, e.g., [261]. If the physician misses a suspicious structure, the system can highlight the corresponding region to indicate that it should be investigated in detail and maybe a biopsy should be taken. If CAD is applied as post-processing, such a reaction of the physician is no longer possible. However, some state-of-the-art approaches are still too computationally expensive and can currently only be applied offline after the examination. In the special case of WCE, the diagnostic support does not have to be real-time, because the physician reviews the images only after the actual procedure is finished and all images have been acquired. For WCE, aside from the detection of structures like polyps or tumors, the detection of images showing bleedings is of particular interest [66, 119].

Fig. 3 Typical normal (first row) and abnormal (second row) images [238]

CAD systems typically use pattern recognition and machine learning algorithms to identify abnormal structures. After various pre-processing steps, visual features are extracted and fed into a classifier, which then delivers the diagnosis in the form of a classification result. The classifier has to be trained in advance with a potentially large number of labeled examples. The most frequently used classifiers are Support Vector Machines (SVM) and Neural Networks. Often, dimensionality reduction techniques like Principal Component Analysis (PCA) are used, e.g., in [238]. Numerous alternatives for the selection of features and classifiers have been proposed. Some approaches exploit the characteristic shape of polyps, e.g., [21]. Considering the shape also enables unsupervised methods, which require no training, e.g., the extraction of geometric information from segmented images [83] or simple counting of regions of a segmented image [49]. Many approaches use texture features, e.g., based on wavelets or local binary patterns (LBP) [6, 86, 118, 257]. In many cases, color information adds a lot of additional information, so color-texture combinations are also common [93]. Simple color and position features have also been shown to perform reasonably well despite their low complexity compared to more sophisticated approaches [4].
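
The typical feature-plus-classifier pipeline can be sketched as follows, using uniform LBP histograms and an SVM (a generic illustration with scikit-image and scikit-learn, not a reimplementation of any cited system; labeled training patches are assumed to exist):

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(gray, P=8, R=1):
    """Uniform LBP histogram: a compact, rotation-robust texture
    descriptor for a grayscale image patch."""
    lbp = local_binary_pattern(gray, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

# train_patches: grayscale mucosa patches; labels: 0 = normal, 1 = abnormal
# clf = SVC(kernel="rbf").fit([lbp_histogram(p) for p in train_patches], labels)
# prediction = clf.predict([lbp_histogram(test_patch)])
```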

Most publications concentrate on one specific feature, but there are also attempts to use a mix of features. For example, Zheng et al. [279] propose a Bayesian fusion algorithm to combine various feature extraction methods, which provide different cues for abnormality. A very recent contribution by Riegler et al. [193] proposes to employ Multimedia methods for disease detection and shows promising preliminary results. A detailed survey of gastrointestinal CAD systems can be found in [122].

3.2 Computer integrated surgery

In the case of surgical endoscopy, we can differentiate between “passive” support in the form of Augmented Reality and “active” support in the form of robotic assistance.

In the former case, supplemental information from other imaging modalities (MRT, CT, PET etc.) is displayed to improve navigation, enhance the viewing conditions or provide context-aware assistance. In the latter case, surgical robots are used to improve surgical precision for complex and delicate tasks. While early systems acted as direct extenders of the surgeon's movements, recent research strives to have more and more actions carried out autonomously by the robot. Both cases pose a number of typical Computer Vision problems (object detection and tracking, reconstruction, registration etc. in order to “understand” the surgical scene), hence video analysis is an essential component.

All these ideas and techniques can be subsumed under the concept of Computer Integrated Surgery (CIS), sometimes also referred to as surgical CAD (due to the popularity of the term CAD) [235]. The underlying idea is to integrate all phases of treatment with the support of computer systems, and in particular medical imaging. This includes intra-operative endoscopic imaging as well as pre-operative diagnostic imaging modalities like CT (X-ray Computed Tomography), MRI (Magnetic Resonance Imaging), PET (Positron Emission Tomography) or sonography (“ultrasound”). These modalities are used for diagnosis and planning of the actual procedure and are often essential for precise navigation to the surgical site [14]. This is especially the case for surgeries that require a very high accuracy due to the risk of damaging healthy tissue (e.g., endonasal skull base surgery [145]). Navigation support is also important for diagnostic procedures like bronchoscopy (examination of the lung airways), where the flexible endoscope has to traverse a complex tree structure with many branches to find the biopsy site that has been identified prior to the examination [48, 79]. These pre-operative images or volumetric models are aligned with general information about human anatomy (anatomy atlases) in order to create a patient-specific model that enables comprehensive and detailed procedure planning. This pre-operative model is then registered to the intra-operative video images in real-time to guide the surgeon by overlaying additional information, performing certain tasks autonomously or increasing surgical safety by imposing constraints on surgical actions that could harm the patient. For such assistance, the system has to monitor the progress of the procedure and, in case of complications, automatically adapt or update the surgical plan [234].

3.2.1 Augmented reality

An essential concept for CIS is Augmented Reality (AR) - a technique to augment the perception of the real world with additional virtual components that are generated by the computer and have to be aligned to the real-world environment in real-time. In the case of endoscopic surgery, the “virtual” components are usually obtained from the pre-operative patient model and surgical plan. They can be visualized in several ways, e.g., on the ordinary monitor, through a head-mounted display (HMD), which can either be an optical or a video see-through HMD, or as a projection directly on the patient body [209]. Without AR, the surgeon has to mentally merge this isolated information with the live endoscopic view, which causes additional mental load. By augmenting the endoscopic video images with this kind of information, target structures can be highlighted for easier localization and hidden anatomical structures (e.g., vessels or tumors below the organ surface) can be visualized as overlays in order to improve safety and avoid surgical errors and complications. The key challenge for AR is to align the endoscopic video with the pre-operative data, i.e., to fuse them into a common coordinate system - a technique called image registration. The problem is massively exacerbated by the fact that the soft tissue is not rigid but shifts and deforms. Hence, another research challenge is to track the tissue deformation and derive deformation models in order to update the registration. In addition to Computer Vision algorithms (e.g., calibration, registration, 3D reconstruction), AR systems often rely on external optical tracking systems, which are used to determine the position and motion of the endoscope or an instrument [23, 29]. This implies that instruments have to be modified by attaching markers, which are tracked by an array of infrared cameras rigidly mounted on the ceiling of the operating room. The drawback of such methods is the limited applicability in a practical scenario due to the necessary hardware modification. An example of the application of AR is depicted in Fig. 4.

Fig. 4 MRI image showing a uterus with two myomas, the corresponding pre-operative model and the visualization of the overlay on the endoscopic image [45] Ⓒ 2014 IEEE

Surgical navigation

AR is especially helpful in the field of surgical oncology [167], i.e., the surgical management of cancer. The exact position and size of the tumor is often not directly visible in the endoscopic images, hence an accurate visualization helps to choose an optimal dissection plane that minimizes damage to healthy tissue. Such systems are often called “Surgical Navigation Systems” because they support the navigation to the surgical site. They have been proposed for various procedures, e.g., prostatectomy (removal of the prostate gland) [82, 210]. Mirota et al. [146] provide a comprehensive overview of vision-based navigation in image-guided interventions for various procedure types.

Viewport enhancement

A further possible application of AR is to improve the viewing conditions of the surgeon. This can be done by expanding the restricted viewport and visualizing the surrounding area using image stitching methods [19, 153, 241], potentially even in 3D [265]. Similar techniques have also been investigated for the purpose of video summarization, e.g., to obtain a condensed representation of an examination for a medical record (see Section 4.2.4). Other approaches even provide an alternative point of view with improved visualization. In [23], a “Virtual Mirror” is proposed that enables the surgeon to inspect the virtual components (e.g., a volumetric model of the liver from a pre-operative CT scan) from different perspectives in order to understand complex structures (e.g., blood vessel trees) and improve depth perception as well as navigational tasks. Fuchs et al. [59] propose a prototypical system that restores the physician's natural viewpoint and visualizes it via a see-through HMD. The underlying idea is to free the surgeon from the inherent technical limitations of the imaging system and enable a more “direct” view on the patient, similar to that in traditional open surgery. The surgeon can change the viewing perspective by moving their head instead of moving the laparoscope.

3.2.2 Context awareness

Several medical studies prove the clinical applicability of Augmented Reality in various endoscopic operation types, e.g., nephrectomy [236], prostatectomy [246], laparoscopic gastrointestinal procedures [229] or splenectomy [88]. However - despite all the potential benefits - too much additional information may distract surgeons from the actual task [51]. The goal should be to automatically select the appropriate assistance for the current state of the procedure in a context-aware manner, providing hints or situation-specific additional information. Such assistance can also go beyond visualizations of the pre-operative model, e.g., it can support decision making by finding situations similar to the current one and showing how other surgeons handled a similar exceptional situation [95]. Another use case is to simulate the effect of a surgical step without actually executing it [115]. Speidel et al. [214, 228] propose an AR system that warns in case of risk situations. The system alerts the surgeon if an instrument comes too close to a risk structure (ductus cysticus or arteria cystica in the case of cholecystectomy).

The basis for such context-aware assistance is the semantic understanding of the current situation. The field of Surgical Process Modeling (SPM) is concerned with the definition of appropriate workflow models [110, 166]. The main challenge is to formalize the existing expert knowledge, be it formal knowledge from textbooks or experience-based knowledge that also considers variations and deviations from theory. First, typical surgical situations or tasks have to be defined. Some approaches focus on fine-granular gestures such as “insert a needle”, “grab a needle”, “position a needle” [18], or more generic actions like tool-tissue interaction in general [255]. The detection of such “low-level” tasks can also be used as a basis for the assessment of surgical skills (see Section 4.1.1). On a higher abstraction level, a procedure is subdivided into pre-defined operation phases that describe the typical sequence of a surgery. Existing approaches focus on well-standardized procedures, in most cases cholecystectomy (removal of the gallbladder) [25, 95, 99, 109, 174], which can be broken down into distinct phases very well. Surgical workflow understanding is also of particular interest for post-procedural applications, especially for temporal video segmentation and summarization (see Section 4.2.3).

A very discriminative feature to distinguish between phases is the presence of operation instruments, which can be detected by video analysis (see Section 3.2.6 for more information). However, metadata obtained from video analysis is only one of many possible inputs for surgical situation understanding systems proposed in the literature. Often, various additional sensor data are used, e.g., the weight of the irrigation and suction bags, the intra-abdominal CO2 pressure and the inclination of the surgical table [221] or a coagulation audio signal [262]. The focus in this research area is not on how to obtain the required information from the video, but how to map the available signals (e.g., binary information about instrument presence) to the corresponding surgical phase. Therefore, instrument detection is often achieved by hardware modifications like RFID tags or color markers or even by manual annotations from an observer [166].

Several authors propose to use statistical methods and machine learning to recognize the current situation. The most popular methods are Hidden Markov Models (HMMs) [25, 111, 174, 196]. An HMM is a directed graph that defines possible states and transition probabilities and is built from training data. Another frequently used method is Dynamic Time Warping (DTW) [58], which can also be used without explicit pre-defined models by temporally aligning surgeries of the same type [3]. Besides HMMs and DTW, alternative methods like Random Forests have also been proposed [221]. A fundamentally different approach is to use formal knowledge representation methods like ontologies that use rules and logical reasoning to derive the current state. For example, Katic et al. use Description Logic in the OWL standard (Web Ontology Language) [94, 95, 214].
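
To make the HMM idea concrete, the following is a minimal Viterbi decoder that maps a sequence of discrete observations (e.g., instrument-presence codes) to the most likely phase sequence; the initial, transition and emission probabilities would be estimated from annotated surgeries (plain dynamic programming, not a reimplementation of the cited systems):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state (phase) sequence for discrete observations.
    pi: initial probabilities (S,), A: transitions (S, S),
    B: emissions (S, O), obs: observation indices (T,)."""
    S, T = len(pi), len(obs)
    logp = np.log(pi) + np.log(B[:, obs[0]])     # best score per state
    back = np.zeros((T, S), dtype=int)           # backpointers
    for t in range(1, T):
        scores = logp[:, None] + np.log(A)       # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        logp = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logp.argmax())]
    for t in range(T - 1, 0, -1):                # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

For online phase recognition, the same forward recursion can be run incrementally, reporting the currently most likely phase after each new observation.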

3.2.3 Robot-assisted surgery

The majority of publications dealing with endoscopic video analysis have their roots in the robotics community and aim at integrating robotic components into the surgical workflow. Medical robots are not intended to replace human surgeons, but to extend their capabilities and improve efficiency and effectiveness by overcoming certain limitations of traditional laparoscopic surgery. They considerably enhance dexterity, precision and repeatability by using mechanical wrists that are controlled by a microprocessor. Motion can be stabilized by filtering hand tremor and scaled for micro-scale tasks that are not possible manually [115]. Robotic surgery systems have been in practical use for several years, especially for very precarious procedures like prostatectomy [250]. However, current systems like the da Vinci system [74] are pure “surgeon extenders”, i.e., the surgeon directly controls the slave robot via a master console. In this telemanipulation scenario, the robot has no autonomy and only “enhances” the surgeon's movements, e.g., by hand tremor filtering and motion scaling. An overview of popular surgical robot systems can be found in [230] and [250]. State-of-the-art research tries to extend the robot's autonomy, which requires numerous image/video analysis and Computer Vision techniques. Robotic systems inherently provide additional data that can be used to facilitate video analysis, e.g., kinematic motion data, information about instrument usage and stereoscopic images that allow for an easier 3D reconstruction of the scene.

An important application with a moderate degree of autonomy is the automation of endoscope holding [36, 254, 256, 278]. This task is usually carried out by an assistant, but during lengthy procedures, humans suffer from fatigue, hand tremor etc., so automation of this task is greatly appreciated by surgeons. The endoscope should always point at the current area of interest, which is typically characterized by the presence of the instrument tips. Hence, instrument positions have to be detected and the robot arm has to be moved such that the endoscope is adequately positioned without colliding with tissue, and the right zoom level is chosen [211]. Another application are automatic safety checks, e.g., in the form of active constraints or virtual fixtures. A virtual fixture [140, 177] is a constraint that reduces the precision requirements. It can be used to define forbidden regions or a safety margin around critical anatomical structures, which must not be damaged, in order to prevent erroneous moves, or to simplify the execution of a task by “guiding” the instrument motion along a safe corridor [30]. The long-term vision is to enable commonly occurring tasks like suturing to be executed autonomously on a high-level command of the surgeon (e.g., by pointing at the target position with a laser pointer) [105, 175, 220]. The main challenge is to safely move the instrument to the desired 3D position without harming the patient, a process referred to as visual servoing. Such assistance requires a very detailed understanding of the surgical scene, including a precise 3D model of the anatomy, registered to the pre-operative model and also considering tissue deformations, as well as the exact location of relevant anatomical objects and instruments. Surgical task models as discussed above are also of great importance for this scenario. A survey of recent advances in the field of autonomous and semi-autonomous actions carried out by robotic systems is given in [156].

3.2.4 3D reconstruction

The reconstruction of the local geometry of the surgical site is an essential requirement for Robot-Assisted Surgery. It produces a three-dimensional model of the anatomy in which the instruments are positioned. A 3D model is also required for many Augmented Reality applications, for registration with volumetric pre-operative models (e.g., from CT scans). For diagnostic procedures, the analysis of the 3D shape of suspicious objects can be more expressive than the 2D shape alone. The fundamental challenge of 3D reconstruction is to map the available 2D image coordinates to 3D world coordinates.

In the context of Robot-Assisted Surgery, usually stereoscopic endoscopes are used to improve the depth perception of the surgeon. The stereo images also facilitate correspondence-based 3D reconstruction [22, 194, 225]. The challenge is to identify matching image primitives (e.g., feature points) between the left and right image, which can then be used to calculate the depth information by triangulation. However, this task is still far from trivial because of various aggravating factors like homogeneous surfaces with few distinct visual features, occlusions, specular reflections, image perturbations (smoke, blood etc.) and tissue deformations. In traditional laparoscopy, which is not supported by a robot, the endoscopes used are usually monoscopic. In this case, Structure-from-Motion (SfM) methods can be applied [44, 81, 138]. For SfM, the different views are obtained when the camera is moved to a different position. Camera motion estimation is required to estimate the displacement of the camera position, which is necessary for the triangulation, while in the stereoscopic case, the displacement is inherently known. A related method that is often used in the robotics domain is SLAM (Simultaneous Localization And Mapping), which iteratively constructs a map of the unknown environment and at the same time keeps track of the camera location. Traditional SLAM assumes a rigid environment, which does not hold in the endoscopic case. Therefore, attempts have been made to extend SLAM with respect to tissue deformations [67, 154, 242]. A further common problem for both SfM and SLAM is the often scarce camera motion. An alternative approach that deals with single images and therefore does not depend on the problem of finding correspondences is Shape-from-Shading (SfS), where the depth information is derived from the shading of the surface of anatomic objects. However, again some basic assumptions of generic SfS do not hold for endoscopic images, hence adaptations are necessary to obtain acceptable results [43, 171, 269]. The generalization of SfS to multiple light sources is referred to as photometric stereo and has also been proposed for 3D reconstruction of endoscopic videos [42, 178]. This technique requires hardware modifications to obtain different illumination conditions. Other active methods requiring hardware modifications that have been proposed are Structured Light [1] and Time-of-Flight [179]. The former projects a known light pattern on the surface and reconstructs depth information from the deformation of the pattern. The latter uses non-visible near-infrared light and measures the time until it is reflected back.
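
For the stereoscopic case, a minimal dense-reconstruction sketch with OpenCV looks as follows (it assumes an already rectified image pair and the reprojection matrix Q from stereo calibration; the file names and matcher parameters are placeholders):

```python
import cv2
import numpy as np

# Rectified stereo pair from a stereo endoscope (hypothetical files).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; numDisparities must be a multiple of 16.
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)
disparity = sgbm.compute(left, right).astype(np.float32) / 16.0  # fixed-point

# With the 4x4 reprojection matrix Q from cv2.stereoRectify:
# points_3d = cv2.reprojectImageTo3D(disparity, Q)  # (H, W, 3) coordinates
```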

All these approaches have their drawbacks, hence several attempts have been made to improve the performance by fusing multiple depth cues, e.g., stereoscopic video and SfS [127, 249], SfM and SfS [139, 239], or by incorporating patient specific shape priors extracted from pre-operative images [7]. However, 3D reconstruction still remains a very challenging task in the endoscopic domain. Several recent surveys about this topic are available [63, 69, 136, 176].

3.2.5 Image registration and tissue deformation tracking

The process of bringing two images of the same scene into one common coordinate system is referred to as image registration. This includes the computation of a transformation model that describes how one image has to be modified to obtain the second image. This is a typical optimization problem that can be solved with optimization algorithms like Gauss-Newton. The classical application of medical image registration is to align images from different modalities, e.g., pre-operative 3D CT images and intra-operative 2D X-ray projection images [141]. We can distinguish between different types of registration, depending on the dimensionality of the underlying images (2D slice, 3D volumetric, video), as reviewed in [121].
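
As a small illustration of registration as iterative optimization, OpenCV's ECC algorithm refines a parametric warp by gradient-based maximization of a correlation criterion (shown here for a 2D affine model between two grayscale images; the motion model and termination criteria are assumptions):

```python
import cv2
import numpy as np

def register_affine(fixed_gray, moving_gray):
    """Iteratively refine an affine warp that maximizes the ECC
    similarity between the moving and the fixed image."""
    warp = np.eye(2, 3, dtype=np.float32)  # start from the identity
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
    _, warp = cv2.findTransformECC(fixed_gray, moving_gray, warp,
                                   cv2.MOTION_AFFINE, criteria)
    h, w = fixed_gray.shape
    # Warp the moving image into the coordinate system of the fixed one.
    return cv2.warpAffine(moving_gray, warp, (w, h),
                          flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
```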

For Augmented Reality scenarios where the endoscopic video stream is used as intra-operative modality, various use cases have been addressed in the literature, e.g., registering 3D CT models of the lungs with bronchoscopic videos [79, 131]. Similar examples relying on registration are sinus surgery (nose) [31] or skull base surgery [147]. In terms of laparoscopic surgery, interesting contributions are 3D-to-3D registration of the liver during laparoscopic surgery [227] and coating of the pre-operative 3D model with texture information from the live view [258]. An important prerequisite for registration is an accurate camera calibration (see Section 2.1.1) to obtain correct geometric correspondences.

A topic closely related to image registration is object tracking, i.e., following the motion of a region of interest over time, either in two- or three-dimensional space. It can be seen as an intra-modality (as opposed to inter-modality) registration, i.e., successive frames of a video are registered. In case of camera motion between frames, the transformation can be described by translation, rotation and scaling. Estimating the camera motion is an important technique for 3D reconstruction (especially SfM and SLAM) and for generating panoramic images. However, in the endoscopic video domain, the transformation is usually much more complex. The main reason is the fact that the soft tissue is not rigid but deforms non-linearly, requiring adaptations of many established methods that assume a rigid environment (e.g., SLAM). Tissue deformation occurs for three main reasons: (1) organ shift due to insufflation (mainly relevant for inter-modality registration), (2) periodic motion caused by respiratory and cardiac cycles as well as muscular contraction, and (3) tool-tissue interaction. Tracking the tissue deformation is a challenging research topic that has gained considerable attention in recent years. It is particularly important for updating the reconstructed 3D model that has to be registered with the static pre-operative model for Augmented Reality applications and Robot-Assisted Surgery, but also for tracking anatomical modifications for surgical workflow understanding. Periodic motion can be described well by a model, e.g., Fourier series [192]. Estimating the periodic motion of the heart is of particular interest for robot-assisted motion compensation and virtual stabilization [202].

Both registration and tissue tracking share the basic problem of finding a set of correspondences between two images in order to compute the transformation model. This is usually based on some kind of “landmarks” that can be identified in both images. In a matching step, an algorithm decides which landmarks represent the same position. Landmarks can either be artificial, e.g., represented by color markers attached to the tracking target as in [202], or natural. For tissue tracking, artificial markers (also called fiducials or fiducial markers) can hardly be used, therefore tracking algorithms have to rely on natural landmarks, i.e., salient image features that can clearly be distinguished from their surrounding and are unique, e.g., vessel junctions and surface textures. One possibility is to work in the image space and use region-based representations of regions of interest in the form of pixel patches. However, these representations are often not sufficiently expressive and not very robust against illumination changes, specular highlights and occlusions. Hence, feature-based representations have become established as the preferred method. They make it possible to detect natural landmarks and extract specific information that is represented by a feature descriptor. The most common descriptors are SIFT (Scale Invariant Feature Transform) [130] and SURF (Speeded-Up Robust Features) [15]. They are popular because of their beneficial characteristics like scale and rotation invariance and robustness against illumination changes and noise. Further feature descriptors that have been used for tissue tracking are MSER (Maximally Stable Extremal Regions) [142, 224], STAR, which is a modified version of the Center Surrounded Extremas for Real-time Feature Detection (CenSuRE) [2], and BRIEF (Binary Robust Independent Elementary Features) [32, 275]. Mountney et al. [150] provide a comprehensive evaluation and comparison of numerous feature descriptors and present a framework for descriptor selection and fusion. Figure 5 illustrates correspondences between several pairs of images.
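
Descriptor-based correspondence search can be sketched as follows with SIFT and Lowe's ratio test (a generic sketch; the ratio threshold is a common rule of thumb, not taken from the cited works):

```python
import cv2

def match_landmarks(gray1, gray2, ratio=0.75):
    """Detect SIFT keypoints in both images and keep only matches whose
    best distance is clearly smaller than the second-best (ratio test)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(gray1, None)
    kp2, des2 = sift.detectAndCompute(gray2, None)
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in good]
```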

Fig. 5 Illustration of finding corresponding natural landmarks between pairs of images with (a) significant rotation, (b) scale change, (c) image blur, (d) tissue deformation, combined with illumination changes [64] Ⓒ 2009 IEEE

The matching strategy typically used by feature-based approaches is “tracking-by-detection”, as opposed to recursive tracking methods, e.g., Lucas-Kanade, which is based on optical flow. Recursive methods search locally for the best match of an image patch, while tracking-by-detection extracts features in each frame and then compares them to find the best matches. While recursive methods work well on small deformations, they have problems with illumination changes and reflections and suffer from error propagation. In contrast, tracking-by-detection in combination with feature-based region representation is fairly robust against large deformations and occlusions due to the abstracted feature space. A comparative evaluation of state-of-the-art feature-matching algorithms for endoscopic images has been carried out in [186].

However, although promising advances have been made recently [155, 187, 188, 205, 231, 268], deforming tissue tracking is a very hard research challenge that still requires a lot of further work. Endoscopic videos feature many domain-induced problems, such as a scarcity of distinctive landmarks due to homogeneous surfaces and indistinct texture, which makes it hard to find good points to track. Moreover, occlusions, specular reflections, cauterization smoke, blood and fluids lead to tracking points being lost. Hence, one of the main problems is robust long-term tracking. Last but not least, the real-time requirement poses a demanding challenge. A survey about three-dimensional tissue deformation recovery and tracking is available in [152].

3.2.6 Instrument detection and tracking

Besides anatomical objects, surgical instruments are the most important objects of interest in the surgical site. Therefore, a key requirement for scene understanding, Robot-Assisted Surgery and many other use cases is to detect their presence and track their position as well as their motion trajectories. The precision requirements differ with the application scenario. For surgical phase recognition, it is often enough to know which instruments are present at all; in this case, equipping the instruments with a cheap RFID tag to detect insertion and withdrawal already suffices [103]. In terms of visual analysis, a classification of the full frame can be carried out to detect the presence of instruments [183]. The next level is to determine the position of the instrument in the two-dimensional image, or more specifically the position of the instrument tip, which is the main differentiation characteristic between different types of instruments [213]. For instrument tracking as well, the position of the tip is usually considered as the reference point.

Many approaches proposed in the literature use modified equipment for localizing instruments. The most common modification is color markers on the shaft that can easily be detected and segmented [240, 263]. To distinguish between multiple instruments, different colors can be used [26]. Nageotte et al. [165] use a pattern of 12 black spots on a white surface. This modification also enables pose estimation to some extent. Krupa et al. [105] use a laser pointing device on the tip of the endoscope to derive the relative orientation of the instrument with respect to the organ. This approach even works if the instrument is outside the current field of view. Besides the fact that these methods have limited applicability to arbitrary videos from an archive, another disadvantage of such modifications is that biocompatibility and sterilizability have to be ensured, as the markers are in direct contact with human tissue. Internal kinematic data from a robot can also be used to estimate the position of instruments, but it is generally not accurate enough, especially when force is applied to a surface [30]. However, kinematics can be useful as a supplementary source of information to get a coarse estimation that is refined by visual analysis [219].

Purely vision-based approaches without any hardware modification have also been proposed. Doignon et al. [53] perform a color segmentation mainly using the saturation attribute to differentiate the achromatic instrument from the background. Voros et al. [256] define a cylindrical shape model and use geometric information to detect the edges of the tool shaft and the symmetry axis. A similar approach using the Hough transform to detect the straight lines of the instrument shaft is presented in [41]. These approaches face a number of challenges like homogeneous color distribution, indistinct edges, occlusions, blurriness, specular reflections and artifacts like smoke, blood or liquids. Moreover, methods have to deal with multiple instruments, as surgical procedures rarely rely on one single instrument.
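To make the geometric idea tangible, here is a rough sketch of shaft detection via the probabilistic Hough transform; this is a simplification in the spirit of [41], not the exact published method, and the input file and thresholds are illustrative.

```python
import cv2
import numpy as np

frame = cv2.imread("laparoscopy_frame.png")  # hypothetical input
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Edge detection followed by a probabilistic Hough transform to find
# the long straight lines that typically correspond to the tool shaft.
edges = cv2.Canny(gray, 50, 150)
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                        minLineLength=100, maxLineGap=10)

if lines is not None:
    # Keep the longest detected line as a crude shaft hypothesis.
    longest = max(lines[:, 0],
                  key=lambda l: np.hypot(l[2] - l[0], l[3] - l[1]))
    x1, y1, x2, y2 = longest
    cv2.line(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
```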

The final stage of instrument identification, and one of the current research challenges, is to determine the exact position of the tip in three-dimensional space and track its motion trajectories [5, 33]. In this context, the estimation of the instrument pose is also of particular importance [54]. This knowledge is necessary for use cases like visual servoing, context-aware alerting and skills assessment. Geometric constraints can be exploited to facilitate tracking, particularly the motion constraint imposed by the stationary incision point. Knowledge about this point restricts the search area for region seeds for color-based segmentation [52] and enables modeling of possible instrument motion [267]. Recently, a number of sophisticated approaches for 3D instrument tracking and pose estimation have been proposed, e.g., training of appearance models of individual parts of an instrument (shaft, wrist and finger) using color and texture features [180], learning of fine-scaled natural features in the form of particular 3D landmarks on instrument tips using Randomized Trees [191], learning the shape of instruments using HOG descriptors and a Latent SVM to probabilistically track them [107], and approaches to determine semantic attributes like “open/closed, stained with blood or not, state of cauterizing tools” etc. [108].
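As a flavor of the descriptor side of such learning-based approaches, the following sketch computes a HOG descriptor for a candidate image patch with OpenCV; the patch file is hypothetical, and the actual pipeline of [107] additionally involves a trained Latent SVM on top of such features.

```python
import cv2

# Hypothetical candidate patch that may contain an instrument part.
patch = cv2.imread("instrument_patch.png", cv2.IMREAD_GRAYSCALE)
patch = cv2.resize(patch, (64, 128))  # default HOG window size

# Default parameters: 64x128 window, 16x16 blocks, 8x8 cells, 9 bins,
# yielding a 3780-dimensional descriptor that could feed an SVM.
hog = cv2.HOGDescriptor()
features = hog.compute(patch)
print(features.shape)
```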

4 Post-procedural applications

In recent years, it has become more and more common to capture videos of endoscopic procedures. We experience a trend towards comprehensive documentation where entire procedures are recorded, stored and archived for documentation, retrospective analysis, quality assurance, education, training and other purposes. This raises the question of how to handle this huge corpus of data. The emerging video archives pose a challenge to management and retrieval systems. The Multimedia community has proposed several methods to enable summarization, retrieval and management of such data, but this research topic is clearly understudied compared to the real-time assistance scenario. This section gives an overview of these methods, which can be regarded as a kind of post-processing. They do not have the requirement to work in real-time because they operate on captured video data and can be executed offline. Nevertheless, performance is important in order to keep up with the constantly growing data volume.

4.1 Quality assessment

An important application for post-procedural usage of endoscopic videos that has been studied intensively is quality assessment, both of individual surgeon skills and of entire procedures. Quality control of endoscopic procedures is a very important, but also very difficult issue. In current practice, an experienced surgeon has to review videos of surgeries to subjectively assess the operating surgeon's skills and the quality of the procedure. This is a very time-consuming and cumbersome task that cannot be carried out extensively due to the high effort. Hence, it is desirable to provide automatic methods for an objective quality assessment that can be applied to each and every procedure and has the potential to strongly improve surgical quality by pointing out weak points and suggesting possible improvements. Recent works even assess quality in real-time during the procedure to provide immediate feedback to the physician and thus improve their performance [216].

4.1.1 Surgical skills assessment

Several attempts have been made to assess the psychomotor skills of surgeons by analyzing how they perform individual surgical tasks like cutting, grasping, clipping, drilling or suturing. This requires a decomposition of a procedure into individual atomic surgical gestures (often called surgemes), which can also be seen as a kind of temporal segmentation. Similar to surgical workflow understanding (see Section 3.2.2), Hidden Markov Models (HMM) are typically used to model and detect the different tasks. A reference model is trained by an expert and serves as the basis for the skills assessment. The main parameter for the assessment is the motion trajectory of instruments. Various metrics like path length, motion smoothness, average acceleration etc. have been proposed as quality indicators.
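Two of these indicators are easy to sketch, assuming a 3D instrument-tip trajectory sampled at a fixed rate; the function names, the jerk-based smoothness definition and the lack of normalization are illustrative choices rather than a fixed standard.

```python
import numpy as np

def path_length(traj):
    """Total distance traveled by the instrument tip.
    traj: (N, 3) array of tip positions."""
    return np.linalg.norm(np.diff(traj, axis=0), axis=1).sum()

def motion_smoothness(traj, dt):
    """Mean squared jerk (third derivative of position), a common
    un-normalized smoothness indicator; lower means smoother motion."""
    jerk = np.diff(traj, n=3, axis=0) / dt ** 3
    return (jerk ** 2).sum(axis=1).mean()

# Hypothetical trajectory sampled at 30 Hz:
traj = np.cumsum(np.random.randn(300, 3) * 0.001, axis=0)
print(path_length(traj), motion_smoothness(traj, dt=1 / 30))
```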

Most classic approaches use non-standard equipment to obtain this motion data in a straightforward manner, e.g., inherently available kinematic data from a simulator or surgical robot [124], trajectory data from an external optical tracker [116] and/or haptic data from an additional three-axis force/torque sensor [197]. However, such simple but expensive data acquisition is not suitable for comprehensive quality assessment on a daily basis; it is rather of interest for special applications like training, simulation and Robot-Assisted Surgery.

The more practical alternative for an extensive evaluation of surgeon skills is to extract the motion data directly from the video data with content-based analysis methods [173]. The advantage of this approach is that it can be applied to any video without any hardware modification. Furthermore, videos can provide contextual information with regard to the anatomical structures and instruments involved that can act as additional hints. To obtain the required motion data, instruments have to be detected and tracked (see Section 3.2.6), preferably in the three-dimensional space. Some contributions concerning this matter have recently been published, e.g., [89, 172, 276].

4.1.2 Assessment of screening quality

The skills assessment methods discussed above are mainly applied to surgeries and focus on instrument handling. In diagnostic procedures, usually no instruments are used; therefore, other quality criteria have to be defined. Hwang et al. [84, 168] define objective metrics that characterize the quality of a colonoscopy screening, e.g., the duration of the withdrawal phase (based on temporal video segmentation), the ratio of non-informative frames, the number of camera motion changes and the ratio of frames showing close inspections of the colon wall to frames showing a global lumen view. This framework is extended in several further publications. In [126], a “quadrant coverage histogram” is introduced that determines to what extent all sides of the mucosa have been inspected. A similar approach is proposed in [91] for the domain of cystoscopy. Here the idea is to determine to what extent the inner surface of the bladder has been inspected and whether parts have been missed. A further quality criterion for colonoscopies is the presence of stool that occludes the mucosa and thus may lead to missed polyps. Color features are used to measure the ratio of “stool images” and consequently the quality of the preceding bowel cleansing [85, 164]. On the other hand, the presence of frames showing the appendiceal orifice indicates a high-quality examination, because it means that the examination has been performed thoroughly [259]. Also the occurrence of retroflexion, a special endoscope maneuver to discover polyps that are hidden behind deep folds in the peri-anal mucosa, is a quality indicator and can be detected with a method proposed in [260]. In [168], therapeutic actions are detected and also considered for quality assessment metrics. The assessment of intervention quality is mainly applied post-operatively on recorded videos, but first attempts have also been made to include these techniques in the clinical workflow and directly notify the physician about quality deficiencies [163, 170].
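One of the simpler metrics, the ratio of non-informative frames, can be approximated with a standard blur measure such as the variance of the Laplacian; this is a common heuristic and not necessarily the exact criterion of [84, 168], and the threshold below is an illustrative value that would need tuning.

```python
import cv2

def is_informative(frame, threshold=60.0):
    """Classify a frame as informative if it is sufficiently sharp.
    The variance of the Laplacian is a standard focus/blur measure."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() > threshold

cap = cv2.VideoCapture("colonoscopy.mp4")  # hypothetical recording
total = informative = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    total += 1
    informative += is_informative(frame)
print("ratio of non-informative frames:", 1 - informative / total)
```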

4.2 Management and retrieval

The goal of video management and retrieval systems is to enable users to efficiently find exactly the information they are looking for, either within one specific video or within an archive. They have to provide means to articulate some kind of query describing the information need, or special interaction mechanisms for efficient content browsing and exploration. Especially the latter aspect has rarely been addressed for this specific domain yet and therefore offers great potential for future work.

4.2.1 Compression and storage

If an endoscopic video management system is to be deployed in a realistic scenario, domain-specific concepts for compression, storage organization and dissemination of videos are required. As for compression of endoscopic videos, the literature search revealed only very few contributions. For the field of bronchoscopic examinations, we found a statement that “it is possible to use lossy compressed images and video sequences for diagnostic purposes” [56, 185]. In terms of storage organization, [27] propose a system with a distributed architecture that uses a NoSQL database to provide access to videos within a hospital and across different health care institutions [28]. They also present a device for video acquisition and an annotation system that enables content querying [112]. Münzer et al. [160] show that the circular content area of endoscopic videos can be exploited to considerably improve encoding efficiency, and that discarding irrelevant segments (blurry, out-of-patient or dark and noisy) can save up to 20% of the storage space [161]. In [162], a subjective quality assessment study is conducted which shows that it is not necessary to archive videos in the original HD resolution; lower-quality representations still provide sufficient semantic quality. This study also contains the first set of encoding recommendations for the domain of endoscopy. A follow-up study in [148] evaluates the effective savings in storage space achieved by domain-specific video compression on an authentic real-world data set. It concludes that by using these encoding presets together with circle detection, relevance segmentation and a long-term archiving strategy, the overall storage capacity can be reduced by more than 90% in the long term without losing relevant information.
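A minimal sketch of how the circular content area could be located uses the Hough circle transform; the actual detection method in [160] may differ, and the file name and parameters are illustrative.

```python
import cv2
import numpy as np

frame = cv2.imread("endoscopy_frame.png")  # hypothetical input
gray = cv2.medianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 5)

h, w = gray.shape
circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=2,
                           minDist=w,  # expect a single dominant circle
                           param1=100, param2=50,
                           minRadius=h // 3, maxRadius=h // 2 + 50)
if circles is not None:
    cx, cy, r = np.round(circles[0, 0]).astype(int)
    # Everything outside this circle is black border that the encoder
    # can treat as irrelevant (e.g., by cropping or masking).
    mask = np.zeros_like(gray)
    cv2.circle(mask, (cx, cy), r, 255, thickness=-1)
```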

4.2.2 Retrieval

One possibility to retrieve specific information in endoscopic videos is to annotate them manually. Lux et al. present a mobile tool for efficient manual annotation of surgery videos [73, 133]. It provides intuitive direct interaction mechanisms on a tablet device that make the annotation task less tedious. For colonoscopy videos, an annotation tool called Arthemis [125] has been proposed that supports the Minimal Standard Terminology (MST) of the European Society of Gastrointestinal Endoscopy (ESGE), a standardized terminology for diagnostic findings in GI endoscopy. However, manual annotation and tagging cannot be carried out for each and every video of a video archive due to time restrictions. A more promising strategy is to use manual annotation to capture expert knowledge from a limited number of representative examples and to exploit this knowledge in automatic retrieval techniques like content-based image retrieval (CBIR).

The typical use case of CBIR is to find images that are visually similar to a query image. Up to now, image retrieval research in the medical domain has focused mainly on other image modalities (CT, MRI etc.) [106, 157], but some publications also deal with endoscopic images. An early approach [272] uses simple color histograms in HSV color space to determine the similarity between colonoscopy images. A more recent approach [270] for gastroscopic images uses image segmentation in the CIE L*a*b* color space to extract a “color change feature” and dominant color information and compares images using the Kullback-Leibler distance. Tai et al. [233] also incorporate texture information by using a color-texture correlogram and the Generalized Tversky Index as similarity measure. Furthermore, they employ interactive relevance feedback to refine the search result. Xia et al. [271] propose to use multi-feature fusion to combine color, texture and shape information. A very recent and more sophisticated approach [39] that obtains very promising results uses Multiscale Geometric Analysis (MGA) of the Nonsubsampled Contourlet Transform (NSCT) and a statistical framework based on the Generalized Gaussian Density (GGD) model and the Kullback-Leibler Distance (KLD). Another task relying on similarity search is to link a single still image captured by the surgeon during the procedure to the corresponding video segment in an archive [195]. The latest approach for this task uses Feature Signatures and the Signature Matching Distance and achieves reasonable results [16]. For laparoscopic surgery videos, a technique has been proposed to find scenes showing a specific instrument that is depicted in the query image [37]. In terms of video retrieval, [245] use the HOG (Histogram of Oriented Gradients) descriptor and a Fisher-kernel-based similarity to find similar video sequences in other videos based on a video snippet query. This technique can be used on the one hand to compare similar situations, and on the other hand to automatically assign semantic tags based on the annotations of similar sequences in an existing (annotated) video archive. The same authors also propose a method for surgery type classification that automatically differentiates between 8 types of laparoscopic surgery [244]. The method uses RGB and HSV color histograms, SIFT (Scale Invariant Feature Transform) and HOG features together with an SVM classifier and obtains promising results.
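The early histogram-based approach translates into very little code. The following sketch compares two images by their normalized HSV color histograms; the bin counts and file names are illustrative, and the Kullback-Leibler distance mirrors the cited ideas only loosely (Bhattacharyya distance would be another common choice).

```python
import cv2

def hsv_histogram(path, bins=(8, 8, 8)):
    """Normalized 3D HSV color histogram of an image."""
    img = cv2.imread(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, bins,
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

h1 = hsv_histogram("query.png")      # hypothetical query image
h2 = hsv_histogram("candidate.png")  # hypothetical archive image
d = cv2.compareHist(h1, h2, cv2.HISTCMP_KL_DIV)
print("dissimilarity:", d)
```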

4.2.3 Temporal segmentation

While CBIR seems to work quite well for diagnostic endoscopy, it is not well studied for surgical endoscopy types like laparoscopy. In this case, the “query by example” paradigm is not very expedient. It is based on the naive assumption that visual similarity correlates with semantic similarity. However, this assumption does not necessarily hold, because the semantics of a laparoscopic image or video sequence depend on a very complex context that cannot be thoroughly represented with simple low-level features like color, texture and shape. This discrepancy between low-level representation and high-level semantics is referred to as the semantic gap. The key to closing this gap is to additionally take into account the dynamic aspects of videos that images do not have. One of the key techniques for general video retrieval is temporal segmentation, i.e., the subdivision of a video into shots and semantic scenes. This abstraction can greatly support understanding a video and, e.g., finding the position of a certain surgical step. Unfortunately, established generic techniques cannot be applied because they typically assume that a video is composed of shots. Endoscopic videos usually have exactly one shot and hence only one scene according to commonly accepted definitions. Therefore, conventional shot detection cannot be used, and a new definition of shots and scenes is necessary for endoscopic videos. Some authors have tried to introduce such a domain-specific notion of shots. Primus et al. [182] define “shot-like” segments based on significant motion changes. Their idea is to differentiate between typical motion types, namely no motion, camera motion and instrument motion. This approach produces a very fine-grained segmentation that can be used as the basis for a more coarse-grained semantic segmentation. Cao et al. [34] define operation shots in colonoscopy videos. They are detected by the presence of instruments, which are only used in special situations in this diagnostic endoscopy type.
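A crude sketch in the spirit of such motion-based segmentation (not the exact method of [182]) could threshold the mean dense optical-flow magnitude to separate static segments from segments with significant motion; the video file and threshold are illustrative.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("surgery.mp4")  # hypothetical recording
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

boundaries, idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    idx += 1
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense optical flow between consecutive frames (Farneback method).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2).mean()
    # A jump in global motion marks a candidate "shot-like" boundary.
    if magnitude > 5.0:
        boundaries.append(idx)
    prev_gray = gray
```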

A more promising approach for endoscopic video segmentation is to define a model of the progress of the procedure and associate video segments with the individual phases of this model. This idea is closely related to surgical workflow understanding as discussed above (see Section 3.2.2). However, here the purpose is not to provide context-aware assistance, but to structure a video file to facilitate efficient review of procedures and retrieval of specific scenes. The fact that complete information about the whole procedure is available at processing time makes this task easier as compared to the intra-operative real-time scenario. For diagnostic procedures like colonoscopy, where the endoscope has to follow a predetermined path through several anatomic regions, it is straightforward to consider these regions as phases. Cao et al. [35] observed that transitions from one section of the colon to the next feature a certain pattern of sharp and blurry frames. This is because the physician has to steer the endoscope around anatomic “corners”. This pattern can be exploited to segment a video into semantic phases. Similarly, WCE videos can also be segmented into subvideos showing specific topographic areas like esophagus, stomach, small intestine and large intestine [46, 114, 134, 206].

In the case of laparoscopy, the presence of certain instruments can be used to distinguish between surgical phases [184]. This approach works very well for standardized procedures like cholecystectomy. For more individualized procedures, additional cues need to be incorporated. Some alternative approaches, which are not based on surgical process modeling, have been proposed in the literature, e.g., probabilistic tissue tracking to generate motion patterns that are used to identify surgical episodes [65], smoke detection as an indication of electrocautery based on ad hoc kinematic features from optical flow analysis of a grid of particles [129] and an unsupervised approach that extracts and analyzes multiple time-series data sets based on instrument occurrence and motion [97]. Another interesting contribution [98] does not incorporate temporal information, but classifies individual frames according to their surgical phase. The features are generated automatically by genetic programming in order to avoid having to choose the right features a priori. The drawback of this approach is the long processing time for feature evolution.

4.2.4 Summarization

Endoscopic videos often have a duration of several hours, but surgeons usually do not have the time to review the whole footage. Therefore, summarization of endoscopic videos is a crucial feature of a video management system. Dynamic summaries can be used to reduce the duration by determining the most relevant parts of the video. Static summaries are useful to visualize the essence of an endoscopic procedure at a glance and can easily be archived in the patient’s medical record or even be passed on to the patient. Many general approaches for summarization of videos have been proposed in the literature, but only a few consider the specific domain characteristics of endoscopic videos.

Lux et al. [132] present a static summarization algorithm for the domain of arthroscopy (an endoscopy type that is hardly ever addressed in the literature). It generates a single result image composed of a predefined number of most representative frames. Representativeness is determined by k-medoid clustering of color and texture features. In this context, such frames are often referred to as keyframes. Another method for keyframe extraction in endoscopic videos, which can also be used as the basis for temporal segmentation, is presented in [204]. It uses the ORB (Oriented FAST and Rotated BRIEF) keypoint descriptor [198] and an adaptive threshold to detect significant differences between two frames. Lokoč et al. [128] present an interactive tool for dynamic browsing of such keyframes. The keyframes are clustered and presented in a hierarchical manner to give a quick overview of a procedure. A similar interactive presentation of keyframes from hysteroscopy videos in the form of a video segment tree is proposed in [203] and further refined in [62], together with a summarization technique that estimates the clinical relevance of segments based on the attention attracted during video acquisition. These interactive techniques build a bridge between video summarization and temporal segmentation. However, video summaries do not have to be static, but can also consist of a shortened version of the original video, as proposed for the domain of bronchoscopy in [117]. This is achieved by discarding a large number of completely non-informative frames and keeping frames that are representative or clinically especially relevant (e.g., showing the branching of airways or pathological lesions).
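The ORB-based idea can be sketched roughly as follows: descriptors are extracted per frame and matched against the last keyframe, and a new keyframe is declared when too few matches survive. Note that the fixed match threshold below is a simplification of the adaptive threshold in [204], and the file name is hypothetical.

```python
import cv2

orb = cv2.ORB_create(nfeatures=500)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

cap = cv2.VideoCapture("procedure.mp4")  # hypothetical recording
keyframes, key_des, idx = [], None, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, des = orb.detectAndCompute(gray, None)
    if des is not None:
        if key_des is None or len(matcher.match(key_des, des)) < 50:
            # Too little overlap with the last keyframe: significant change.
            keyframes.append(idx)
            key_des = des
    idx += 1
```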

Many WCE-specific techniques can also be considered a kind of summarization [40, 87, 243]. The task is to reduce a very large collection of images (with some temporal relationship, but less than in a typical video because the frame rate is much lower) to the ones that are diagnostically relevant. This could equally be seen as a pre-processing or filtering step. The WCE scenario differs from the usual post-procedural review scenario because it is not an optional additional task where the physician can watch the video material again, but the mandatory core step of the screening. Thus, the optimization of time efficiency is especially important here. A recent survey of various image analysis techniques for WCE can be found in [92].

Panoramic images of the surgical site can also be seen as a special kind of summary that gives a visual overview, especially of examinations. Techniques for panorama generation are sometimes also called mosaicing or stitching algorithms. If a panorama is generated during a procedure, the term dynamic view expansion is often used (see Section 3.2.1). The basic idea is to combine different frames of the video, recorded from different perspectives, into one image that extends the restricted field of view. Image stitching is closely related to image registration (see Section 3.2.5). The underlying challenge is to find corresponding points and compute the transformation between the two frames, in order to convert them to a common coordinate system. Finally, the registered images have to be blended together to create the panorama.
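OpenCV bundles exactly this pipeline (correspondence finding, transformation estimation and blending) in its high-level stitching module. The following sketch stitches a handful of hypothetical extracted frames; real endoscopic material usually requires more specialized methods than this generic stitcher.

```python
import cv2

# Hypothetical frames extracted from an examination video,
# each showing a partly overlapping view of the mucosa.
frames = [cv2.imread(f"frame_{i}.png") for i in range(5)]

# The Stitcher encapsulates correspondence finding, transformation
# estimation and blending in a single call; SCANS mode assumes mostly
# planar motion, which is a simplification for endoscopic scenes.
stitcher = cv2.Stitcher_create(cv2.Stitcher_SCANS)
status, panorama = stitcher.stitch(frames)
if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.png", panorama)
```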

Several authors have addressed panoramas in the context of cystoscopy, i.e., the examination of the interior of the bladder [78, 144, 212]. The bladder can be modeled as a sphere, so geometric constraints can be imposed. Behrens et al. [17] present a graph-based method to identify regions missed during a cystoscopy, which would lead to gaps in a panoramic image, in order to assess the completeness of the examination. Spyrou et al. [215] propose a stitching technique for WCE frames. In the context of fetoscopy (examination of the interior of the uterus during pregnancy), [190] propose a method to create a panorama of the placenta to support intrauterine fetal surgery. Liao et al. [120] extend this idea with a method that maps the panorama to a three-dimensional ultrasound image. A review of recent advances in the field of endoscopic image mosaicing and panorama generation can be found in [19].

5 Conclusion

The main goal of this extensive literature review was to give a broad overview of research that deals with the processing and analysis of endoscopic videos. A further goal was to draw attention to this research field in the Multimedia community. We hope to stimulate further research, especially in terms of post-processing, which is probably the most relevant topic with regard to common Multimedia methods and offers a broad range of open research questions. Moreover, we give insights into domain-specific characteristics of the endoscopy domain and how to deal with them in a pre-processing phase (e.g., lens distortion, specular reflections, circular content area, etc.).

In the literature research, numerous contributions were found and classified into three categories: (1) pre-processing methods, (2) real-time support at procedure time and (3) post-procedural applications. However, many methods and approaches were found to be relevant for multiple use cases, e.g., instrument detection and tracking methods developed for robotic assistance that can also be helpful for post-procedural video indexing. Currently, the respective research communities are often not aware that complementary contributions exist in seemingly unrelated research fields.

The domain-specific peculiarities of the endoscopic video domain require specific pre-processing methods like distortion correction or specular reflection detection. Moreover, pre-processing is often used to enhance the image quality or filter relevant content, both temporally and spatially. These enhancements are important both for improving the viewing conditions for physicians and as a pre-processing step for advanced analysis methods. In this context, it is also important to distinguish between different types of endoscopy, mainly between diagnostic (examinations) and therapeutic (surgeries) types, but also between the subtypes that often have very heterogeneous domain characteristics. Most approaches focus on one specific endoscopy type and cannot be transferred to other types without significant modifications, i.e., they are strongly domain-specific.

Furthermore, we have to distinguish between methods that are applied during the procedure and methods that operate on recorded videos. The former must work in real-time and can only use the information available up to the current moment, i.e., they cannot “read into the future”. The latter have this possibility because the entire video is available at analysis time. In terms of diagnostic endoscopy types, the focus is on pattern recognition for Diagnostic Decision Support, mainly in the form of polyp/lesion/tumor detection in colonoscopic screenings. In the special domain of WCE (Wireless Capsule Endoscopy), numerous approaches have been proposed to differentiate between diagnostically relevant and irrelevant content in order to increase time efficiency without impairing diagnostic accuracy.

The largest share of the publications found originates in the robotics community and aims to enable real-time assistance during surgeries by (1) Augmented Reality and (2) (semi-)autonomous actions carried out by surgical robots. These visionary goals require a number of classical Computer Vision techniques like 3D reconstruction, object detection and tracking, registration etc. Existing methods fail in most cases because of the special characteristics of endoscopic images and aggravating factors like the restricted and distorted viewport, scarce camera motion, specular reflections, occlusions, image quality perturbations, textureless surfaces, tissue deformation etc., and therefore have to be adapted. Many of these techniques could also be useful for post-procedural analysis of videos for more efficient management, retrieval and interactive retrospective review.

The post-procedural use case turned out to be extremely understudied in comparison to the real-time scenario. However, it involves a number of very interesting and challenging research questions, including indexing, retrieval, segmentation, summarization and interaction.

One of the reasons might be that recording of endoscopic procedures is only now becoming general practice and is not yet widespread. This can also be explained by the lack of efficient compression and archiving concepts for comprehensive recording. As a consequence, the acquisition of appropriate data sets is the first challenge for new researchers in this field. Data availability in general is a critical issue since sensitive medical data is involved and hospital policies regarding video recording and transfer are often very strict. Only very few public data sets are available, and they usually target a very specific application (like tumor recognition). In order to draw meaningful comparisons between alternative approaches, it is very important for the future to create and publish more and larger public data sets. The size of a data set is especially important for machine learning methods that require large amounts of training samples. The shortage of a sufficiently large and representative training corpus often hampers the usage of popular techniques like deep learning with Convolutional Neural Networks (CNN) [104].

An additional challenge is to find medical experts who are willing to take the time to cooperate and share their medical expertise, which is an absolutely essential ingredient of successful research in this extremely specific domain. As an example, ground-truth labels need to be attached to training examples, a task that cannot be carried out by medical laymen. Moreover, at first glance, it seems that the demand by physicians is limited because they might see post-procedural usage of endoscopic videos as additional workload, although in fact it has a huge potential for quality improvement. Nevertheless, it is a tough job to familiarize physicians with the benefits of post-procedural usage of videos and the need for research in this area. A survey conducted in [158] showed that physicians often do not have a clear notion of the potential benefits until they actually see them in action. After watching a prototype demonstration of a content-based endoscopic video management system, they stated a significantly higher interest in such a system than before.

However, it should not be expected that all problems in the post-operative handling of endoscopic videos can be solved by automatic analysis. An extremely important aspect is to combine these methods with easily understandable visualization concepts and intuitive interaction mechanisms for efficient content browsing and exploration. Especially the latter aspect has rarely been addressed in the literature yet and therefore provides a huge potential for future work.

In the foreseeable future, we expect that video documentation of endoscopic procedures will become legally required. This will lead to huge archives of visually very similar video content, and techniques for video storage, retrieval and interaction will become essential. By that point, research should already have appropriate solutions for these problems.