1 Introduction

Minimally invasive surgery (MIS) methods benefit the patients' well-being, as they aim at reducing wound healing time, associated pain, and the risk of infections. These methods have been enabled by the advance of several medical and technical developments, such as electro-surgery, precision instruments, and imaging technology. Via a small incision, a camera is introduced into the human body, enabling surgeons to perform the MIS. During the MIS the surgeon is able to record both video and image data. The recorded data can later be used for documentation purposes, medical research, the education of young surgeons, and the improvement of surgery techniques. Evidently, huge amounts of data are produced in MIS contexts. Considering that full surgeries consist of hours of routine work and only minutes of medically relevant events [22], surgeons do not have the resources to cope with these huge amounts of data. For documentation purposes, surgeons use recorded images - they thus play a central role in the quick assessment of a medical case. There are cases, such as the education of young surgeons, where a still image does not contain a sufficient amount of information, as the temporal context cannot be mapped onto a still image. In all of these cases, the surgeons need to reflect on a video sequence.

The main problem in getting these sequences is that manual browsing in such endoscopic multimedia databases (EMDBs) is a tedious and time-consuming task. We assume that the automatic creation of bookmarks - linking the externally captured images to the correct videos and playback time stamps (PTS) - saves the surgeons' time at tasks such as education or documentation, thus enabling them to spend more time on tasks such as medical research. It is important to note that in practice there is no trivial way to interlink captured images and recorded videos (as also described in related work [3, 7, 27]). This problem originates from the use of different systems with distinct encoders: images and video frames suffer from different encoding errors. Moreover, image and video encoding run on different systems with diverging time stamps, and clock synchronization is hardly feasible in practice. Hence, current systems do not provide the functionality to automatically link the external images to the appropriate video positions, which would enable navigation in the video using images of interest. On the other hand, surgeons are not able to manually edit and maintain image-to-video links themselves, as this task is a tedious process wasting the already short time resources: a single full surgery can easily last four hours or more. So, if we consider a surgeon performing a single surgery per day, roughly 1,000 hours of video material accumulate per year and surgeon.

An example use-case is illustrated in Fig. 1. Assume a surgeon wants to revisit a particular scene of a surgery for presentation at a medical congress. Currently, the surgeon has to search for the specific video scene manually within the EMDB. Doing this manually involves a two-step procedure: (1) identifying the correct video and (2) navigating to the desired playback time. Together with the considerations above, this example illustrates that there is a need for an automatic way to retrieve the correct video and PTS. We consider this as a query-by-example problem in the field of image/video retrieval and evaluate several image descriptors for their performance in terms of both retrieval efficiency and computational resource requirements. As already mentioned above, images showing the most important information about particular important scenes are available for documentation purposes. We aim at retrieving the desired information from such a representational image, i.e. we query the EMDB with an image in order to retrieve a video id and PTS. From a practical point of view, our work is driven by the following problems:

  1. P1

    Given a query image, what is the probability to retrieve the correct video?

  2. P2

    Given the video is retrieved correctly, how close is the predicted PTS to the true PTS?

In order to cope with these problems, we propose a binarization scheme for Convolutional Neural Network (CNN) features off-the-shelf, reducing the storage space cost while still maintaining a high predictive performance. As baselines for comparison, we evaluate various approaches which are inspired by different points of view. From an encoding perspective, we exploit the fact that image and video are encoded from the same raw image stream. We assume that the query image and the video frame of interest differ only by (a) compression loss, or (b) a temporal offset between the image and two adjacent video frames that is so small that there is no significant visual difference. Hence, we use the full-reference visual quality metric PSNR to compare the query image to each video frame. The same considerations lead to the use of SSIM [35] as similarity measure for the video. From a visual similarity perspective, we model the visual similarity between image and video frames. Using image descriptors, we are able to ignore compression artifacts and map the video frames and query images to a feature space. Of course, the main advantage of such approaches is the computational efficiency: we have to calculate the descriptors for the EMDB only once, followed by a computationally cheap distance calculation - while on the other hand, we have to calculate PSNR or SSIM against every frame in the EMDB. As image descriptors, we evaluate CEDD [9] and feature signatures, since these descriptors have shown good performance in related work [3, 7, 27]. Recent literature also shows that CNN features provide an astounding baseline for visual recognition [26]. We include features from the top layers of the network, Neural Codes (NC) [2] as well as Histograms of Class Confidences (HoCC), as baselines for our comparative evaluation.

Fig. 1

Linking externally recorded endoscopic images to videos and video playback time stamps within an endoscopic multimedia database. The problem is to assign a video id and playback time stamp to a query image which is assumed to exist as near-duplicate as a video frame

Furthermore, we assume a very practical situation where videos are prepared for adaptive streaming (cf. [30]). We therefore split the videos of the EMDB into segments of 1s, 2s, 4s, 8s, and 16s duration. The aforementioned approaches are evaluated using only the first frame of each segment as its representative frame, and are compared against a dense sampling approach. In order to answer P1, we evaluate the approaches on a video level as a retrieval problem with a single correct video and measure the video hit rate, i.e. how often the prediction was correct. We answer P2 on a frame-level basis by measuring the distance between the predicted PTS and the true PTS whenever the video was predicted correctly, and calculating an average temporal deviation. This work is novel, as to the best of our knowledge, we are the first to provide such a statistically motivated binarization scheme for CNN features off-the-shelf. Moreover, this work extends the work of Carlos et al. [7], Beecks et al. [3], and Schöffmann et al. [27], which also tackle this problem for the domain of endoscopic videos. In contrast to these works, we do not only evaluate whether the video was hit, but also the deviation from the correct frame position. Therefore, we have to use a different dataset. Furthermore, our evaluation is more rigorous, as we also thoroughly investigate dense sampling (considering each video frame) against sparse sampling (considering one frame per time interval).

The remainder of the paper is structured as follows: Our work continues with Section 2 giving an overview on related work. In Section 3, we propose our approach of binary CNN features off-the-shelf. Section 4 deals with the methods used in this paper, while the evaluation of these methods is described in Section 5. We conclude and point at future work in Section 6.

2 Related work

Video hyper-linking is the more general task related to our studied problem. It has already been a topic at TRECVID [1] and the MediaEval challenges [12]. However, our problem differs from video hyper-linking dramatically in both input and desired output. Instead of a video sequence, the query is an image known to exist as a near-duplicate frame in a video of the database. Moreover, the desired output is a video id and PTS instead of multiple similar video sequences, and there is only one correct answer to this retrieval problem (the video sequence the image was shot in). Moreover, our use case is restricted to visual media (image and video) only, whereas hyper-linking approaches are often able to fall back on other features such as text or audio; e.g. Cheng et al. [10] study the performance of video hyper-linking methods using multiple input modalities including subtitles, meta data, audio, visual features, and their combinations. Other state-of-the-art approaches in hyper-linking are multi-modal and cross-modal approaches based on machine learning: bimodal LDA, deep auto-encoders, bidirectional deep neural networks, and generative adversarial networks. Deep auto-encoders (as proposed by Vukotic et al. [32]) and bimodal LDA approaches (as proposed by Simon et al. [29]) for video hyper-linking are evaluated by Bois et al. [4] with respect to relevance and diversity. The baseline for this evaluation uses a bag-of-words representation for each segment with tf-idf weighting. Vukotic et al. [33] approach the problem of video hyper-linking with bidirectional deep neural networks. The same authors propose in their further work [34] to use generative adversarial networks. All these learning-based state-of-the-art approaches utilize multi-modal or cross-modal input data, e.g. they make use of video subtitles or automatic audio transcriptions. In contrast to these use cases, in the domain of endoscopic video no such additional input modalities are available. Galuščáková et al. [13] investigate visual descriptors for the task of video hyper-linking within a multi-modal approach, i.e. they use visual (feature signatures, AlexNet fc7 CNN features, concept detection, and face recognition) and text-based (subtitles and automatic transcripts) input modalities.

The problem of image-to-video linking for endoscopic surgeries (strictly speaking: retrieval of a video from an endoscopic video database based on a query image) has already been the topic of recent research [3, 7, 27], which are the works most related to ours. Carlos et al. [7] evaluate retrieval based on global features (the Color and Edge Directivity Descriptor CEDD [9], the auto color correlogram [16], and the pyramid histogram of oriented gradients PHOG [5]), fusion methods by rank and score, as well as a localized version of CEDD using the SIMPLE approach [17], each computed on every fifth video frame. They report that they were able to correctly map 78.3, 78.5, and 79.8% of the queries to the correct video sequence using Sum of Ranks, Sum of Scores, and SIMPLE-CEDD respectively. The work by Beecks et al. [3] used the same dataset and also computed a description for every fifth frame. The authors approached this problem by using feature signatures and also compared various signature matching distances to the baseline of Carlos et al. [7]. Their results imply that feature signatures using a unidirectional approach based on the L1 distance improved the state-of-the-art by approximately 10%. Schöffmann et al. [27] improved on this work and investigated further distance measures, including the signature matching distance (SMD), the signature quadratic form distance (SQFD), and the earth mover's distance (EMD), over various signature sizes. Their results imply that on the used dataset, the correct video segment can be retrieved for more than 88% of the query images.

Ercoli et al. [11] propose a quantization process that is based on the distance of a feature vector to precomputed cluster centroids obtained by k-means. They evaluate the geometric mean, the arithmetic mean, and the nth-nearest distance as threshold measures. Guo and Li [14] use a CNN for supervised learned hashing. They binarize the activation values of a specific fully connected layer with threshold 0. In contrast to this work (which is intended for a slightly different use-case), our approach does not require CNN training. We use a pre-trained CNN model for feature extraction. Moreover, we show that our approach works well with very similar images, i.e. consecutive video frames. Finally, we propose non-zero thresholds for binarization and analyze their impact on retrieval.

3 Binary CNN features off-the-shelf

CNN features off-the-shelf have an excellent record as a simple yet well-performing approach for content-based description of images. As they are real-valued and high-dimensional image descriptors, they have the drawback of consuming a huge amount of storage space. In recent literature, it was proposed to reduce their dimensionality using PCA [26]. The dimensionality reduction could also be achieved using auto-encoders (which has been investigated for global features by our research group [25]). In this work, we propose a method for the binarization of CNN features off-the-shelf which was specifically designed for the task of image-to-video linking. Instead of reducing the feature dimensionality, we aim at binarizing the feature values. We use basic statistical methods in order to encode real-valued CNN features off-the-shelf into binary-valued feature vectors of the same dimension.

3.1 Problem formulation

As already mentioned above, the problem of using CNN descriptors off-the-shelf for image linking is that this approach requires a huge amount of storage space. We aim at reducing this storage space requirement by quantizing real-valued CNN feature vectors into binary vectors. As an example, a single AlexNet FC7 descriptor can be represented by 4096 float values, resulting in a descriptor of 16KB size. A binary version of the AlexNet FC7 descriptor could thus reduce the storage requirements to the equivalent of 128 integer values, resulting in a descriptor size of 512Byte. Hence, our approach aims at solving the following research question:

  1. RQ

    How can we binarize real-valued CNN features while still maintaining high retrieval performance?

We define a CNN feature f as an n-dimensional vector \(\mathbf {f} \in \mathbb {R}^{n}\), where n denotes the number of feature components. A feature set \(F \in \mathbb {R}^{k \times n}\) of k individual feature observations is represented by a k × n matrix. Each row Fi,⋅ represents an individual CNN feature observation and each column F⋅,i represents a feature component. Within the context of this work, the problem of binarizing a feature vector is to encode each feature component individually, depending on its observed value, into a binary value. Hence, our research question aims at finding a pair of intervals I for each feature component mapping the observed value to such a binary value. The problem thus reduces to providing a mapping \(q(f; I) \rightarrow \mathcal {Q}^{n}\) from a CNN feature f to a binarized feature with binary values \(b_{i} \in \mathcal {Q}\), depending on parametrized intervals I. The approach for calculating these intervals is described in the following. Please note that for the purposes of our work, we set \(\mathcal {Q} := \{0,1\}\), as we consider a binarization scheme. We want to emphasize that one could use a general quantization scheme by using a set \(\mathcal {Q}\) consisting of more than two different values. The benefit of choosing a binarization scheme over a quantization with an arbitrary number of values is that the resulting binary feature vectors can be stored in a very compact representation requiring only n bit of storage, where n denotes the number of dimensions of the real-valued CNN feature to be binarized. Moreover, these binary feature vectors can be compared efficiently using computationally inexpensive distance metrics such as the Hamming distance or the Jaccard distance, which only require bit-wise and bit-counting operations.

3.2 Binarization for CNN features off-the-shelf

We assume that for each individual feature component, the observations F⋅,i (for the i-th feature component) follow a random distribution drawn from an unknown population Xi. Furthermore, we assume that the observations for an individual feature component are independently and identically distributed. We do not impose additional assumptions on the relations between individual feature components. The idea behind the binarization process is to assign the real-valued feature observation to one of two categories. We say that an observation for a feature component is higher than the norm when its observed value exceeds the expected value for the respective feature component. Otherwise, the observation is lower than the norm. Based on an available feature set F and the law of large numbers, we estimate the expected value of feature component j using the arithmetic mean \(\mu _{j} = \frac {1}{k} {\sum }_{i = 1}^{k} F_{i, j}\). Using the calculated intervals \((-\infty , \mu _{j}]\) and \((\mu _{j}, \infty )\), we are able to transform a feature f into a binary feature representation. We call a value μj which defines our two intervals of interest a binarization threshold.

Algorithm 1 shows the straightforward binarization algorithm for an EMDB. Given is a real-valued feature set F with k observations of n real-valued feature components. The output is a binary feature set FQ consisting of k binary feature vectors of dimension n. Please note that if we do not want to restrict ourselves to a binary feature vector, we could use empirical quantiles in order to define the intervals for quantization. Furthermore, the algorithm produces transformation thresholds T which are used to transform further features (e.g. new video frames or query images). For each feature component i, we estimate the corresponding arithmetic mean μi. For each observation j, we now set the binary feature values according to these mean values: if a feature value is smaller than or equal to the estimated arithmetic mean, we clear the bit at position i; otherwise, if the feature value is greater than the estimated arithmetic mean, we set the bit at position i. Whenever new videos are added to the EMDB or additional query images have to be transformed, we use the calculated transformation thresholds T to repeat the binarization for these descriptors. The transformation thresholds remain static until the EMDB has grown significantly. In this case, we suggest applying Algorithm 1 to the new, much larger EMDB. Please note that we use the same transformation parameters for images and videos, as they were captured from the same device and only differ in compression.

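The following is a minimal NumPy sketch of the procedure described above; function and variable names are illustrative and not taken from our implementation.

```python
import numpy as np

def binarize_emdb(F):
    """Binarize a feature set F (k x n matrix of real-valued CNN features).

    Returns the binary feature set F_Q and the per-component binarization
    thresholds T (arithmetic means), which are reused later to transform
    new video frames or query images.
    """
    F = np.asarray(F, dtype=np.float32)
    T = F.mean(axis=0)                # one threshold (mean) per feature component
    F_Q = (F > T).astype(np.uint8)    # bit is set iff the value exceeds the mean
    return F_Q, T

def binarize_feature(f, T):
    """Transform a single new feature vector with precomputed thresholds T."""
    return (np.asarray(f) > T).astype(np.uint8)
```

In practice, the resulting binary vectors would additionally be packed into bit arrays (e.g. with np.packbits) to obtain the n-bit representation discussed in Section 3.1.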

Example

Assume a feature set consisting of 2-dimensional features. For the feature components F⋅,1 and F⋅,2, we estimate the arithmetic means on a big multimedia database: μ1 = .15 and μ2 = .25. Assume we now want to binarize the first feature observation F1,⋅ = (f1,1,f1,2) = (.14,.39)T. The result is a binary vector of length 2: b := (b1,b2)T. The first feature component f1,1 is smaller than μ1, thus we set b1 to zero. The second feature component f1,2 is greater than μ2, thus we set b2 to one. The resulting binary feature vector thus is Q(F1,⋅) = (0,1)T.

3.3 Usage of other binarization thresholds

Please note that the arithmetic mean is not the only possible choice for the binarization threshold. From our point of view, there are two more options which may be considered: the median value and a constant value of zero. Before we delve deeper into these options, we first analyze CNN features off-the-shelf. As mentioned above, these off-the-shelf descriptors are real-valued vectors with n feature components. Moreover, we observe that the values are sparse, i.e. most values are zero. This originates mostly from the fact that we extract CNN features after a ReLU layer (which we do throughout this manuscript), which has non-negative output. In our opinion, this sparsity is the key to these features achieving great retrieval performance. A proposed binarization scheme should now preserve this sparsity in order to maintain as much retrieval performance as possible. The logical conclusion is to set a constant threshold of 0 for each feature component. This means we set all values which are greater than zero to 1 and every other value to 0. This approach is based on the idea that whenever a component of an image's feature vector is greater than zero, it is highly descriptive for this image. However, the drawback arises that small deviations from zero in a single feature component might instantly flip the bit of the binary feature vector at this component. While small deviations do not add much to a distance metric for real-valued CNN features, such as the Euclidean metric, a flip of a bit in a binary CNN descriptor of length n means a metric change of 1 within a possible interval of integers between 0 and n. Additionally considering that within our preliminary experiments, binary CNN descriptors (with arithmetic mean thresholds) of two random frames have an average distance of around 850, such a flip (which may be caused by a very small change of the CNN descriptor value) has a higher impact on the distance of two frames for binary CNN features than for real-valued features. This fact is relevant for consecutive frames, when a feature component rises from zero. For larger sampling intervals we aim to stall this flip in the binary feature vector, as the descriptor thus remains similar to the previous frames over more consecutive frames. Hence, the need for non-zero thresholds arises. Furthermore, a separate threshold per individual feature component should be used, as individual feature components may behave differently. We stated earlier that the two other candidates for binarization thresholds are the median value and the arithmetic mean. An analysis of the medians of each feature component shows that approximately 90% of the medians are zero. Thus, binarization with the non-zero medians makes the feature vector even sparser. However, median calculation is computationally expensive, which is contrary to the aim of low resource usage with our proposed binary features. In contrast, the computation of the arithmetic mean is feasible using a one-pass algorithm. This has the advantage that not all CNN features of the entire EMDB have to be stored at the same time, whereas for the median we would need to hold all CNN descriptors in memory during its computation. Furthermore, the arithmetic means of the feature components will be non-zero and greater than or equal to the medians, as we have non-negative values only. This leads to even sparser feature vectors. Hence, from a theoretical point of view, the additional sparsity in combination with the computational efficiency (compared to median values) justifies our choice of the arithmetic mean over medians and a global zero threshold. Preliminary experiments have shown that the arithmetic mean cannot be outperformed by the other two choices of binarization thresholds. Hence, we evaluate our proposed approach with arithmetic means only.
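The one-pass mean computation alluded to above can be sketched as follows; the incremental update avoids holding all CNN descriptors of the EMDB in memory at once (the class name is illustrative).

```python
import numpy as np

class RunningMean:
    """Incrementally estimates per-component arithmetic means of CNN features."""

    def __init__(self, n_dims):
        self.mean = np.zeros(n_dims, dtype=np.float64)
        self.count = 0

    def update(self, feature):
        # Incremental mean update: mean += (x - mean) / count
        self.count += 1
        self.mean += (np.asarray(feature, dtype=np.float64) - self.mean) / self.count

# Usage sketch: thresholds = RunningMean(4096)
# for f in stream_of_fc6_features: thresholds.update(f)
```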

4 Methodology

This section presents the used dataset, which has been generated in cooperation with medical experts at the regional hospital (LKH) Villach, and gives an overview on the methods used to link images to videos and PTS in the EMDB.

4.1 Ground truth generation

Our medical partners have provided 221 short video sequences and 115 images of gynecologic interventions. The images were selected for medical considerations, e.g. showing pathology. The video sequences and images were produced before our cooperation on this linking use-case; hence, no selection bias is introduced. The video sequences feature a spatial resolution of 1920 × 1080 pixel at 25 frames per second and are H.264 encoded. On average, a video is 1m28s long with a standard deviation of 1m12s; the median video length is 1m9s. In total, the video sequences feature a length of approximately 5h27m (490720 frames in total), whereas the individual video length ranges from 1s to 4m11s. For some images there exists no corresponding video, because the scene the image was taken in has not been recorded. We exclude these images, as well as images that have been recorded out-of-patient (for privacy considerations), from the final evaluation dataset. Examples of the query images are given in Fig. 2, showing an image during a laser treatment and two images of pathology examination at the beginning of the surgery (situs) and during the surgery. The laser treatment image is considered easier to link correctly, as a different optic is inserted into the endoscope for the application of the laser and the image is darker in general; often it also features a black circle while other scenes do not. The situs images are most relevant as they allow for an examination of the patient at the beginning and end of the surgery. We assess their difficulty as medium. Images taken during the surgery are considered difficult to link, as they have the most candidate frames. For annotation, we calculate the frame-per-frame PSNR of video frames against the images serving as references. For each image, we extract the 3 top candidates and manually compare these candidates to the reference image. This procedure allows for frame-accurate linking of the images to the video and has the advantage that it is executable by a non-expert user. Using this procedure, we have annotated 69 images together with their true PTS from 38 different videos. The dataset comprises 64 annotations matching a video frame exactly and 5 annotations of images which have been shot 'between' two video frames. The correctness of these 5 frames has been inferred retrospectively on the basis of instrument positioning, tissue, and reflections. Our explanation for the latter is as follows: the videos are stored at 25 frames per second, while the recording camera (which is also responsible for recording the images) has an output of 50 frames per second. We therefore assume that frames are captured which are not in the video but lie temporally 'between' two video frames. The dataset is a little smaller in total playback duration compared to former works in this field and much smaller than the amount of data which would pile up in a real-world application, considering that a single intervention may easily last hours. Nevertheless, our dataset is representative. The benefit of the smaller dataset is that its size allows for evaluating a dense sampling method, i.e. a search over every video frame, and comparing its performance against sparse sampling of frames, as well as full-scale PSNR comparison, in reasonable time.

Fig. 2

Examples of query images: the leftmost image shows laser treatment, while the other three images show pathology examination at the beginning of and during the surgery

4.2 Encoding perspective approaches

During endoscopic surgery, raw image data is delivered to a so-called frame-grabber. All video data is shown to the surgeons in real time on a monitor in the operating room. Additionally, the surgeon is able to control the frame grabber in order to record and store video and image files. We therefore model the video frame fV and image fI as the base image f distorted by an additive compression error εV and εI respectively. The different compression errors are introduced by different compression methods for video and image compression. PSNR and SSIM are full-reference image quality metrics, aiming at modeling the distortion of frames suffering from such compression errors. While PSNR is based on a per-pixel mean square error on a logarithmic scale, SSIM exploits image structure [35]. For two m × n images x and y we compute PSNR as follows:

$$PSNR(x, y) = 10 \, \log_{10}\left( \frac{255^{2}}{\frac{1}{m \cdot n} {\sum}_{i = 0}^{m-1} {\sum}_{j = 0}^{n-1} |x(i,j)-y(i,j)|^{2}} \right) $$

SSIM for two images x and y is computed as follows:

$$SSIM(x,y) = \frac{(2\mu_{x}\mu_{y}+c_{1})(2\sigma_{x,y}+c_{2})}{\left( {{\mu}_{x}^{2}}+{{\mu}_{y}^{2}}+c_{1}\right)\left( {{\sigma}_{x}^{2}}+{{\sigma}_{y}^{2}}+c_{2}\right)} $$

where μ. denotes the average value, \({\sigma }_{.}^{2}\) the variance, and \(\sigma _{.,.}\) the covariance. We assume PSNR and SSIM to model compression errors in a monotonic way: the more the distortion, the worse the metric value. For our case of two distorted frames, we conclude that the near-duplicate images fI and fV yield a better metric value than dissimilar video frames. Thus, we calculate the metric value between the query image and each frame in the database and select the frame with the best metric value as result. We expect these methods to provide good results and act as a baseline for further experiments. Nevertheless, these methods do not scale with resolution and are computationally very expensive, as every query requires a metric calculation for each frame in the EMDB.
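To make this brute-force retrieval concrete, the following sketch compares a query image against every frame of every video via PSNR and returns the best-matching video id and PTS. It is a simplified illustration assuming decoded 8-bit frames of equal size; decoding and I/O details are omitted.

```python
import numpy as np

def psnr(x, y):
    """PSNR between two equally sized 8-bit images (higher = more similar)."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def link_by_psnr(query, videos, fps=25):
    """videos: dict mapping video id -> iterable of frames (numpy arrays)."""
    best = (None, None, -np.inf)               # (video id, PTS in seconds, PSNR)
    for vid, frames in videos.items():
        for idx, frame in enumerate(frames):
            score = psnr(query, frame)
            if score > best[2]:
                best = (vid, idx / fps, score)
    return best[0], best[1]
```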

4.3 Visual similarity perspective approaches

These approaches leverage the visual similarity of the images in order to link a query image to a specific video frame in the EMDB. As baseline, we use the global image feature CEDD [9]. CEDD uses a color unit for extracting color information and a texture unit for texture information. The image is separated into 1600 image blocks. Each of these blocks is classified into one of six texture bins by the texture unit. The color unit provides a fuzzy classification of each block into 24 bins. Finally, the results for all blocks are aggregated into a 6 * 24 = 144 bin histogram. This histogram is quantized into three bits per bin, resulting in a descriptor of 432 bit size. For retrieval of the correct video frame with CEDD, we calculate the Tanimoto distance between two descriptors (as suggested for CEDD) and retrieve the frame with minimal distance to the query image. For metric and distance calculation, we use the LIRE [21] library.
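For reference, the Tanimoto (generalized Jaccard) distance over quantized CEDD histograms can be sketched as follows; this is one common formulation, and LIRE's implementation may differ in details such as scaling.

```python
import numpy as np

def tanimoto_distance(a, b):
    """Tanimoto distance between two CEDD histograms (144 bins, 3-bit values)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    dot = np.dot(a, b)
    denom = np.dot(a, a) + np.dot(b, b) - dot
    if denom == 0:               # both histograms are empty
        return 0.0
    return 1.0 - dot / denom     # 0 = identical, 1 = maximally dissimilar
```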

As a second approach, we use feature signatures. Feature signatures are obtained by clustering image features, such as color, position, or texture. The cluster representatives and weights are then stored and used as the descriptor. We calculate feature signatures using OpenCV [23] with default parameters. As these signatures do not necessarily have the same dimension, we have to use a special distance function for the comparison. In particular, we use the SQFD (signature quadratic form distance; as investigated in [20]).
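The SQFD between two signatures of possibly different sizes can be sketched as follows; the Gaussian similarity function and its parameter alpha are illustrative choices, since the concrete kernel is a parameter of the method investigated in [20].

```python
import numpy as np

def sqfd(reps1, w1, reps2, w2, alpha=1.0):
    """Signature quadratic form distance between two feature signatures.

    reps1, reps2: cluster representatives (n1 x d and n2 x d arrays)
    w1, w2:       corresponding cluster weights
    """
    reps = np.vstack([np.asarray(reps1), np.asarray(reps2)])
    w = np.concatenate([np.asarray(w1), -np.asarray(w2)])   # weights of S2 negated
    d2 = np.sum((reps[:, None, :] - reps[None, :, :]) ** 2, axis=-1)
    A = np.exp(-alpha * d2)                                  # Gaussian similarity matrix
    return float(np.sqrt(max(w @ A @ w, 0.0)))
```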

We also investigate deep learning methods based on CNN features off-the-shelf, as well as binary CNN features as described in Section 3. For these CNN-based approaches, we use the CNN models AlexNet [19] and GoogLeNet [31] initialized with weights from the Caffe Model Zoo [6] (berkeley-trained). We choose these two network architectures, as we already know from our previous work [24] that they perform well in the domain of endoscopy. Furthermore, such computationally inexpensive networks have the benefit that processing is finished shortly after the end of the surgery. For a real-life scenario, this means a surgeon performing an intervention before noon is able to use CNN feature-based methods in the afternoon to write a case documentation. Furthermore, we use the ResNet [15] models (ResNet-50, ResNet-101, and ResNet-152) initialized with the provided weights in order to get a comparison with models that perform even better on ImageNet.

The approach CNN features off-the-shelf uses features extracted from the aforementioned CNNs. Specifically, we use the results of the CNN output layer in order to get (probability) Histograms of Class Confidences (HoCC). From deep, fully-connected layers near the top of the CNN architecture, we extract the neurons' activations. We henceforth denote these activation values as Neural Codes (NC) (cf. [2, 8]). We use these uncompressed CNN features off-the-shelf, HoCC and NC, as image descriptors. We compare these feature vectors using three different distance functions: max, Euclidean, and Manhattan distance. For CNN feature extraction, we use the Caffe framework [18]. In particular, we extract these CNN features from the last three layers of the AlexNet architecture: prob for HoCC, and fc7 and fc6 for NC. We use the extracted descriptors for NC and HoCC in an uncompressed way. They feature a dimension of 4096 (fc6) and 4096 (fc7). The HoCC approach features a dimensionality of 1000, as the models were trained for ImageNet classification into 1000 different classes. As the network has an input size of 224 × 224 pixel, we resize the input images such that the smaller side has a length of 224 pixel. We then use the center crop as input for the CNN. This is reasonable in the endoscopic domain, as the camera always focuses on the most important aspects, which are thus centered in the image; moreover, results from [8] indicate that this method is superior to padding and naive scaling. We follow the same approach for GoogLeNet, but use the 1024-dimensional bottleneck features extracted from layer pool5/7x7_s1 as NC. For the ResNet models, we use the 2048-dimensional bottleneck features extracted from layer pool5 as NC.
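The preprocessing described above (scale the shorter side to 224 pixels, then take the center crop) could look like the following sketch; the use of Pillow is illustrative, as our pipeline is based on Caffe.

```python
from PIL import Image

def center_crop_224(path, target=224):
    """Resize the shorter image side to `target` pixels and return the center crop."""
    img = Image.open(path).convert('RGB')
    w, h = img.size
    scale = target / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - target) // 2, (h - target) // 2
    return img.crop((left, top, left + target, top + target))
```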

For the last approach, we transform uncompressed CNN features off-the-shelf into binary CNN features using Algorithm 1. The envisioned binarization process is illustrated in Fig. 3 and consists of two steps: CNN feature extraction and binarization.

Fig. 3

Binarization of CNN features: extraction of real-valued CNN features and binarization using pre-calculated binarization thresholds

We found in our preliminary studies that various distance metrics for binary vectors (i.e. Hamming, Dice, and Jaccard) perform equally well in the case of binary CNN features. Hence, for the purposes of this work, we present the evaluation of the Hamming distance only.

4.4 Evaluation methodology

We propose to evaluate the performance of the individual approaches along two different dimensions relevant to the task of image-to-video linking: the video hit rate vHR as well as the average distance to the correct PTS dPTS. In order to thoroughly explain the performance metrics, we formulate the goal behind the use case: provide a function l that links a query image q ∈ I to a single video v ∈ V and a PTS \(p\in \mathbb {R}\) within an EMDB with image set I and video set V, i.e. l(q) = (v,p). For evaluation, we have a ground truth g(q) available, which was generated as described in Section 4.1. We define the set of correctly classified images:

$$C := \{ ~ i ~ | ~ i\in I \land g(i) = (v, p_{1}) \land l(i) = (v, p_{2}) \} $$

We define the video hit rate (or recall) as the number of images which were linked to the same video as in the ground truth, relative to the number of all query images:

$$v_{HR} := \frac{ | C | }{| I |} $$

Based on the same notation, we calculate the average distance to the correct PTS as follows. We define the set of PTS deviations D as:

$$D := \{ |p_{1}-p_{2}| ~|~ i\in C \land l(i) = (v, p_{1}) \land g(i) = (v, p_{2})\} $$

For evaluation of average distance to the correct PTS, we calculate the mean of the aforementioned set:

$$d_{PTS} := \frac{1}{|D|} \underset{d\in D}{\sum} d $$

Please note that for the calculation of average distance to the correct PTS, only images which were linked correctly are used.
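Expressed in code, the two performance measures could be computed as follows; this is a sketch over dictionaries mapping image ids to (video id, PTS) tuples, with illustrative names.

```python
def evaluate(linked, ground_truth):
    """Compute video hit rate and mean PTS deviation over correctly linked images.

    linked, ground_truth: dicts mapping image id -> (video id, pts in seconds)
    """
    deviations = [abs(linked[i][1] - ground_truth[i][1])
                  for i in ground_truth
                  if linked[i][0] == ground_truth[i][0]]
    v_hr = len(deviations) / len(ground_truth)
    d_pts = sum(deviations) / len(deviations) if deviations else float('nan')
    return v_hr, d_pts
```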

In contrast to previous work, which used a constant sub-sampling of the surgical videos, we evaluate dense and sparse sampling for the methods described in Section 4. For dense sampling, we link a query image to a video and PTS considering every frame as a possible candidate. For sparse sampling, we divide the videos into segments and use the first frame of each segment as key frame. The segment sizes were chosen according to commonly used segment sizes in adaptive streaming: 1s, 2s, 4s, 8s, and 16s, as already mentioned in Section 1.
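Under this scheme, the candidate set for sparse sampling reduces to the first frame of each segment, as in the following sketch (illustrative helper, assuming 25 fps).

```python
def keyframe_indices(n_frames, segment_seconds, fps=25):
    """Indices of the first frame of each segment (sparse sampling)."""
    step = int(segment_seconds * fps)
    return list(range(0, n_frames, step))

# Example: keyframe_indices(2200, 4) -> [0, 100, 200, ...] for an 88-second video
```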

5 Evaluation and discussion

A graphical overview on the performance of the individual approaches is given in Figs. 4, 5, 6, and 7. A comparative overview of the best performing instances of these approaches is given in Fig. 8. The top rows of these figures illustrate the video hit rate. For the queries which were answered correctly, the average deviation from the true PTS in seconds was evaluated and illustrated together with the corresponding standard deviations. Please note that in some cases the standard deviation is very high and thus cut off in the figures for the sake of readability. For an extensive overview of the retrieval performance, please refer to Table 1 for the video hit rate and Table 2 for the deviations from the true PTS. In the following, we discuss the results for hand-crafted and CNN-based approaches individually. We compare these groups of approaches and point out advantages and drawbacks while comparing their retrieval performances. Eventually, we compare the approaches along the dimension of computational requirements.

Fig. 4

An overview on the performance of the hand-crafted approaches: PSNR, SSIM, CEDD, and feature signatures. The X-axes denote sampling strategies: Dense, 1s, 2s, 4s, 8s, and 16s. The Y-axis of the top diagram denotes the video hit rate; the bottom diagram illustrates the average distance from the found video position to the correct playback time in seconds, whenever the correct video was hit. Error bars denote standard deviations

Fig. 5

An overview on the performance for CNN features off-the-shelf: Histogram of Class Confidences (HoCC) and Neural Codes (NC) extracted from layers fc6 and fc7 for the AlexNet model as well as HoCC and NC of GoogLeNet initialized with weights from the Caffe model zoo. The X-axes denote sampling strategies: Dense, 1s, 2s, 4s, 8s, and 16s. For the top row, the Y-axis denotes video hit rate. Higher is better. The bottom row illustrates the average distance from the found video position to the correct playback time in seconds, whenever the correct video was hit. Smaller is better. Error bars denote standard deviations

Fig. 6

An overview on the performance for the CNN features off-the-shelf: Histogram of Class Confidences (HoCC) and Neural Codes (NC) for the ResNet models ResNet-50, ResNet-101, and ResNet-152. The X-axes denote sampling strategies: Dense, 1s, 2s, 4s, 8s, and 16s. For the top row, the Y-axis denotes video hit rate. Higher is better. The bottom row illustrates the average distance from the found video position to the correct playback time in seconds, whenever the correct video was hit. Smaller is better. Error bars denote standard deviations

Fig. 7

An overview on the performance of binary CNN features: Histogram of Class Confidences (HoCC) and Neural Codes (NC) for AlexNet, GoogLeNet, and ResNet. The X-axes denote sampling strategies: Dense, 1s, 2s, 4s, 8s, and 16s. For the top graph, the Y-axis denotes video hit rate. Higher is better. The bottom graph illustrates the average distance from the found video position to the correct playback time in seconds, whenever the correct video was hit. Smaller is better. Error bars denote standard deviations

Fig. 8

An overview on the performance of PSNR, feature signatures, AlexNet FC7 with euclidean distance, and binary AlexNet FC6 features. The X-axes denote sampling strategies: Dense, 1s, 2s, 4s, 8s, and 16s. For the top graph, the Y-axis denotes video hit rate. Higher is better. The bottom graph illustrates the average distance from the found video position to the correct playback time in seconds, whenever the correct video was hit. Smaller is better. Error bars denote standard deviations

Table 1 Experimental results for video hit rate vHR of the various approaches, using different sampling intervals and distance metrics
Table 2 Average deviation dPTS (in seconds) from the true playback position for successfully retrieved videos for the various approaches, using different sampling intervals and distance metrics

5.1 Hand-crafted approaches

In this section, we take a detailed look at the hand-crafted approaches PSNR, SSIM, CEDD, and feature signatures. An overview on their video hit rate and temporal deviation performance is depicted in Fig. 4. The two visual quality metrics PSNR and SSIM deliver reasonable results for this kind of retrieval task. At dense sampling, both provide good results. PSNR even performs perfectly in terms of video hit rate, whereas SSIM delivers a slightly better performance in terms of temporal deviation from the true PTS. Considering larger sampling intervals, the video hit performance of SSIM drops more drastically than PSNR's. Interestingly, SSIM is very precise within a found video up to a sampling size of 8s: whenever SSIM identifies the correct video, the true PTS is (on average) also retrieved within the sampling interval. On the other hand, PSNR achieves high video hit rates up to a sampling rate of 4s (where still 78.3 percent of the query images are linked to the correct video). The average distance to the true PTS is very small compared to the sample size, but has a standard deviation of approximately 20s for the 2, 4, 8, and 16s sampling intervals. We conclude that the dramatic performance decrease of SSIM originates from the fact that the structural features of consecutive frames change a lot. PSNR, on the other hand, uses pixel-based averaging, which proves to be more resilient at larger sampling intervals. A clear drawback of these two approaches is the high query complexity: a full-resolution calculation of PSNR and SSIM is much more expensive than distance calculations between feature vectors.

In contrast to PSNR and SSIM, the CEDD descriptor is a precomputed feature vector representing a single image. Despite its computational simplicity, this method features a maximum video hit rate of .855, which is approximately the same performance as PSNR at a sampling rate of 2s. The performance drop with growing sampling intervals is comparable to PSNR's. With an average distance of roughly 10 seconds, CEDD is not that precise in determining the PTS. Within our evaluation, this approach also exhibits a high standard deviation when determining the PTS. Compared to PSNR and SSIM, CEDD is able to quickly find the correct video when applied with dense sampling, because only 432 bit descriptors have to be compared. However, CEDD does not provide a precise localization of the PTS within the correctly retrieved video.

Feature signatures with the signature quadratic form distance as distance function are the last hand-crafted approach evaluated in this work. For dense sampling, feature signatures surpass the performance of CEDD and nearly achieve the performance of SSIM and PSNR. With medium sampling intervals (1 to 4s), feature signatures' performance is comparable to CEDD's. For sampling intervals larger than 4s, CEDD provides better results; we conclude that the abstraction of CEDD works better for such large intervals. As a sanity check against related work, we also evaluate a sampling interval of 5 frames for feature signatures, as it was used in previous work. We achieve similar performance as [3] and [27], with a video hit rate of .899 compared to .88 on the other dataset. In this case, the average distance to the true PTS for found videos is 3.41s with a standard deviation of 7.96s.

We conclude that for a dataset where dense sampling or small sampling intervals are feasible, feature signatures are the hand-crafted approach of choice because of their trade-off between performance and speed. They are faster than PSNR and SSIM and only a bit slower than CEDD, yet outperform CEDD in terms of video hit rate and temporal deviation from the PTS.

5.2 CNN-based approaches

Overviews of the results for CNN-based approaches are given in Fig. 5 for CNN features off-the-shelf from the AlexNet and GoogLeNet architectures, Fig. 6 for CNN features off-the-shelf from the ResNet architectures, and Fig. 7 for binary CNN features. An advantage of CNN-based approaches is their easy applicability, as a plethora of deep learning frameworks and pre-trained models allow for easy use. A drawback in comparison to the hand-crafted methods is that on systems without GPU support, CNN-based methods are of exorbitant computational cost. In the following, we deal with CNN features off-the-shelf and binary CNN features separately.

5.2.1 CNN features off-the-shelf

The HoCC approach using the AlexNet model achieves a recall of up to 97%. Generally, its performance drops with larger sampling intervals. From Fig. 5, it is apparent that the Manhattan distance works best for this approach in terms of video hit rate. Considering the standard deviations of the average deviation from the true PTS, we cannot state that one distance metric works better than another in this dimension. The only exception to this observation is that the Manhattan distance seems to work better for dense sampling. The PTS retrieved with the Manhattan distance, which achieves the best results for HoCC, feature an average deviation of up to 5 seconds and a standard deviation of less than 10 seconds for sampling intervals of up to 4 seconds. For GoogLeNet and the three ResNet models, we see the same trend in terms of the used distance function: Manhattan dominates the Euclidean distance, which dominates the maximum distance. We also observe the same behavior for PTS retrieval as when using AlexNet. While comparing HoCC for the CNN architectures AlexNet and GoogLeNet across sampling intervals, we observe that AlexNet outperforms GoogLeNet. This observation is especially visible in terms of deviation from the true PTS. We observe the same for the ResNet models, whose HoCC features perform worse than AlexNet's and GoogLeNet's. We think this originates from the fact that the GoogLeNet and ResNet architectures are more specialized for general images such as those occurring in the ILSVRC dataset.

The Neural Code approaches for AlexNet provide the best results for dense sampling out of all CNN-based approaches. Using features extracted from layer fc7 and dense sampling, the maximum video hit rate of 100% is achieved regardless of the distance metric. All in all, the Euclidean distance works well for fc7, while the Manhattan distance is able to compete with and occasionally surpass the Euclidean distance using features from fc6. Regarding the distance to the true PTS, the standard deviations of the Manhattan and max distances are comparable. The max distance works worst across all test cases. For GoogLeNet, the Euclidean distance works best over all sampling intervals in terms of video hit rate. With small sampling intervals, the Manhattan distance provides better results in terms of temporal deviation from the true PTS, while the max distance works best for large sampling intervals in this dimension. With NC, we observe that GoogLeNet features perform slightly worse than AlexNet features, especially over larger sampling intervals. We think this originates from the fact that the AlexNet feature vectors, with 4096 dimensions, are four times larger than GoogLeNet's 1024-dimensional features. With ResNet, we observe that the Euclidean and Manhattan distances work best. For dense sampling, there is no difference in the performance of the three ResNet models, but with larger sampling intervals, ResNet-101 outperforms the other two model variants in terms of vHR, while there is no clear best model in terms of dPTS. Compared to AlexNet, we observe that the best ResNet model (ResNet-101 with Euclidean distance) performs approximately as well as AlexNet features across all sampling intervals. We conclude that this originates from the fact that the 2048-dimensional ResNet-101 features are as discriminative as the 4096-dimensional AlexNet features for this use case. However, AlexNet still performs better than ResNet in terms of dPTS.

5.2.2 Binary CNN features

The evaluation results imply that for the video hit rate, binary neural codes slightly outperform binary HoCC features by at least 1.4% (94.2% against 92.8% for dense sampling). We furthermore observe over growing sampling intervals that the performance of binary neural codes decreases more slowly than that of binary HoCC. The binarization process leads to an increase in the performance of GoogLeNet relative to AlexNet. Binary neural codes with dense sampling work equally well in terms of video hit rate for AlexNet and GoogLeNet. At higher sampling intervals, the higher-dimensional features of AlexNet perform better. The observation that AlexNet's features provide better results in terms of temporal deviation from the true PTS also holds for the binary version. All in all, binary AlexNet features from fc6 achieve the best performance within this category. For the three ResNet models, the experiments show that NC works better than HoCC over all evaluated sampling intervals. As with the real-valued CNN features off-the-shelf, ResNet-101 performs best, with the exception of the 16s sampling interval, where it is outperformed by ResNet-50. Compared to AlexNet, binary ResNet features perform slightly worse. Compared to GoogLeNet, they perform approximately equally for small sampling intervals and slightly better for larger sampling intervals. We conclude that this performance hierarchy for large sampling intervals originates from the number of dimensions (1024, 2048, and 4096 for GoogLeNet, ResNet, and AlexNet respectively) of the used features.

5.3 Retrieval performance comparison and computational analysis

The results for PSNR, feature signatures, CNN features off-the-shelf extracted from AlexNet fc7 and compared with the Euclidean distance, as well as binary CNN features from AlexNet fc6, are compared in Fig. 8. At dense sampling, uncompressed CNN features and PSNR outperform the other approaches. When the sampling interval is set to 1s, binary features catch up and nearly reach PSNR's performance. At a sampling interval of 2s, binary features surpass PSNR in terms of video hit rate. At a sampling interval of 4s, we observe that binary features even outperform uncompressed CNN features. Binary CNN features also yield the best performance of all approaches compared in Fig. 8 for sampling intervals larger than 4s. Furthermore, we observe that no other approach performs significantly better than binary CNN features at PTS retrieval. Please note that binary CNN features do not outperform CNN features off-the-shelf for high sampling intervals in every combination of distance metric, CNN architecture, and extraction point. Rather, we conclude that binary CNN features perform on par with CNN features off-the-shelf for high sampling intervals. We state that preserving the sparsity of CNN features off-the-shelf and stalling the flipping of binary vector components for consecutive frames with non-zero thresholds (see Section 3.3) is the key to binary CNN features maintaining a high retrieval performance. The results for the average deviation from the playback time stamp indicate that at higher intervals, binary CNN descriptors are slightly worse than off-the-shelf descriptors for the very reason of stalling these flips.

Table 3 shows a comparison of the approaches from the viewpoint of their computational resource needs. For SSIM and PSNR, we compute the average comparison time based on 500,000 individual comparisons. As SSIM and PSNR do not impose storage requirements, the entry for these categories is empty. SSIM and PSNR have an infeasible computational cost for comparison, so that they can only be used for very small databases: for a five-hour surgery, the linking of a single image would take hours. For the measurement of the similarity computation time of binary CNN features, we implement the bit counting function of the Hamming distance computation in three different versions. In version 1 (v1), we count the bits using Brian Kernighan's bit count algorithm. The algorithm uses a loop that clears the least significant set bit until no bits are set; the number of loop executions is the number of set bits. The loop is executed at most once per bit, or put differently, for a 4096 bit descriptor the loop is executed at most 4096 times. Version 2 (v2) and version 3 (v3) use a precomputed lookup table of 8 bit and 16 bit respectively. The lookup table is realized as an integer array of size \(2^{8}\) for the 8 bit approach and \(2^{16}\) for the 16 bit approach. The lookup table contains the number of set bits for the corresponding index. The input descriptors are sliced into parts of 8 and 16 bits respectively, and the sum of set bits over all slices is the total number of bits set in a given binary number. Thus, for the 16 bit lookup table, a binary number of size 16n is sliced into n parts of 16 bits. Hence, we compute the number of set bits in a binary number using n lookups and additions. For example, a 4096 bit number (the result of an XOR operation of two 4096 bit binary CNN descriptors) requires 256 lookups and additions. Binary features of dimension 4096 compared with Hamming distance v1 (Brian Kernighan's bit count) show an average distance calculation time of 6.49 μs. Distance comparison of two 432 bit CEDD descriptors using the Tanimoto distance implemented in LIRE performs even better with an average of 1.80 μs. Optimizing the Hamming distance calculation for binary features with lookup tables leads to a performance increase: Hamming distance v2 (8 bit lookup table) and Hamming distance v3 (16 bit lookup table) on average only take 0.94 and 0.64 μs respectively per distance computation. Please note that the CEDD comparison was performed using the LIRE library, while the other approaches have been implemented in C++. Furthermore, we ignore descriptor extraction time (which is dealt with qualitatively below) as well as pre-computations, e.g. calculating binarization thresholds or initializing lookup tables. Whereas modern GPUs allow for efficient extraction of CNN features, the extraction of hand-crafted features is computationally expensive. Within the CNN-based approaches, descriptor extraction is most efficient for AlexNet and GoogLeNet. Furthermore, we observe that CEDD with its 432 bit descriptor requires very little storage space. Feature signatures are variable-length image descriptors; in our experiments the average size of a feature signature is about 1.5KB. Binary CNN features require 1000, 1024, 2048, and 4096 bit of storage space per descriptor for HoCC, GoogLeNet NC, ResNet NC, and AlexNet NC respectively. We want to emphasize that this is a huge decrease in required storage compared to real-valued CNN descriptors, which take 4000, 4096, 8192, and 16384 Byte per descriptor for HoCC, GoogLeNet NC, ResNet NC, and AlexNet NC respectively.
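The following sketch re-implements the Hamming distance with two of the discussed bit-counting variants in Python for illustration only; the evaluated versions were written in C++ (and the CEDD comparison uses LIRE).

```python
# Precomputed 16-bit lookup table (v3); v2 would use 2**8 entries analogously.
POPCOUNT16 = [bin(i).count('1') for i in range(2 ** 16)]

def popcount_kernighan(x):
    """v1: Brian Kernighan's bit count - clears the lowest set bit per iteration."""
    count = 0
    while x:
        x &= x - 1
        count += 1
    return count

def popcount_lut16(x, word_bits=4096):
    """v3: slice the number into 16-bit words and sum the table lookups."""
    count = 0
    for _ in range(word_bits // 16):
        count += POPCOUNT16[x & 0xFFFF]
        x >>= 16
    return count

def hamming_distance(a, b, word_bits=4096):
    """Hamming distance between two binary CNN descriptors given as Python ints."""
    return popcount_lut16(a ^ b, word_bits)
```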

Table 3 Overview on computational requirements of selected approaches: SSIM, PSNR, CEDD, feature signatures, CNN features compared with euclidean distance, and binary CNN features compared with hamming distance, where the bit counting function is implemented in three different ways

Based on estimations from [28], we assume that an exemplary hospital with 5 departments and two operating rooms generates 60 h of video per day. Over a year, that amounts to over 20,000 h of endoscopic video recordings. For 20,000 h of video and a sampling interval of 16 s, the hospital system is required to store 4.5 million image descriptors. For GoogLeNet NC features, this would mean 17.16 GB of image descriptors alone, while a binarized version of the same features (which performs approximately 6 percentage points worse in terms of video hit rate: .565 vs .507) would take about 550MB. From another point of view and with the same features, extracting a binary descriptor every second would require 8.5GB of descriptor storage, and those binary features would perform approximately 33 percentage points better (.564 vs .899) than the CNN features off-the-shelf at half the size.
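For transparency, the 16 s estimate can be reproduced as follows (assuming 4 Byte floats per real-valued feature component and reading GB as \(2^{30}\) Byte):

$$\frac{20{,}000\,\mathrm{h} \cdot 3600\,\mathrm{s/h}}{16\,\mathrm{s}} = 4.5 \times 10^{6} \text{ descriptors} $$

$$4.5 \times 10^{6} \cdot 1024 \cdot 4\,\mathrm{Byte} \approx 17.16\,\mathrm{GB}, \qquad 4.5 \times 10^{6} \cdot \frac{1024}{8}\,\mathrm{Byte} \approx 0.54\,\mathrm{GB} $$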

6 Conclusion

In this work, we investigate the problem of linking endoscopic images and video sequences within an endoscopic multimedia database. To this end, we propose a binarization scheme for off-the-shelf CNN features aiming at straightforward applicability and low computational resource cost. We perform an extensive evaluation against baselines of the hand-crafted methods PSNR, SSIM, CEDD, and feature signatures, as well as uncompressed CNN features off-the-shelf. Our evaluation shows that state-of-the-art methods have severe drawbacks in this task. PSNR and SSIM provide a solid baseline in terms of retrieval performance, but are computationally not feasible for large EMDBs. The content descriptor CEDD and feature signatures do not have these problems, but yield a mediocre performance at dense sampling of the EMDB frames. PSNR and off-the-shelf features work best in terms of video hit rate and distance to the true PTS. The proposed approach - binary CNN features off-the-shelf - incurs only a slight performance loss for dense sampling compared to uncompressed CNN features. It reaches and partially outperforms the performance of other approaches (e.g. feature signatures) at sparse sampling intervals. Binary CNN features are able to achieve high retrieval performance, as our binarization scheme aims at preserving the sparsity of CNN features off-the-shelf. Moreover, it allows for similar feature vectors of consecutive video frames, which is beneficial at larger sampling intervals. Furthermore, binary CNN features have a significantly reduced storage space requirement and distance calculation complexity compared to uncompressed CNN features off-the-shelf. Future work on this topic may consider reducing the dimensionality of already binarized features without loss of accuracy. Furthermore, for future work we aim at automatically detecting shot boundaries of relevant surgery scenes.