Introduction

The tracking and recognition of surgical instruments play a crucial role in many clinical and medical processes. Knowing which instrument was used at a particular point in time during an operation can contribute to the success of the operation and provide clarity in case of failure. Furthermore, this information supports the direct preparation and follow-up of the surgery in the operating room itself and helps to improve the efficiency of the processes in a central sterile services department (CSSD). Figure 1 shows a typical cycle of reusable surgical instruments in hospitals.

Fig. 1

Cycle of surgical instruments within a hospital

Surgical instruments are transported between operating rooms and the CSSD in so-called trays. It is precisely defined which instruments belong to which tray. A missing or wrong instrument can jeopardize the success of an operation and must be avoided in all preparation processes [1]. In follow-up processes, an instrument that goes missing immediately after the operation can have serious consequences. If the instruments are mixed up in the trays after an operation, the sterilization processes lose efficiency significantly. Consequently, the recognition and tracking of surgical instruments is an important task that is prone to mistakes if carried out purely by humans.

It is important to distinguish the level of detail at which the recognition of surgical instruments takes place. A simple breakdown of instruments into main classes provides little added value: whether an instrument is a pair of scissors or a clamp is relevant information, but not all clamps fulfill the same function. An assignment of manufacturer and item number is necessary to ensure that the instruments provide the functions required for an operation. In contrast, a unique identification of individual instruments is not necessary, since instruments with the same item number are essentially the same medical device and therefore provide the same function (cf. Figure 2). In this work, it is first investigated whether surgical instruments, some of which are very similar and some very different, can be recognized under controlled environmental conditions.

Fig. 2

Classification of surgical instruments with respect to different levels of detail

Scientific question

In cooperation with Charité CFM Facility Management GmbH, the following scientific question was investigated in this work: Is image-based recognition of surgical instruments at item number level possible using only machine learning and image processing methods?

State of the art

The state of the art addresses current techniques and systems used for marker-based recognition and track and trace of surgical instruments. The field of object recognition with image processing methods based on machine learning is also reviewed, especially convolutional neural networks (CNNs).

The authors are not aware of any publication or method that has successfully tested surgical instrument recognition based on convolutional neural networks to date.

Marker-based technologies

Marker-based recognition means that an additional element must be added to each instrument so that a technical system can recognize it. A distinction is made between non-image processing technologies and image processing technologies.

Non-image processing technologies

Non-image processing technologies use radio frequency identification (RFID) techniques. These systems consist of a reader and a passive transponder attached to each instrument. The reader reads the passive transponder contactlessly and can thus identify and localize it. There are already established products available on the market for this purpose [2,3,4,5,6].

Image processing technologies

Image processing technologies use visual markers such as QR codes or bar codes [7]. Several publications address the track and trace of surgical instruments with image processing, but the focus lies on the use in the operating room during an operation, where the motion of an object is tracked. The item number of an instrument must be assigned manually in a previous step [9,10,11].

Advantages and disadvantages

The technologies presented have the advantage that they can identify each instrument uniquely, i.e., much more precisely than recognition at item number level. The disadvantage is that such detailed recognition is often not necessary and that each surgical instrument must be specially prepared for this purpose. Due to the sterilization process, markers such as item numbers, QR codes or bar codes wear out and become unreadable. In addition, modified instruments must be re-certified before they can be used.

Marker-free technologies

Numerous publications have already shown that modern methods of image processing from the field of machine learning achieve great success in object recognition [8, 12,13,14,15]. The methods recognize objects solely based on their visual appearance. Recognition is formulated as a classification problem: an image is assigned to a class, i.e., an item number.

So-called convolutional neural networks (CNNs) perform feature extraction and subsequent classification using a holistic approach. Their special advantage is that no manual modification of weights or parameters is necessary. The CNN learns all weights during a training process using a large amount of training data. The network architectures used for this work are presented below.
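As an illustration of this holistic principle, the following minimal sketch (in PyTorch; it is not the authors' implementation) defines a small CNN in which stacked convolutional layers extract features and a single fully connected layer assigns one of the 156 item-number classes used later in this work.

```python
# Minimal sketch of a CNN classifier: convolutional feature extraction
# followed by a classification head. Not the authors' implementation.
import torch
import torch.nn as nn

class TinyInstrumentCNN(nn.Module):
    def __init__(self, num_classes: int = 156):
        super().__init__()
        # Feature extraction: stacked convolution + pooling blocks
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Classification: one fully connected layer mapping features to item numbers
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x).flatten(1)
        return self.classifier(f)

model = TinyInstrumentCNN()
logits = model(torch.randn(1, 3, 224, 224))  # one dummy RGB image
print(logits.shape)  # torch.Size([1, 156])
```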

Densely connected networks

The densely connected network (short: DenseNet) is characterized by the fact that the feature maps of all preceding layers are available as input to each layer [13]. This results in a very high density of information. Classically, each layer of a CNN only receives the features of the immediately preceding layer. This has the disadvantage that the same features may be computed several times at different locations in the network. DenseNet can thus compute better features more efficiently. The authors compare different network sizes with respect to the number of layers (the depth of the network). The version with 201 layers offers the best trade-off between recognition accuracy and network size. Therefore, DenseNet-201 is chosen for this work.

Efficient networks

The efficient network (short: EfficientNet) is a partially automated CNN architecture [14]. Machine learning algorithms optimized an existing architecture at various points: the width, depth and input resolution of the network. The authors present eight variants that differ in the number of network weights. The variant with the highest number of weights, B7, reaches the highest recognition accuracies. Therefore, EfficientNet-B7 is chosen for this work.

Big transfer models

The big transfer model (short: BiT-M) is a modified version of a residual network [15]. Its special feature is that it was pre-trained on the ImageNet-21k data set, which comprises roughly 14 million images [15].
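For orientation, the sketch below shows how DenseNet-201 and EfficientNet-B7 could be instantiated with ImageNet weights and adapted to the 156 item-number classes of this work; it assumes PyTorch with torchvision ≥ 0.13 and is not the authors' code. BiT-M weights are distributed separately by its authors and are therefore only noted in a comment.

```python
# Hedged sketch: instantiating two of the architectures used in this work
# with ImageNet weights and a 156-class head (one class per item number).
# Assumes torchvision >= 0.13. BiT-M weights are published separately by
# its authors and are omitted here.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 156

densenet = models.densenet201(weights=models.DenseNet201_Weights.IMAGENET1K_V1)
densenet.classifier = nn.Linear(densenet.classifier.in_features, NUM_CLASSES)

efficientnet = models.efficientnet_b7(weights=models.EfficientNet_B7_Weights.IMAGENET1K_V1)
efficientnet.classifier[1] = nn.Linear(efficientnet.classifier[1].in_features, NUM_CLASSES)
```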

Concept

The presented work addresses the recognition of surgical instruments on images. The use of machine learning methods requires the acquisition of training, validation and test data. This section describes how these data are obtained and provided. Furthermore, the design of experiments is presented.

Acquisition setup

An acquisition system with three industrial cameras is used for data acquisition. The cameras focus on the instrument at three different angles: top view = \(0^\circ \), second view = \(30^\circ \) and third view = \(45^\circ \). A \(90^\circ \) side view was omitted, since the instruments are characterized mainly by their length and width, which are captured by the top view; for most instruments, essential recognition features are not expressed in their height. The instrument tray is homogeneous, white and non-reflective. The lighting is diffuse to avoid shadows cast by the instruments. The colored RGB images have a resolution of \(896 \times 1024\) pixels (width \(\times \) height).

Data set

A total of 156 different surgical instruments are used in this work. They cover a wide range of sizes and types. Figure 3 shows example images of surgical instruments that look very similar as well as instruments that look very different. Figure 4 shows the distribution of main classes across all used instruments.

Fig. 3

The top row shows surgical instruments with different item numbers that are very similar. The bottom row shows surgical instruments that look very different

A total of seven different positions and two different orientations were used to cover all perspectives of the instruments. Using three cameras, this results in \(7 \times 2 \times 3 = 42\) images per instrument. Only one instrument is presented to the acquisition system at any given time. The instrument is captured with different orientations (e.g., lying on its side) and in different positions (e.g., central or left in the image). The entire image set of one instrument is shown in Figure 5. Note that the first and seventh positions are very similar as they describe the intuitive position.

Fig. 4

Distribution of the main classes of the instruments used in this work

To train the CNNs, the data set is split as follows:

  1.

    Training data These data are used to learn the patterns of objects to be recognized. The CNNs use the training data to automatically tune their weights during the training phase.

  2.

    Validation data About 25% of the training data is randomly selected and withheld from the CNN during training. This allows testing how the CNN behaves with unknown data. The validation data does not follow a fixed selection scheme. The higher the recognition accuracy on the validation data, the higher the generalization of the algorithm.

  3.

    Test data These data consist of six images per instrument and describe the intuitive position in which a worker would present an instrument to the acquisition system (straight orientation, central in the image). The higher the recognition accuracy on the test data, the higher the specialization of the algorithm to the use case.

The total of 6552 images is split into 4496 training images, 1120 validation images and 936 test images (percentage distribution: approximately 69/17/14). The final recognition accuracy on the test data is the deciding criterion.
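The following sketch illustrates this split scheme (the field names and the exact validation fraction are assumptions; the counts 4496/1120/936 reported above follow the fixed scheme shown in Figure 5):

```python
# Hedged sketch of the data split: per instrument, the six images of the
# intuitive position form the test data; roughly a quarter of the resulting
# training set is held out as validation data. Field names are assumed.
import random
from collections import defaultdict

def split_dataset(images, val_fraction=0.2, seed=0):
    """images: list of dicts such as
    {"path": ..., "item_number": ..., "intuitive_position": bool}"""
    rng = random.Random(seed)
    per_instrument = defaultdict(list)
    for img in images:
        per_instrument[img["item_number"]].append(img)
    train, val, test = [], [], []
    for imgs in per_instrument.values():
        test += [i for i in imgs if i["intuitive_position"]]          # 6 images
        remaining = [i for i in imgs if not i["intuitive_position"]]  # 36 images
        rng.shuffle(remaining)
        n_val = round(len(remaining) * val_fraction)                  # ~7 images
        val += remaining[:n_val]
        train += remaining[n_val:]
    return train, val, test
```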

Fig. 5

The 42 images for one single instrument are shown here. The colored boxes show which images are used for training, validation and test data. Details can be found in the following section

Design of experiments

The CNNs presented in the “Marker-free technologies” section consist of several million network weights, all of which are tuned automatically in the training process. In addition to architecture selection and training duration, different optimization techniques are also evaluated.

Prior knowledge

Prior knowledge is injected in a targeted way to obtain good feature extraction even with little data. For this purpose, the CNNs are pre-trained on the publicly available ImageNet data set [16]. Three different strategies are distinguished in the experiments (a minimal sketch of all three follows the list):

  1.

    Random initialization The CNN is initialized with random weights and parameters.

  2.

    Pre-trained The CNN is pre-trained on the 1.2 million images of the ImageNet data set. The feature extraction is frozen. Only the classifier is trained.

  3.

    Fine-tuning The initialization is done as in the “pre-trained” case. The feature extraction is not frozen but adapted to the instruments with fine-tuning during the training process.
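A minimal sketch of these three setups for a DenseNet-201 could look as follows (assuming PyTorch with torchvision ≥ 0.13; not the authors' implementation):

```python
# Hedged sketch of the three prior-knowledge strategies for a DenseNet-201.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 156

def build_densenet(strategy: str) -> nn.Module:
    if strategy == "random":
        # 1. Random initialization: no pre-trained weights at all
        model = models.densenet201(weights=None)
    else:
        # ImageNet pre-training (1.2 million images)
        model = models.densenet201(weights=models.DenseNet201_Weights.IMAGENET1K_V1)
        if strategy == "pretrained_frozen":
            # 2. Pre-trained: freeze the feature extraction, train only the classifier
            for p in model.features.parameters():
                p.requires_grad = False
        # 3. Fine-tuning: keep all weights trainable (nothing frozen)
    model.classifier = nn.Linear(model.classifier.in_features, NUM_CLASSES)
    return model
```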

Data augmentation

All training images are modified using image processing techniques. During training, the images are randomly rotated or their color values (e.g., hue, saturation, contrast) are changed by up to 5%.
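Using torchvision transforms, the described augmentation could be sketched as follows (the rotation range is an assumption; only the 5% bound on the color values is taken from the text):

```python
# Hedged sketch of the augmentation pipeline; not the authors' code.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=180),          # random rotation of the instrument
    transforms.ColorJitter(hue=0.05, saturation=0.05,
                           contrast=0.05),           # color values changed by up to 5%
    transforms.ToTensor(),
])
```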

Experiments

The baseline for all evaluations is defined by DenseNet-201. The network is initially trained over 50 epochs, i.e., the training data is presented to the CNN 50 times. The corresponding accuracies for the test data are entered in Table 1. First, it is evaluated to what extent the use of prior knowledge provides an advantage in the recognition of surgical instruments.
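A minimal training loop for this baseline could look like the sketch below; optimizer, learning rate and loss function are assumptions, since the text only fixes the architecture and the number of epochs.

```python
# Hedged sketch of the baseline training loop (optimizer and learning rate
# are assumptions; the paper specifies DenseNet-201 and 50 epochs).
import torch
import torch.nn as nn

def train(model, train_loader, epochs=50, lr=1e-4, device="cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):          # the training data is presented 50 times
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```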

It is obvious that the resulting data set is far too small to train a network from scratch. Using prior knowledge helps a lot but still yields inferior results. Only the fine-tuning approach achieves very good results. On the test data, even a Top 5 accuracy of 100% is reached, which means that the correct item number is always among the first five predictions of the CNN. The fine-tuning approach is therefore used for all following calculations.
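The reported Top 1 and Top 5 accuracies can be computed as in the following sketch (not the authors' code): a prediction counts as a Top 5 hit if the correct item number appears among the five highest-scoring classes.

```python
# Hedged sketch: top-1 and top-5 accuracy on the test data.
import torch

@torch.no_grad()
def topk_accuracy(model, test_loader, device="cuda", ks=(1, 5)):
    model.eval().to(device)
    correct = {k: 0 for k in ks}
    total = 0
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        logits = model(images)
        topk = logits.topk(max(ks), dim=1).indices               # shape: (batch, 5)
        for k in ks:
            correct[k] += (topk[:, :k] == labels.unsqueeze(1)).any(dim=1).sum().item()
        total += labels.size(0)
    return {k: correct[k] / total for k in ks}
```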

To make the recognition robust to increased variance, it is advisable to augment the training data set. In each training epoch, the augmentations are applied to each image at random, drawn from a uniform distribution. Thus, the variance within the training data increases significantly. Table 2 shows that augmenting the training data with rotations is effective. In contrast, increasing the variance by slightly manipulating the pixel values does not add any value.

Table 1 Evaluation for the usage of prior knowledge
Table 2 Evaluation of data augmentation techniques

Table 3 shows the accuracies of a DenseNet-201 for different training durations. It can be seen that the difference between 20 and 50 epochs is hardly relevant, but training over 500 epochs brings a significant increase in recognition accuracy.

Table 3 Evaluation of the duration of the training

Finally, the different architectures are evaluated against each other. All methods are trained without data augmentation over 50 epochs. BiT-M again illustrates the influence of transfer learning: Table 4 shows an increase compared to DenseNet-201 of 5.7 percentage points in Top 1 accuracy on the validation data and 4.1 percentage points on the test data. EfficientNet-B7 achieves the highest Top 1 accuracy on both the validation data and the test data in this work. A Top 1 accuracy of 99.9% on the test data means in absolute numbers that there is only one misclassification among the 936 test images.

Table 4 Evaluation of different network architectures

Discussion

It is obvious that the targeted integration of prior knowledge is necessary in order to use CNNs in the scenario described. This is mainly due to the size of the networks: training all weights from scratch with 4496 training images is not possible. Initializing the CNNs with meaningful weights so that only fine-tuning remains is very successful. Another method to address the small amount of training data is to augment the data artificially, for example by rotating the images or changing their pixel values. Enhancing the training data with rotated images has led to an increase in recognition accuracy. Recognition of instruments is a use case that requires high specialization rather than high generalization of the CNNs. The longer an algorithm trains, the higher the recognition accuracy; the largest increase is found within the first 10 epochs. Nevertheless, it has been shown that longer training can still achieve small but measurable increases in recognition accuracy. For future work, it must be noted that the training duration per epoch increases proportionally to the number of training images.

Conclusion

A total of 156 different surgical instruments are captured and a data set with a total of 6552 images is created. The instruments can be recognized with a maximum Top 1 accuracy of 99.9% and a Top 5 accuracy of 100%. The scientific question posed in the “Scientific question” section can therefore be answered in the affirmative.

This work is the basis for image-based tracking of surgical instruments in the operating room during an operation. It addresses the recognition of one instrument per image in front of a homogeneous background with controlled lighting conditions. The high variance of possible backgrounds and scenes as well as the very small ratio of instrument size to image size pose the most difficult challenges. Overlapping instruments or occlusion by operating room personnel are further hurdles to reliable track and trace of instruments in the operating room. It should be noted that the established marker-based technology can only be replaced where recognition at item number level makes sense and where instruments are visible.