Adaptive kernel selection network with attention constraint for surgical instrument classification

Computer vision (CV) technologies are assisting the health care industry in many respects, e.g., disease diagnosis. However, the inventory of surgical instruments, a pivotal procedure before and after surgery, has not yet been studied with CV-powered technologies. To reduce the risk and hazard of losing surgical tools, we present a systematic study of surgical instrument classification and introduce a novel attention-based deep neural network called SKA-ResNet, which is mainly composed of (a) a feature extractor with a selective kernel attention module that automatically adjusts the receptive fields of neurons and enhances the learnt representation, and (b) a multi-scale regularizer with KL-divergence as the constraint that exploits the relationships between feature maps. Our method is easily trained end-to-end in a single stage with little additional computational burden. Moreover, to facilitate our study, we create, for the first time, a new surgical instrument dataset called SID19 (19 kinds of surgical tools, 3800 images in total). Experimental results show the superiority of SKA-ResNet for the classification of surgical tools on SID19 when compared with state-of-the-art models. The classification accuracy of our method reaches 97.703%, which supports the inventory and recognition of surgical tools. Our method also achieves state-of-the-art performance on four challenging fine-grained visual classification datasets.


Introduction
The health care sector has long been an early adopter of technological advances and has benefited greatly from them. In recent years, artificial intelligence (AI) technologies, e.g., deep neural networks, have played a key role in many health-related realms, including disease prediction [17] and diagnosis [19], intelligent robot-assisted surgery [16], health monitoring [46], the development of new medical procedures [1], etc.
Computer vision (CV), as one of the most successful research directions in the field of AI, has achieved remarkable breakthroughs in the health care industry, helping medical professionals save valuable time on basic tasks while also saving patients' lives. The focus of CV in health care has been placed on solving various medical tasks by processing different types of medical/pathological images. For example, in the early recurrence prediction of hepatocellular carcinoma [50], radiomics features are extracted from arterial and portal venous-phase CT images for evaluating the preoperative clinical factors. Medical image processing has also been applied to other medical tasks, including computer-aided detection (CADe) in radiological diagnoses [34], prostate image analysis based on 3D image segmentation [30], 3D MRI brain coronal slice image registration [3], detection of critical findings in head CT scans [7], etc. These CV studies and designs have mainly served as auxiliary tools for doctors in numerous analyses and operations, providing guidance that yields higher precision in diagnosis, prediction, screening, tracking and so on.
Beyond the remarkable success of CV in auxiliary medical diagnosis, recent studies have also actively ventured into other emerging application domains in the health care sector, for example, robot-assisted surgery based on a 3D camera [15], rehabilitation training based on vision reconstruction for people with visual impairment [6], health monitoring of patients for disease prediction and prevention [33], etc. Among these studies, research on images of medical instruments, e.g., surgical instruments, which are the most important tools in the procedure of surgery, has been less explored. Notably, the inventory of surgical instruments before and after surgeries is of great importance for medical safety. At present, the surgical instrument inventory is mostly carried out by professional medical staff. Nevertheless, mistakes inevitably occur due to human negligence or fatigue. In 2020, the Australian Productivity Commission released a number of medical records indicating that medical malpractice has risen nationwide, with 430,000 patients affected. Malpractice related to medical devices is even more noticeable among these patients. According to the Daily Mail, surgeons left medical instruments inside at least 23 patients, who were poisoned, infected, or injured in hospital, in just one year. Therefore, it is of great significance to ensure the reliability of the inventory of surgical instruments.
In the surgical instrument inventory, the medical staff is mainly responsible for checking the type and quantity of surgical instruments, and the identification of surgical instruments is one of the focuses of this verification. Taking this cue, this paper studies the surgical instrument inventory with the aim of accurately identifying surgical instruments before and after surgery. This work not only saves human resources, but also quickly identifies whether surgical instruments are missing, which helps prevent secondary infections or fatal medical accidents. At present, medical image studies regarding surgical instruments have mainly focused on surgical tool detection, segmentation and tracking during surgery, so that surgical assistants can follow the operation process more accurately [11]. To the best of our knowledge, the surgical instrument recognition work in this paper is the first attempt to identify and classify a surgical instrument kit for the inventory work of surgery.
To support the work, a novel surgical instrument dataset is built. As reported in statistics from the University of Rochester Medical Center, the most common surgical operations in the USA include appendectomy, breast biopsy, carotid endarterectomy, cataract surgery, etc. Accordingly, this work considers 19 categories of surgical instruments from the surgical kits for appendectomy, cholecystectomy and cesarean section (including Alice forceps, hemostatic forceps of different sizes, oval forceps, suction head, four kinds of hooks, needle holders, cloth forceps, long and short tooth forceps, thread scissors, tissue scissors, intestinal plate, etc.) as the raw materials to create our surgical instrument dataset (labeled as SID19). Notably, certain surgical tools in SID19 belong to fine-grained classes, whose differences are so subtle that they are hard to distinguish from one another. In particular, surgical forceps and surgical scissors both contain several sub-categories, namely fine-grained classes. In SID19, the categories of surgical forceps include Alice forceps, hemostatic forceps of different sizes, oval forceps, long and short tooth forceps, and cloth forceps. The categories of surgical scissors include thread scissors, tissue scissors, etc. Objects in these sub-categories usually exhibit large intra-class and small inter-class variances, which makes the identification task difficult. For example, Fig. 1 displays two forceps categories, Alice forceps and appendix forceps, in different states, views and angles. Alice forceps and appendix forceps differ only slightly at the fore-end, where the fore-end of appendix forceps is much rounder than that of Alice forceps.
Therefore, different from common natural image classification, the presented surgical instrument classification task exhibits the characteristics of fine-grained visual classification (FGVC), thus bringing additional difficulties.
To tackle the FGVC problem, recent works mainly focus on weakly supervised learning and can be roughly grouped into two categories: attention-based methods that strengthen the intermediate feature maps, and methods that exploit the relationships between feature maps. However, for the first group the whole framework is fairly complex because of the additional attention network, and for the second group the design of the regularizer is complicated as well. Additionally, using the strengthened intermediate feature maps extracted from different stages to explore the relationships between feature maps has rarely been studied. In this paper, we propose a novel fine-grained visual classification framework named SKA-ResNet for surgical instrument classification. Our method involves two novel components: a feature extractor built from stacked standard residual blocks with a selective kernel attention (SKA) module to boost the intermediate feature maps, and a multi-scale regularizer to explore the relationships between the strengthened intermediate feature maps.
In the feature extraction stage, we propose to embed the SKA module at the end of the standard residual block, forming a new block called the SKA block. As the core of our feature extractor, the SKA module generates attention with information from different receptive fields to strengthen the intermediate feature maps. Its operation can be described in three stages: Divide, Fuse and Aggregation. In particular, an attention factor is generated by combining adaptive kernel selection with the SGE attention mechanism, so that the informative expression can be strengthened efficiently. Besides, we introduce a multi-scale regularizer to explore the relationships between the feature maps of different scales boosted by the SKA module. The mid-level and high-level feature maps extracted from different stages are strengthened through a CBAM layer similar to the CBAM module [42]. Then, the enhanced feature maps and the last outputs of the feature extractor are concatenated before classification. Meanwhile, the relationships between the enhanced feature maps are constrained by a regularizer that matches the prediction distribution of the mid-level features to that of the high-level ones with KL-divergence. The experimental results demonstrate that our method achieves the best performance on SID19, with an accuracy of around 97.703%, which makes it feasible for assisted decision-making in inventory work. Moreover, our method outperforms the state-of-the-art models on four standard benchmark datasets.
The main contributions of this work are summarized as follows:
1. Considering the medical accidents caused by the loss of surgical instruments during surgeries, surgical instrument classification is proposed for the first time to assist medical staff in inventory work and reduce the risk of such accidents.
2. To explore surgical instrument classification, we adopt the surgical kits corresponding to three of the most common surgeries (appendectomy, cholecystectomy and cesarean section) as source materials to create a dataset, SID19, which collects 3800 images of 19 kinds of surgical tools.
3. A novel attention-based model called SKA-ResNet is proposed for the classification task. The network can capture subtle differences among fine-grained classes by embedding the selective kernel attention module into the feature extractor. Further, a multi-scale regularizer is proposed to boost the classification.
4. Results show that our method achieves a high accuracy of around 97.703% on SID19, which is superior to existing methods. It also achieves superior performance on four challenging fine-grained visual classification datasets when compared with state-of-the-art methods.
The rest of this paper is organized as follows. Section 2 introduces background of the work. Section 3 describes the proposed SKA-ResNet in detail. In Sect. 4, we describe the details of our proposed SID19 dataset. Section 5 shows the experimental settings and results on SID19 and other datasets. Section 6 concludes the paper.

CV applications in health care
Computer vision, as one of the most successfully applied technologies in AI, has been introduced into a wide range of fields to solve specific tasks. Nowadays, various CV-powered technologies are assisting the health care industry in many respects. As a result, medical professionals gain better knowledge of diseases, so that they can make sound judgments or even save patients' lives. Today's health care industry strongly relies on precise diagnostics provided by medical imaging, which works with data obtained by different diagnostic technologies including X-ray, computed tomography (CT), magnetic resonance imaging (MRI), etc. Based on heterogeneous pathologic images, medical image analysis has focused on disease prevention, prediction, detection, diagnosis, screening and so on. For example, for the early diagnosis of chronic obstructive pulmonary disease, Filho et al. [8] propose to utilize information from lung CT images to identify and classify lung diseases with an automatic feature extractor. Moreover, to detect the stage of pancreatic cancer, Sekaran et al. [36] utilize a CNN combined with a Gaussian mixture model fitted by the EM algorithm to identify the essential features from CT scans. More recently, for the screening of coronavirus disease (COVID-19), Wang et al. [40] extract COVID-19's specific graphical features from the radiographical changes in CT images and provide a clinical diagnosis ahead of the pathogenic test, thus saving critical time for disease control. For the detection of COVID-19, Apostolopoulos et al. [2] suggest that state-of-the-art CNN architectures proposed in recent years for medical image classification, combined with transfer learning, are successful in extracting significant biomarkers related to COVID-19 from X-ray imaging.
Besides the success of CV in medical pathologic analysis, new applications in the health care sector have also emerged. For example, in health monitoring, Suo et al. [39] build a personalized time fusion framework to predict patients' risk of developing certain diseases by monitoring changes in patient visit time. In computer-assisted surgery, Pakhomov et al. [31] focus on binary instrument segmentation by leveraging deep residual learning and dilated convolutions. Moreover, Zhao et al. [47] propose a visual tracking approach that uses a CNN with a spatial transformer network and a spatiotemporal context learning algorithm to track tools frame by frame, which is devoted to enhancing the context-awareness of surgeons in the operating room. Sanchez-Garcia et al. [35] present a new CNN-based fusion approach to build a schematic representation of indoor environments for simulated phosphene images, which aims to train and partially recover the retinal stimulation of visually impaired people in rehabilitation training.
In this paper, we conduct a series of studies centered on the identification of surgical instruments before and after surgery. The study aims to save human resources and reduce the risk of secondary infections or fatal medical accidents incurred by the loss of surgical instruments. The proposed work is carried out on a newly designed surgical instrument dataset, which is quite different from datasets of pathological images. Additionally, in contrast to tool detection, segmentation and tracking, which rely on surgery videos in computer-assisted surgery, our study operates before and after surgery for the inventory of surgical tools.

Fine-grained visual classification
Research on FGVC tasks mainly proceeds along two directions, namely strongly supervised learning and weakly supervised learning. Specifically, strongly supervised learning methods add object bounding boxes, part annotations and image-level labels to the training network to learn discriminative location information of the targets [14,24,45]. Nevertheless, such methods suffer because (a) a huge amount of human labor is required to label the original images, and (b) the information marked by humans is sometimes inaccurate. In contrast, weakly supervised learning networks are given only the categories of images for classification.
As one of the most frequently used techniques in CV research, attention mechanisms have been widely employed in various classification, detection and segmentation tasks, especially in weakly supervised FGVC tasks. With attention mechanisms, the informative features are strengthened while the less useful ones are suppressed. Many lightweight attention modules have been introduced in recent years. For example, an efficient, lightweight gating mechanism is introduced in SENet [13] to strengthen the intermediate feature maps via channel-wise importance. Beyond the channel dimension, BAM [32] and CBAM [42] generate attention maps along the spatial and channel dimensions for adaptive feature reinforcement. Based on group convolution, SGENet [21] proposes a novel spatial group-wise enhanced attention, which focuses on learning the different semantic sub-feature maps of each group and self-enhancing their spatial distribution. Apart from the spatial and channel dimensions, SKNet [22] is the first to explicitly explore the adaptive receptive field (RF) size of neurons by introducing a dynamic kernel selection mechanism constructed from multi-branch convolutions with different kernels. All of the above are lightweight attention modules that can be embedded into most backbone networks to improve their performance. Based on the attention mechanism, some methods [9,38,44] construct additional attention networks for the FGVC problem. Although these methods can obtain excellent performance, their architectures are complicated by the additional attention networks when compared with lightweight attention modules.
On the other hand, other weakly supervised models have been introduced in FGVC for feature relationship learning. Methods based on high-order statistics are proposed for visual classification, especially for solving the FGVC problem. Specifically, bilinear CNN (BCNN) [25] performs element-wise square root normalization followed by $\ell_2$ normalization on bilinear features, achieving impressive performance. Compact bilinear CNN [10] proposes two compact bilinear representations with the same discriminative power as the full bilinear representations but with far fewer dimensions. Similarly, the core of iSQRT-COV [20] is a meta-layer with a loop-embedded directed graph structure, specifically designed to ensure both the convergence of the Newton-Schulz iteration and the performance of global covariance pooling networks. Other methods propose to exploit the relationships between feature maps of different scales. Cross-X learning [29] introduces an approach that exploits the relationships between different images and different network layers for robust multi-scale feature learning.
Our method differs from previous works in two aspects. First, by embedding the adaptive kernel selection mechanism with SGE attention, our lightweight and efficient SKA module automatically strengthens the expression of discriminative regions. Second, we utilize a multi-scale regularizer to exploit the relationships between the strengthened feature maps for robust performance. The two parts are complementary in our approach. On the one hand, the attention-based feature extraction network generates feature maps with rich semantic information for multi-scale learning and thus fundamentally improves the performance of multi-scale feature learning; on the other hand, multi-scale feature learning combines the attention-enhanced feature maps with the constraint to guide the generation of feature maps in the feature extraction stage.

Method
In this section, the detailed architecture of SKA-ResNet is described. As depicted in Fig. 2, the whole network is composed of two main components: (1) a novel feature extractor consisting of stacked standard residual blocks with Selective Kernel Attention (SKA) modules, which extracts informative feature maps for discriminative regions without additional attention networks, and (2) a multi-scale regularizer that takes the relationships between feature maps and images as a constraint and learns the relationships between different fine-grained categories. Different from existing attention-based methods that rely on additional attention networks, our method embeds lightweight SKA modules into the standard residual blocks in a distributed manner. Furthermore, the relationships between different feature maps and images are exploited by the multi-scale regularizer for robust fine-grained feature representation. The entire network is trained end-to-end relying only on image-level labels.

SKA module
In the stage of feature extraction, we introduce a novel and lightweight SKA module as depicted in Fig. 3. The feature extractor can localize the discriminative regions and strengthen the corresponding feature maps automatically by embedding the SKA modules into stacked standard residual blocks of ResNet. Operations in the proposed SKA module are summarized into Divide, Fuse and Aggregation. As shown in Fig. 3, we take the SKA module with a two-branch case as an example for the detailed illustration.
Divide In the Divide stage, a given intermediate feature map $X \in \mathbb{R}^{W \times H \times C}$ is first sent into two different convolutional layers to generate two feature maps with different semantic information. Each convolution layer consists of a convolution, Batch Normalization and a ReLU function in sequence. Specifically, the two convolution layers use a $3 \times 3$ kernel and a $5 \times 5$ kernel, respectively. The two obtained feature maps are expressed as $Y_1 \in \mathbb{R}^{W \times H \times C}$ and $Y_2 \in \mathbb{R}^{W \times H \times C}$. Note that the two obtained feature maps have the same size as the original feature map $X$.
Fuse To enable neurons to adaptively adjust their RF sizes according to the stimulus content, an element-wise summation gate is adopted to integrate the information from the two branches. We generate the mixed feature map $Y \in \mathbb{R}^{W \times H \times C}$ by the summation gate, i.e., $Y = Y_1 + Y_2$. Then, the global information of $Y$ is generated by global average pooling and denoted as $g \in \mathbb{R}^{C}$, i.e., $g = F_{gp}(Y)$. To prevent biased magnitudes of the coefficients across samples, we apply a normalization $N$ to $g$ over the channel dimension. Further, the normalized global feature vector is sent to an fc layer, implemented as a convolutional layer with a $1 \times 1$ kernel followed by Batch Normalization and a ReLU function, which also reduces the dimension of $g$ for better efficiency. The obtained compact feature vector is expressed as $g_1 \in \mathbb{R}^{C_1}$, i.e., $g_1 = fc(N(g))$, where $C_1$ is the dimension after the reduction by the $1 \times 1$ convolution. The relationship between $C$ and $C_1$ is controlled by a reduction ratio $r$ such that $C_1 = \max(C/r, 32)$, i.e., $C_1$ is never less than 32. In addition, we embed the SGE module into the two branches to generate $Y'_1 = \mathrm{SGE}(Y_1)$ and $Y'_2 = \mathrm{SGE}(Y_2)$ with spatial group-wise enhanced attention, respectively. Here, $fc$, $N$ and $F_{gp}(\cdot)$ refer to the above fc layer, the normalization and global average pooling, and SGE refers to the operations of the SGE module.
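As a worked illustration of the reduction rule in the Fuse stage, the following sketch evaluates $C_1$ for a few channel widths. It assumes integer division for $C/r$, the reduction ratio $r = 16$ reported in the implementation details, and the usual ResNet50 stage output widths (256, 512, 1024, 2048); the function name `reduced_dim` is hypothetical and does not come from the paper.

```python
def reduced_dim(channels, r=16, floor=32):
    """Dimension C_1 of the compact vector g_1 in the Fuse stage.

    Assumption: C / r is an integer division and the lower bound is 32,
    as stated in the text.
    """
    return max(channels // r, floor)


# Assumed ResNet50 stage output widths; only r = 16 comes from the paper.
for c in (256, 512, 1024, 2048):
    print(c, "->", reduced_dim(c))
```

For the two narrower stages the lower bound of 32 is the active constraint, so the reduction only takes effect once $C/r$ exceeds 32.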
Aggregation A softmax operator is applied to the global feature vector $g_1$ to select information from different RFs, which can be regarded as a soft attention mechanism. Through this operation, we obtain two informative weight vectors $v_1$ and $v_2$ corresponding to $Y_1$ and $Y_2$, respectively. With the two generated weight vectors, we obtain the strengthened feature maps $A \in \mathbb{R}^{W \times H \times C}$ and $B \in \mathbb{R}^{W \times H \times C}$ by employing $v_1$ and $v_2$ to scale the feature maps of the corresponding branches.
The standard residual block with our proposed SKA module is shown in Fig. 4. The proposed SKA module is lightweight and introduces little additional computation and few parameters, so it can be easily embedded into any mainstream backbone network. Further, it is highly efficient in learning informative feature maps for fine-grained visual classification tasks. We add the SKA module to the standard residual block of ResNet to build a novel and efficient block that constitutes the core of our feature extractor. With this feature extractor, we can extract informative feature maps.
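The soft selection in the Aggregation stage can be sketched as follows for a single spatial position. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that each branch receives one logit per channel derived from $g_1$ (as in SKNet [22]), that the softmax is taken across the two branches so that the weights sum to one per channel, and that the block output is the sum of the two scaled branches. The names `select_and_aggregate`, `z1` and `z2` are hypothetical.

```python
import math


def select_and_aggregate(y1, y2, z1, z2):
    """Two-branch soft selection for one spatial position.

    y1, y2: per-channel features of the 3x3 and 5x5 branches.
    z1, z2: per-channel logits of each branch (assumed to come from g_1).
    Returns the aggregated features and the (v1, v2) weights per channel.
    """
    out, weights = [], []
    for a, b, la, lb in zip(y1, y2, z1, z2):
        m = max(la, lb)  # subtract the max for numerical stability
        ea, eb = math.exp(la - m), math.exp(lb - m)
        v1, v2 = ea / (ea + eb), eb / (ea + eb)  # softmax across branches
        weights.append((v1, v2))
        out.append(v1 * a + v2 * b)  # assumed aggregation: A + B
    return out, weights


features, weights = select_and_aggregate(
    [2.0, 4.0], [4.0, 0.0], [0.0, 1.0], [0.0, 1.0]
)
print(features, weights)
```

With equal logits the two branches are weighted equally; a larger logit for one branch shifts the effective receptive field toward that kernel size.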

Multi-scale regularizer
Multi-scale learning (MSL) has been shown to be useful for numerous visual tasks [5,18,27,28]. Mid-level feature maps usually carry more precise location information, while high-level ones carry more discriminative semantic information. Thus, we apply the simple idea that multi-scale feature maps extracted from different layers are combined to form a pyramid structure for prediction, as FPN [26] does. However, the relationships between the feature maps of different scales remain to be exploited.
Using the above feature extractor, we first extract feature maps from the mid-level and high-level layers. As shown in Fig. 2, let $F_3$ and $F_4$ be the feature maps of different layers (3 and 4 refer to stage3 and stage4 of ResNet depicted in Fig. 2), where $C$ is the number of feature channels and $H \times W$ is the spatial size of each feature map. Then, $F_3$ and $F_4$ are fed to a CBAM layer [42] to strengthen the semantic information. In this layer, $\sigma$, AvgPool, MaxPool and MLP refer to the sigmoid function, global average pooling, global max pooling and a multilayer perceptron, respectively, $f^{7 \times 7}$ refers to a convolutional layer with a kernel size of $7 \times 7$, and $\hat{F}_n$ is the output of the CBAM layer.
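For reference, the symbols listed above match the standard CBAM formulation of [42], which can be written as below. This is a sketch of that standard formulation rather than a recovered equation from this paper: the names $M_c$, $M_s$ and $F'_n$ and the element-wise product $\otimes$ follow the CBAM paper, and the text only states that the layer is similar to the CBAM module, so the authors' exact variant may differ. In the spatial branch of CBAM, the two pooling operations are taken along the channel axis and their results are concatenated.

```latex
\begin{align}
M_c(F_n) &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F_n)) + \mathrm{MLP}(\mathrm{MaxPool}(F_n))\big), \\
F'_n &= M_c(F_n) \otimes F_n, \\
M_s(F'_n) &= \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F'_n);\, \mathrm{MaxPool}(F'_n)])\big), \\
\hat{F}_n &= M_s(F'_n) \otimes F'_n, \qquad n \in \{3, 4\}.
\end{align}
```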
In particular, $\hat{F}_3$ and $\hat{F}_4$ are the outputs corresponding to $F_3$ and $F_4$. Afterward, three prediction distributions, $P_3$, $P_4$ and $P$, are obtained from the last fully connected layers and a softmax function. Note that $P$ corresponds to $F_4$, the output feature map of the original feature extractor. To explore the relationships between the feature maps extracted from stage3 and stage4, we propose a regularizer that matches the different prediction distributions. In terms of implementation, KL-divergence is applied as the constraint, yielding the loss $L_{msl}$, in which $C$ refers to the number of classes, $N$ denotes the size of a mini-batch, and $p_{ij}$ refers to the probability of the $i$-th sample belonging to the $j$-th category. Minimizing $L_{msl}$ encourages $P_3$ to match $P_4$. A similar regularizer can be added to constrain the pairs $(P_3, P)$ and $(P, P_4)$ as well.
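A KL-divergence regularizer of this kind can be sketched in a few lines. The text does not preserve the exact formula, so the direction of the divergence ($P_4$ as the reference distribution, $P_3$ as the one being matched) and the averaging over the $N$ samples of the mini-batch are assumptions, as is the function name `kl_regularizer`.

```python
import math


def kl_regularizer(p3, p4, eps=1e-12):
    """Mean over the mini-batch of KL(P_4 || P_3).

    p3, p4: lists of N probability vectors of length C (rows sum to 1).
    Assumptions: the divergence direction and the 1/N averaging.
    """
    total = 0.0
    for row3, row4 in zip(p3, p4):
        for q, p in zip(row3, row4):
            if p > 0.0:  # terms with p = 0 contribute nothing
                total += p * math.log(p / max(q, eps))
    return total / len(p4)


p3 = [[0.25, 0.75]]
p4 = [[0.50, 0.50]]
print(kl_regularizer(p3, p4))  # positive when the distributions differ
print(kl_regularizer(p4, p4))  # zero when they coincide
```

The loss vanishes only when the mid-level and high-level predictions agree, which is what pushes $P_3$ toward $P_4$ during training.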

Optimization
Given the prediction distributions $P_3$, $P_4$ and $P$, the classification loss is computed with the cross-entropy loss $L_c$ against the ground-truth label vector $P^*$, where $C$ refers to the number of classes and $N$ denotes the size of a mini-batch. Finally, the whole model is optimized by a loss function that combines the classification loss with $L_{msl}$, where $\alpha$ is a hyper-parameter that balances the contributions of the different parts. In our settings, $\alpha = 1$.
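The overall objective can be illustrated with the following sketch. It assumes that the classification loss is the unweighted sum of the cross-entropy losses of the three prediction heads, each averaged over the mini-batch, and that the regularizer enters with weight $\alpha$; the exact form is not preserved in the text, and the names `cross_entropy` and `total_loss` are hypothetical.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy L_c over a mini-batch.

    probs: list of N probability vectors; labels: list of N class indices.
    """
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def total_loss(p3, p4, p, labels, l_msl, alpha=1.0):
    """Assumed objective: sum of three cross-entropy terms plus alpha * L_msl."""
    l_cls = cross_entropy(p3, labels) + cross_entropy(p4, labels) + cross_entropy(p, labels)
    return l_cls + alpha * l_msl


probs = [[0.5, 0.5]]  # one sample, two classes, uniform prediction
print(total_loss(probs, probs, probs, labels=[0], l_msl=0.0))
```

With $\alpha = 1$, as in the paper's settings, the regularizer and the classification terms contribute on the same scale.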

Network architecture
Using standard residual blocks with SKA modules, the overall feature extractor architecture of SKA-ResNet50 is listed in Table 1. For comparison, Table 1 also displays the architectures of three other models: ResNet50, which is the backbone network of the proposed method, and SGE-ResNet50 and SK-ResNet50, two existing strong methods based on lightweight attention modules. Similar to ResNet, the proposed SKA-ResNet mainly consists of a stack of repeated residual blocks termed "SKA blocks." Each SKA block is composed of a sequence of convolution layers and a lightweight SKA module. The SKA module can be regarded as an independent unit, and we obtain the SKA block by adding this unit to the end of the sequence of operations within the standard residual block. Owing to the efficient design of the SKA module, SKA-ResNet50 leads to only a 2% increase in the number of parameters and a 1.8% increase in computational cost compared with ResNet50. Further, although it combines adaptive RFs with the SGE attention mechanism, SKA-ResNet50 introduces hardly any increase in parameters compared with SK-ResNet50, because there are no convolution layers in the SGE attention mechanism; it also brings only a slight increase in computational cost. The SKA block has two important hyper-parameters: the cardinality, which determines the number of groups in the SGE attention mechanism, and the reduction ratio $r$, which controls the number of parameters of the fc layer in the Fuse stage. For the overall structure of the network, we adopt a topology similar to ResNet. In particular, Table 1 shows the structure of a 50-layer SKA-ResNet, which has four stages with {3, 4, 6, 3} SKA blocks, respectively. By varying the number of SKA blocks in each stage, one can obtain different architectures. In this study, we adopt SKA-ResNet50 as the primary architecture by default.
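The "50-layer" naming follows the usual ResNet convention, which the short sketch below reproduces. It assumes the standard bottleneck design with three convolutions per block plus one stem convolution and one final fc layer, and it does not count the extra convolutions inside the SKA modules; the helper name `resnet_depth` is hypothetical.

```python
def resnet_depth(blocks_per_stage, convs_per_block=3):
    """Nominal depth: stem conv + bottleneck convs + final fc layer."""
    return 1 + convs_per_block * sum(blocks_per_stage) + 1


stages = [3, 4, 6, 3]  # SKA blocks per stage, as given in the text
print(sum(stages), "blocks,", resnet_depth(stages), "layers")  # 16 blocks, 50 layers
```

Changing the per-stage block counts, for example to the standard ResNet101 layout [3, 4, 23, 3], yields deeper variants under the same convention.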

Dataset
In this paper, we propose a new dataset of surgical instruments for inventory work, called the surgical instrument dataset (SID19). To our knowledge, it is the first dataset to support research on surgical instrument classification and recognition that collects surgical tools from the specific instrument kits of the most common surgeries. Existing datasets revolving around surgical instruments, e.g., NeuroSurgicalTools [4], mainly focus on real-time tool detection, segmentation and tracking. All of these datasets are proposed to give doctors a more precise understanding of the operation. Different from previous works, SID19 is introduced for classification in inventory work and has two unique attributes. First, the surgical instruments are collected from the specific surgical tool kits of the corresponding common surgeries. Second, every tool is photographed individually to ensure accurate classification results under different views. The proposed dataset contains many fine-grained categories that are difficult to distinguish, and the task is regarded as a medical task in which high precision should be guaranteed. Hence, each image in this dataset contains only one surgical instrument. To support research on surgical instruments and present our sample library more clearly, we will publish SID19 on GitHub soon.
Generally speaking, one operation corresponds to one surgical instrument kit. And in fact, some surgical instrument kits contain a lot of the same surgical instruments, such as appendectomy kit and cholecystectomy kit. Therefore, based on the three most common operations including appendectomy, cholecystectomy and cesarean section, we introduce a new dataset, SID19, to collect the images of surgical tools in the three corresponding kits. SID19 consists of 19 classes with 3,800 images. And each class in this dataset contains 200 images. Note that, the dataset not only contains coarse-grained classes easy to identify, but also contains fine-grained classes that are difficult to differentiate. For example, Alice forceps and Tissue tweezers belong to coarse-grained classes, but Alice forceps and Appendix forceps belong to fine-grained ones as shown in Fig. 5. Note that, we generate the dataset in the Inside the brackets is the general shape of a residual block, including filter sizes and feature dimensionalities. The number of stacked blocks on each stage is presented outside the brackets. All modules are embedded into the end of the standard residual block. #Params. denotes the number of parameters and GFLOPs represents the number of multiply-adds daytime and night with powerful lights to simulate the circumstance of an operating room. When collecting images, each surgical instrument is placed on a black lightabsorbing cloth with various postures. Especially, as to forceps and scissors classes, we adopt two strategies including open state and closed state to collect the images. Furthermore, the collecting strategy about different angles is adopted when capturing images, which is of great significance because some surgical instruments are identical in the main view but belong to different fine-grained classes, such as Straight hemostatic forceps and Elbow hemostatic forceps. Thus, it is essential to obtain their images in a side view with a specific angle. 
During data collection, side-view images are taken at an angle between 30° and 60°. In summary, all images in SID19 are collected across different time slots, open/closed states, postures and views. The collection environment is a unified workbench covered with a black light-absorbing cloth. Besides, all images are obtained with a camera at the same resolution of 3456 × 3456. In short, the same shooting environment and different shooting requirements are used to guarantee both the unity and the variety of the dataset.

Implementation details
We conduct the experiments on the newly proposed surgical instrument dataset, SID19, which has 19 classes and 200 images per class. The ratio of the training set to the testing set is three to two. We use data sharding for distributed training on SID19, evenly partitioning the data across GPUs. In the data processing stage, the images are RGB-normalized via mean/standard-deviation rescaling. The input images are resized to 256 × 256 for both training and testing, and a random resized crop is then applied to each image to obtain a 224 × 224 input.
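As a minimal sketch of the 3:2 split and the even sharding described above, the following Python snippet assumes that the split is drawn separately within each class and that shards are dealt round-robin. Neither detail is stated in the text, and the function names, the random seed and the number of GPUs are purely illustrative.

```python
import random

NUM_CLASSES = 19        # stated in the text
IMAGES_PER_CLASS = 200  # stated in the text
TRAIN_FRACTION = 3 / 5  # training : testing = 3 : 2


def stratified_split(labels, train_fraction=TRAIN_FRACTION, seed=0):
    """Split sample indices per class so every class keeps the same ratio.

    Assumption: the split is stratified by class; the text only gives the ratio.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


def shard(indices, num_shards):
    """Deal indices round-robin so that shard sizes differ by at most one."""
    return [indices[rank::num_shards] for rank in range(num_shards)]


# One integer label per image, 200 images for each of the 19 classes.
labels = [c for c in range(NUM_CLASSES) for _ in range(IMAGES_PER_CLASS)]
train_idx, test_idx = stratified_split(labels)
shards = shard(train_idx, num_shards=4)  # 4 GPUs is an arbitrary example
print(len(train_idx), len(test_idx), [len(s) for s in shards])
```

Under these assumptions each class contributes 120 training and 80 testing images, so the training set holds 2,280 images and the testing set 1,520.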
Furthermore, random horizontal and vertical flips are employed in the training and testing stages. We train on SID19 for 50 epochs, and the default batch size is set to 64. The base learning rate is set to 0.01 (0.1 for VGGNet [37]) and is divided by 10 at one half and at three quarters of the 50 epochs. The cardinality parameter is set to 32, which generates 32 group-wise enhanced attention maps, following the fixed optimal structure of ResNeXt50 [43]. The reduction ratio r is set to 16. Specifically, we employ the CBAM layer after conv4_6 and conv5_3 to generate the enhanced feature maps F_3 and F_4 for multi-scale learning in ResNet50. The sizes of the two enhanced feature maps are 28 × 28 × 1024 and 14 × 14 × 2048. Two different fc layers then generate the corresponding prediction distributions. We adopt top-1 accuracy as the evaluation criterion, and the loss is measured with the cross-entropy function. All experiments are implemented in Python 3.6 with the PyTorch framework.
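The learning rate schedule above can be written as a simple step function. The sketch below is illustrative only: it assumes 0-indexed epochs and that the three-quarter milestone (37.5) is rounded down to epoch 37, neither of which is specified in the text, and `step_lr` is a hypothetical helper name.

```python
def step_lr(epoch, base_lr=0.01, total_epochs=50,
            milestones=(0.5, 0.75), gamma=0.1):
    """Return the learning rate used during a given 0-indexed epoch.

    The rate is multiplied by `gamma` once for every milestone already reached.
    Assumption: fractional milestones are truncated (0.75 * 50 -> epoch 37).
    """
    lr = base_lr
    for fraction in milestones:
        if epoch >= int(total_epochs * fraction):
            lr *= gamma
    return lr


# Print the rate at the start of training and around each decay point.
for epoch in (0, 24, 25, 36, 37, 49):
    print(epoch, step_lr(epoch))
```

In PyTorch, a comparable schedule could be obtained with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[25, 37], gamma=0.1)`, under the same rounding assumption.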

Ablation study
Batch Size The batch size controls the number of samples in a mini-batch for each training and testing iteration. Because the batch size strongly influences the weight updates and the generalization performance of a model, choosing an appropriate value is essential.
In the experiments, we adopt 16, 32, 64 and 128 as the mini-batch size, respectively. From Table 2, we conclude that as the batch size increases, the performance of SKA-ResNet and ResNet first increases and then tends to be stable. Based on these results, we recommend a batch size of 32 or 64, which gives accurate results without occupying too much memory. In subsequent experiments, we use 32 as the batch size by default.
Scale The input data are generally resized before being sent to the network, and the scale has a direct impact on the classification results. If the scale is too small, there will be serious information loss. Conversely, if the scale is too large, the abstraction level of the information is not high enough and the computational cost grows. Generally, input data are resized to a scale of 224 × 224. In the experiments, four different scales are investigated, as shown in Table 3. The performance tends to increase gradually as the scale increases. However, the gain is inconspicuous from the scale of 224 to 448. Therefore, the input data are resized to 224 × 224 unless otherwise specified.

Effectiveness of MSL
Table 4 reveals the effectiveness of our multi-scale learning along with the proposed regularizer under different constraint strategies. Specifically, we express the regularizer by '+', indicating which two feature maps are subject to the constraint. There are three relationships among P_3, P_4 and P. The strategy P_3 + P_4 means that we encourage the prediction distribution P_3 to match P_4. The intentions of P + P_4 and P_3 + P are analogous. In the experiments, we also combine the three individual strategies to form three further strategies. As shown in Table 4, the strategy combining all of the individual strategies P_3 + P_4, P + P_4 and P_3 + P achieves the best performance.
We also find that the individual strategy P_3 + P_4 achieves better performance than P + P_4 and P_3 + P. P_3 corresponds to stage3, which contains more precise location information, and P_4 is the output of stage4, which bears more discriminative semantic information. Exploring the relationship between these two parts is more effective than the other two strategies. Besides, the methods with the CBAM layer outperform those without it.
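To make the constraint strategies of the multi-scale regularizer concrete, the following sketch computes a KL-divergence penalty over selected pairs of the prediction distributions P_3, P_4 and P. It is a hedged illustration rather than the exact loss used in the experiments: the direction of each KL term, the equal weighting of the pairs, and the names `kl_divergence` and `msl_regularizer` are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def msl_regularizer(p3, p4, p,
                    pairs=(("p3", "p4"), ("p", "p4"), ("p3", "p"))):
    """Sum of KL terms over the chosen pairs of prediction distributions.

    Assumption: each pair (a, b) contributes KL(a || b) with unit weight.
    Passing a single pair reproduces one individual strategy, e.g. P_3 + P_4.
    """
    dists = {"p3": p3, "p4": p4, "p": p}
    return sum(kl_divergence(dists[a], dists[b]) for a, b in pairs)


# Toy logits for a 3-class problem (made-up numbers, for illustration only).
p3 = softmax([2.0, 1.0, 0.1])
p4 = softmax([2.5, 0.5, 0.2])
p = softmax([2.2, 0.8, 0.1])
print(msl_regularizer(p3, p4, p, pairs=(("p3", "p4"),)))  # single strategy
print(msl_regularizer(p3, p4, p))                         # all three combined
```

In training, such a penalty would be added to the cross-entropy loss, so that the distributions predicted at different scales are pulled toward each other.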
Effectiveness of SKA module The effectiveness of our SKA module is studied in Fig. 6 and Table 5. For a fair comparison, none of the methods uses the multi-scale learning strategy. As the core of SKA-ResNet, the SKA block adds a novel selective kernel mechanism with attention to the end of the standard residual block, which improves the accuracy of ResNet50 from 95.014% to 97.042%. Compared with the SGE module and the SK module, the SKA module is equivalent to equipping the multi-branch adaptive kernel selection feature maps with spatial group-wise enhanced attentions, and the accuracy is further boosted from 96.203% and 96.197% to 97.042%. Meanwhile, SK-ResNet50, SGE-ResNet50 and our model all converge noticeably faster than ResNet50. As shown in Fig. 6, the proposed model converges at a comparable rate while reaching the highest performance. Furthermore, the numbers of parameters and calculations of the four models are displayed in Table 5. We conclude that our model obtains the best performance without introducing too many additional parameters or calculations.
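The adaptive kernel selection that the SKA block builds on can be illustrated with a per-channel softmax over branches. The sketch below shows only this generic soft selection step in the spirit of selective kernel networks. It is not the SKA block itself: the spatial group-wise enhanced attention, the layers that produce the selection scores (including the reduction ratio r) and the spatial dimensions are all omitted, and `select_branches` is a hypothetical name.

```python
import math


def select_branches(branch_features, branch_logits):
    """Fuse M branches with a per-channel softmax over the branches.

    branch_features[m][c]: response of branch m (e.g. one kernel size) in channel c.
    branch_logits[m][c]:   unnormalised selection score for the same entry.
    Returns the fused channel vector and the selection weights per channel.
    """
    num_branches = len(branch_features)
    num_channels = len(branch_features[0])
    fused, weights = [], []
    for c in range(num_channels):
        scores = [branch_logits[m][c] for m in range(num_branches)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]  # weights over branches sum to 1
        weights.append(w)
        fused.append(sum(w[m] * branch_features[m][c]
                         for m in range(num_branches)))
    return fused, weights


# Two branches, two channels (made-up numbers). Equal scores in channel 0
# average the branches; channel 1 leans toward the second branch.
features = [[1.0, 2.0], [3.0, 4.0]]
logits = [[0.0, 0.0], [0.0, 2.0]]
fused, weights = select_branches(features, logits)
print(fused, weights)
```

Because the weights are normalised across branches for every channel, each channel can favour a different receptive field, which is the behaviour the selective kernel mechanism is meant to provide.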

Comparison with state-of-the-art on SID19
Comparison with Lightweight Attention modules We compare our SKA-ResNet with several prevailing methods that embed lightweight attention modules. Quantitative experimental results on SID19 are exhibited in Table 6. For a fair comparison, all displayed methods are implemented with the proposed multi-scale regularizer on a unified ResNet50 backbone, and all of the attention modules are employed after the last BatchNorm layer within every bottleneck of ResNet50. As shown in Table 6, the proposed SKA-ResNet, whose SKA module generates attention maps with an adaptive kernel selection mechanism, achieves the best overall performance against the prevalent attention modules. SGE-ResNet and SK-ResNet achieve a close accuracy of around 96.9% by leveraging spatial group-wise enhanced attention [...]
In Table 7, we compare our SKA-ResNet with attention-based methods for FGVC on SID19. All the displayed models are weakly supervised and introduce additional attention networks to learn the representation of discriminative regions. The attribute column "1-Stage" in the table indicates that a method can be trained and tested end-to-end in only one stage. From the presented statistics, our method achieves state-of-the-art performance on SID19, even though RA-CNN and NTS-Net employ recurrent crops and multi-scale crops, respectively. MA-CNN and MAMC attain similar results of around 96.8% owing to the introduction of multiple feature maps; however, their classification performance is around 0.9% lower than that of our method. By focusing on one discriminative region, RA-CNN obtains a better result than MA-CNN and MAMC, but our SKA-ResNet obtains a further 0.8% relative improvement. Furthermore, our method also achieves the best performance among the one-stage methods.
Most importantly, compared with these attention-based methods, which introduce additional attention networks, our method achieves the best performance while adding only a small computational burden and few parameters, owing to the lightweight attention module and the multi-scale regularizer.
Comparison with Other Methods for FGVC Other methods have been introduced for solving FGVC, such as high-order statistics learning and multi-scale feature relationship learning. As shown in Table 8, non-attention-based methods for FGVC are implemented on SID19. For a fair comparison, all of the displayed models use the same unified ResNet50 as the backbone. In addition, we also implement bilinear pooling and cross-X learning on VGG-D and SENet, respectively. From Table 8, we observe that iSQRT-COV, a high-order statistics method that builds on bilinear pooling and compact bilinear pooling, achieves an accuracy of 97.185% and outperforms these two methods by about 0.5%. As for multi-scale feature relationship learning, the two models based on cross-X learning achieve a close performance of around 97.2%, which is about 0.5% lower than that of our method. We conclude that all of the methods in Table 8 have lower accuracy than the proposed method. The improvement indicates the effectiveness of the two main components of SKA-ResNet.
Table note: All these methods are implemented with the proposed multi-scale regularizer. The third column indicates whether the method is trained and tested in one stage or not.

Comparison with state-of-the-art on FGVC datasets
The comparison results on four challenging FGVC datasets, namely CUB-200-2011 (Birds), Stanford Cars (Cars), Stanford Dogs (Dogs) and FGVC Aircraft (Aircraft), are reported in Table 9. Considering that we do not use any bounding box/part annotations in any of our experiments, some of the compared approaches depending on bounding box/part annotations are not presented in parentheses, for direct comparison. From Table 9, we can see that our approach achieves state-of-the-art or comparable results on the four datasets. In particular, we obtain the best accuracy (as highlighted by the bold values in Table 9) on CUB-200-2011, Stanford Cars and Stanford Dogs. Meanwhile, the result of our method on Aircraft is comparable with the state-of-the-art methods. Experimental results are grouped into three parts in Table 9. As a strong baseline, the results of ResNet-50 by itself are shown in the first part, and our SKA-ResNet outperforms it on all datasets. The results of a number of attention-based methods are presented in the second part. Compared with these approaches, which focus on constructing complex attention networks for discriminative regions, our approach embeds a lightweight adaptive kernel selection module with SGE attentions into the residual blocks to strengthen the intermediate feature maps. In summary, we achieve state-of-the-art or comparable performance on the four datasets, and it is worth noting that our approach does not introduce too many parameters and calculations. Furthermore, we report the results of other FGVC methods in the third part. We find that our approach outperforms iSQRT-COV and cross-X learning, which are state-of-the-art feature relationship learning methods. Moreover, the optimization of our method is much easier because the SKA module is embedded in the feature extractor.

Network visualization
Fig. 7 depicts the resized activation maps of six images from SID19 (including Elbow hemostatic forceps1, Elbow hemostatic forceps2, Needle holder, Integrated tissue scissors, Straight surgical scissors1 and Straight surgical scissors2) based on four different models: ResNet50, SGE-ResNet50, SK-ResNet50 and SKA-ResNet50. The activation map of a certain layer usually strongly emphasizes the discriminative regions of the input image. Observing the highlighted regions in the activation map therefore gives an intuitive understanding of how the network works at that layer. We obtain the activation maps with the method of [49]. The softmax scores, i.e., the probabilities for the target class, are displayed below the corresponding activation maps for qualitative analysis.
From Fig. 7, we can clearly observe that SKA-ResNet50 covers more complete object regions than the other three models. Meanwhile, the discriminative regions in the activation maps of SKA-ResNet50 are even brighter. In terms of softmax scores, SKA-ResNet50 yields more precise probabilities than the other networks, which is consistent with the activation maps.
Table note: The displayed methods include high-order statistics learning and multi-scale feature relationship learning.
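For readers unfamiliar with activation maps, the sketch below shows one common way to compute them, assuming a class-activation-map style weighted sum of the last convolutional feature maps followed by min-max normalisation. Whether the method of [49] uses exactly this weighting is not stated here, and `class_activation_map` is a hypothetical helper, so the snippet should be read as an illustration only.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps, rescaled to the range [0, 1].

    feature_maps[k][y][x]: k-th channel of the last convolutional output.
    class_weights[k]:      weight linking channel k to the target class
                           (assumed to come from the final fc layer).
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
            for x in range(width)]
           for y in range(height)]
    lowest = min(min(row) for row in cam)
    highest = max(max(row) for row in cam)
    span = highest - lowest
    if span == 0:
        # A constant map carries no localisation information.
        return [[0.0 for _ in row] for row in cam]
    return [[(value - lowest) / span for value in row] for row in cam]


# Two 2 x 2 channels with made-up activations and weights.
maps = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 2.0]]]
cam = class_activation_map(maps, class_weights=[1.0, 1.0])
print(cam)
```

The normalised map would then be resized to the input resolution and overlaid on the image, so that brighter regions mark the locations contributing most to the target class.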

Conclusion
In this paper, we carry out a series of studies on surgical instrument image classification. Firstly, the task of surgical instrument image classification is proposed to assist medical staff with the inventory work of medical instruments, with the aim of reducing the risk of surgical instrument loss after surgery. Secondly, we collect the first surgical instrument dataset, SID19, based on the three most common surgeries to support the research. More importantly, we propose a novel attention-based model called SKA-ResNet, with lightweight SKA modules to strengthen the informative feature maps of discriminative regions and a multi-scale regularizer to exploit the relationships between different feature maps. A large number of state-of-the-art classification models are implemented on the proposed dataset and on four challenging FGVC datasets. Experimental results show that our approach achieves state-of-the-art performance, which is enough to serve as the theoretical basis for inventory work. Ablation studies further support the effectiveness of the components in SKA-ResNet.
Fig. 7 Visualization results of ResNet50, SK-ResNet50, SGE-ResNet50 and SKA-ResNet50. The activation map is calculated for the last convolutional outputs. The ground-truth label is shown on the top of each input image and P denotes the softmax score of each network for the ground-truth class.
In future work, we plan to extend this study to the object detection of surgical instruments for inventory work. Specifically, we will first annotate the objects in the existing surgical instrument dataset and combine the fine-grained image classification algorithm proposed in this paper with an object detection algorithm, so that object detection on single-object images is realized first. Then, the object detection of multiple similar surgical instruments can be realized step by step, which further serves the inventory work of surgical instruments.