Introduction

Knowledge distillation has shown promising results for classifying skin lesions in dermatology images [1,2,3,4,5]. It encompasses three types of knowledge: logits-based, intermediate feature-based, and relationship-based knowledge [6]. Logits-based knowledge distillation [7, 8] enables the student to mimic the teacher's soft output by adjusting the temperature parameter; knowledge is transferred through logits by minimizing the Kullback–Leibler (KL) divergence between the student's and teacher's final outputs. Intermediate feature-based knowledge distillation [9,10,11] transfers the teacher's intermediate features to the student as knowledge. While the approaches above transfer knowledge between individual instances, they overlook the valuable knowledge embedded in the relationships among multiple instances. Relationship-based knowledge distillation [14, 21] can therefore yield superior performance by transferring these inter-instance relationships, which carry stronger representational power, to the student. Motivations: However, in traditional knowledge distillation, the student typically mines information from the teacher passively, which restricts its learning potential. This limitation becomes more pronounced when there is a knowledge gap between the student and teacher models in skin lesion classification, as the passive learning paradigm impedes the student's ability to fully mine the teacher's knowledge. Furthermore, the teacher model may contain information that is unnecessary for classifying dermatology images or retain misleading pseudo-label interference, and indiscriminately mimicking the teacher's knowledge can hinder the student's potential for classifying dermatology images. A more effective approach is for the student to actively mine knowledge from the teacher based on the specific challenges it encounters when classifying dermatology images.

Boosting algorithms [22, 23] sequentially learn multiple weak classifiers (i.e., base classifiers) and then combine them with weights to obtain a robust classifier. The AdaBoost algorithm [24, 30] is the most typical Boosting algorithm: sample weights are readjusted based on the error of each base classifier, so that misclassified samples receive higher weights and are emphasized by the following base classifier, and this process is repeated iteratively. The weight of each base classifier is estimated from its performance, and the weighted base classifiers are then integrated into a robust classifier. Incorporating AdaBoost principles into deep neural network training has been shown to improve the representational power of network models [25,26,27,28,29,30,31,32,33]. For instance, Taherkhani et al. [27] proposed AdaBoost-CNN by combining AdaBoost with a convolutional neural network (CNN), successfully addressing multi-class imbalanced sample classification. Shakeel et al. [29] utilized the AdaBoost concept in ensemble neural networks and analyzed lung features to classify normal and abnormal lung categories. Huang et al. [34] developed BoostResNet, which uses the Boosting principle to train ResNet step by step, and validated its efficacy in enhancing model training theoretically. In addition, Sun et al. [30] introduced the AdaBoost algorithm into graph convolutional networks (GCNs) [30, 31] to integrate the information of neighboring nodes of different orders in an AdaBoost manner, effectively overcoming the over-smoothing problem caused by aggregating graph convolutional layers multiple times. Motivations: The student often encounters diverse challenges in knowledge distillation. A better paradigm is for the student to first identify its learning difficulties and then actively mine relevant knowledge from the teacher to facilitate the distillation process. It is therefore valuable to explore the potential of AdaBoost in helping the student sense the difficulty of classifying dermatology images. Being able to sense difficulty levels could assist the student in deciding the appropriate granularity at which to draw on the teacher's knowledge given its current comprehension.

Zhou et al. [32] proposed a weighted soft-label distillation framework, WSLD, which assigns a dynamic weight to the distillation loss to determine the extent to which the teacher's soft-label information is utilized, based on the cross-entropy loss of both the student and the teacher. However, a student's learning is a sequential process in which it encounters varying difficulties at different stages. The WSLD framework only mines logits-based knowledge, neglecting the representational knowledge inherent in each stage and failing to utilize intermediate feature-based and relationship-based knowledge. Meanwhile, this method cannot avoid noise interference from the teacher. Innovations: This research introduces AdaBoost and a variational difficulty mining strategy (VDMS) into knowledge distillation and proposes a distillation framework called Variational AdaBoost Knowledge Distillation (VAdaKD). The framework helps the student determine the “granularity” at which to mine the teacher's knowledge by considering the learning difficulties of skin lesion classification in dermatology images. Specifically, we apply AdaBoost to treat all stages of the student model as a sequential learning process and introduce an intermediate auxiliary classifier for each stage, where the weights of the input samples at each stage correspond to the degree of learning difficulty. The student then actively leverages the teacher's knowledge according to its learning difficulty at each stage, facilitating targeted knowledge transfer. Finally, the final distillation loss is obtained by linearly weighting the base classifiers according to their weights. Three forms of knowledge are incorporated into the distillation loss: logits-based, intermediate feature-based, and relationship-based. However, as the student's learning progresses, it becomes increasingly challenging to identify the degree of learning difficulty in later stages due to noise interference from the teacher. Therefore, we first adopt the idea of the GCN to construct the nearest-neighbor relationship matrix \(A\), which is used to calculate the information of the current node's \(l\) th hop and relay it to the student. The goal is to enable the student to perceive nuanced classification difficulties by leveraging the multi-hop information among dermatology samples while maintaining the same nearest-neighbor relationship as the teacher. Next, we eliminate noise interference from these nuanced difficulties by maximizing the mutual information between the teacher and student. Contributions: The main contributions of this paper can be summarized as follows:

  1. This paper proposes a Variational AdaBoost Knowledge Distillation framework, VAdaKD, to address the limitations of conventional knowledge distillation methods, where the student passively learns the teacher's knowledge. VAdaKD offers a more active paradigm for knowledge distillation, allowing the student to determine the “granularity” in mining the teacher's knowledge within this framework.

  2. VAdaKD employs a two-step strategy to improve the efficiency of AdaBoost-based knowledge distillation for categorizing dermatology images. Initially, the student is empowered to actively mine the teacher's learning representation through AdaBoost. Subsequently, a variational difficulty mining strategy (VDMS) is introduced to reduce the influence of noise from the teacher by maximizing the mutual information shared between the teacher and student.

  3. Finally, we formulate the weighted distillation loss with sample-level weights to effectively incorporate three types of knowledge. Our research involves extensive experiments on three well-known dermatological datasets, namely Dermnet, ISIC2019, and HAM10000. The results of our experiments clearly show the efficacy of the proposed VAdaKD method. Additionally, the visualization results confirm that VAdaKD excels in identifying learning challenges and reducing interference from teacher noise at various stages, consequently enhancing the classification accuracy of dermatology images.

In contrast to previous work on knowledge distillation, our proposed VAdaKD introduces a novel paradigm for actively mining the teacher's knowledge, leading to a more comprehensive perception of the knowledge in the teacher while minimizing the presence of unnecessary information. The highlights of this paper are outlined below:

  1. Propose a distillation framework, VAdaKD, for actively mining the teacher's learning representation for skin lesion classification.

  2. Design a GCN-based difficulty mining strategy to perceive more nuanced classification difficulties.

  3. Construct the weighted distillation loss with sample-level weights to effectively engage three forms of knowledge.

Section "Introduction" of this paper outlines the limitations of traditional knowledge distillation methods and proposes potential solutions. Section "Related work" reviews relevant literature, while Sect. "Methodology" introduces our proposed method. Experimental results and visualizations on three datasets are presented in Sect. "Experiments" to demonstrate the efficacy of our method. The results of the research are discussed in Sect. "Conclusion".

Related work

Knowledge distillation

Knowledge distillation, renowned for its ability to compress models, enables knowledge transfer from a cumbersome teacher model to a compact student model, allowing a trained skin lesion classification and diagnostic model to be deployed efficiently on lightweight mobile devices. Hinton et al. [7] proposed enabling the student to learn the teacher's hidden knowledge by reducing the KL divergence of the last layer's temperature-softened outputs. Zhao et al. [8] developed Decoupled Knowledge Distillation (DKD), a framework that separates the target-class and non-target-class information present in logits, and concluded that the effectiveness of logits-based knowledge distillation stems from the information provided by the non-target classes. Hossain et al. [40] proposed LumiNet to reconstruct finer inter-class relationships, enabling the student model to learn richer knowledge. RCO [13] introduced a route-constrained optimization strategy to narrow the knowledge gap between the teacher and the student through hint learning with route constraints. Beyond the logits-based information from the hierarchical backbone of the teacher, intermediate feature-based representations can also be mined. FitNets [9] achieved this by minimizing the \(L_{2}\) loss between the intermediate features of the student and teacher. AT [10] focused attention on the intermediate features, thereby encouraging the student to learn more discriminative feature representations. FT [11] utilized a paraphraser and a translator to extract factors from the teacher and student, respectively, which then serve as distillable intermediate-layer representations. Yang et al. [12] incorporated auxiliary classifiers at all stages of the backbone to transfer intermediate-layer knowledge. Shu et al. [42] enhanced the student network's attention towards the most salient regions in each channel by computing the KL divergence between the student's and teacher's channel probability maps. However, the aforementioned knowledge distillation methods focus only on individual-instance knowledge. PKT [14] matched the probability distributions of the feature space between the student's and teacher's data samples for knowledge transfer, effectively exploring the relationships between different individuals. RKD [15], SP [16], FSP [17], and IRG [19] employed various relationship matrices to facilitate knowledge transfer, and CCKD [21] leveraged both individual and inter-individual relationships to enrich knowledge distillation. The relationships between different layers in the distillation framework also provide valuable representational information. Lee et al. [18] used a correlation matrix as a feature map and extracted vital information from it using singular value decomposition. Passalis et al. [20] suggested that the student can simulate the information flow in the teacher to facilitate knowledge transfer. Huang et al. [41] found that prediction matching with the KL divergence may underperform when there is a considerable gap between the teacher and student, and proposed a correlation-based DIST loss to better capture the teacher's intrinsic inter-class relationships. However, current distillation frameworks primarily focus on passive mimicry of the teacher's feature representations without considering how the student can actively extract knowledge beneficial for skin lesion classification.

AdaBoost in neural networks

The sequence of stages in the backbone of a student model can be considered an ordered learning setting. AdaBoost is a sequential learning algorithm that trains base classifiers at different stages and assigns higher weights to misclassified samples so that they are prioritized in subsequent stages. Eventually, all base classifiers are combined with their respective weights to form a robust classifier. Researchers have successfully applied the AdaBoost algorithm to convolutional neural networks, ensemble learning, and graph convolutional networks, with satisfactory results [22,23,24,25,26,27,28,29,30]. A CNN's performance is hindered when multi-class imbalanced samples are encountered. To address this, Taherkhani et al. [27] introduced AdaBoost-CNN, which effectively tackles the classification of multi-class imbalanced samples by combining AdaBoost with CNN.

Furthermore, to improve the joint optimization of ensemble neural networks, Shakeel et al. [29] applied AdaBoost to ensemble neural networks to analyze the features of medical image data more effectively. Huang et al. [34] employed AdaBoost to train ResNet in a progressive manner and showed the potential of boosting theory in ResNet under weak learning conditions. Besides, to address the over-smoothing caused by multiple aggregations of graph convolutional layers in the GCN, Sun et al. [30] proposed AdaGCN, which utilizes AdaBoost to integrate information from high-order neighboring nodes. However, how AdaBoost can actively enhance the student's ability to mine the teacher's learning representation in knowledge distillation remains to be investigated. This research seeks to integrate AdaBoost into a knowledge distillation framework and develop a new paradigm for skin lesion classification in dermatology images, enabling the student model to determine the "granularity" of mining the teacher's knowledge.

Active knowledge distillation

Traditional knowledge distillation methods involve the student model passively mimicking the teacher. WSLD [32] introduced a weighted soft-label approach that assigns a dynamic weight to the distillation loss based on the student's and teacher's learning on the supervised task. CTKD [33] proposed a distillation framework with a dynamic temperature hyperparameter, which adversarially adjusts the temperature to increase the distillation loss, allowing the student to transfer knowledge from easy to hard. While WSLD and CTKD provide a form of active knowledge distillation, they cannot perceive the nuanced difficulties inherent in the learning process, nor can they eliminate noise interference from the teacher.

Additionally, WSLD and CTKD only distill at the backend and do not utilize intermediate feature-based or relationship-based knowledge. While AdaBoost-CNN [27], BoostResNet [34], and AdaGCN [30] have introduced AdaBoost into deep neural networks, its application in knowledge distillation remains to be explored. Moreover, as the learning process progresses, the learning difficulties become increasingly difficult to recognize due to noise interference from the teacher, and how to effectively discover these difficulties at the current learning stage is a research topic that has yet to be investigated. This research treats the sequence of stages in the backbone of a student model as a progressive learning process and trains it with AdaBoost, where the weights of the input samples at each stage represent the degree of learning difficulty. The classification knowledge in the teacher is actively mined according to the student's learning difficulty to carry out targeted knowledge transfer. To better mine the nuanced difficulties in the later stages of skin lesion classification, we first incorporate the GCN [30, 31] to capture nuanced higher-order information from the nearest-neighboring nodes: the information from the current node's \(l\) th hop is extracted and relayed to the student during the learning process, while the nearest-neighbor relationship between the student and teacher is kept consistent. Next, we eliminate noise interference from these nuanced difficulties by maximizing the mutual information between the teacher and student.

Dermatologic diagnostic systems in clinical applications

To further advance the clinical use of dermatologic diagnostic systems, Haberman et al. [43] proposed modifications to the dermatologic diagnostic system DIAG, striving to improve its clinical acceptability and diagnostic potential. Brooks et al. [44] introduced the skin disease diagnostic prompting system DERMIS, which generates lists of plausible diagnoses based on probabilities computed from Bayes' theorem and is designed to be used exclusively by non-dermatologists (e.g., general practitioners). Liu et al. [45] developed a deep learning system, DLS, to aid in the differential diagnosis of dermatological diseases; it offers differential diagnoses for 26 specific dermatological conditions, significantly enhancing the accuracy and effectiveness of dermatological diagnosis and treatment in primary care settings. To promote the use of healthcare mobile apps in telemedicine, Hameed et al. [46] proposed detecting four dermatological diseases with a mobile app on the Android platform, eliminating the need for patients to visit the clinic physically. A reported 1.9 billion people worldwide are affected by skin diseases, and the shortage of dermatologists leads many to seek dermatologic care from general practitioners, resulting in lower diagnostic accuracy [45]. Additionally, using heuristic/evolutionary search optimization algorithms, such as the Gravitational Search Algorithm [47,48,49] and Inclined Planes System Optimization [50, 51], to tackle hyperparameter optimization during network training is a promising research area. This paper introduces a supervised loss, an AdaBoost loss, and a distillation loss, suggesting that the performance of dermatological diagnostic models could be further improved by carefully selecting optimal values for the loss-term weights and the temperature coefficient. Developing efficient and stable systems [52,53,54] to deploy lightweight dermatologic diagnostic models on mobile devices will aid early screening and improve rural diagnosis and treatment.

Methodology

Framework design

The framework of the proposed Variational AdaBoost Knowledge Distillation (VAdaKD) is illustrated in Fig. 1. The figure shows how the sample weights evolve throughout training and how the student model leverages these weights to actively mine and learn knowledge from the teacher model. Both the teacher and student models are assumed to have \(L\) stages, and we consider the stages in the backbone as an ordered learning process. We incorporate an intermediate auxiliary classifier for each stage, resulting in a total of \(L\) base classifiers. The base classifier for the \(l\) th stage is denoted as \(p_{l} \left( \cdot \right)\) and comprises a convolutional layer, a global average pooling layer, and a fully-connected layer. AdaBoost learns the base classifiers sequentially, and the weights of the input dermatology samples for the \(l\) th base classifier are denoted as \(\left\{ {w_{l - 1}^{i} } \right\}_{i = 1}^{B}\), where \(B\) represents the number of samples in the input batch. The weight of each base classifier is calculated as \(\alpha_{l}\). The weights of the input dermatology samples at each stage measure the learning difficulty in skin lesion classification. These sample-level weights, represented by \(\left\{ {w_{l - 1}^{i} } \right\}_{i = 1}^{B}\), are utilized in the distillation module to identify which dermatology samples the student finds difficult to learn at that stage, so that the student can actively mine knowledge from the teacher regarding these learning difficulties. However, as the student's learning process progresses, it becomes increasingly challenging to identify learning difficulties due to noise interference from the teacher. We therefore introduce a variational difficulty mining strategy (VDMS) that minimizes the impact of noise by maximizing the mutual information between the teacher and student. The final distillation loss is then obtained by linearly weighting the individual base classifiers using the base classifier weights \(\alpha_{l}\). The losses in VAdaKD comprise the cross-entropy loss for the student task, the AdaBoost training loss for the student task, and the Variational AdaBoost-based distillation loss.
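To make the per-stage auxiliary classifier concrete, the following is a minimal PyTorch sketch of \(p_{l} \left( \cdot \right)\) with the structure described above (one convolutional layer, global average pooling, and a fully-connected layer); the channel width, kernel size, and activation are illustrative assumptions rather than the exact configuration used in our implementation.

```python
# A minimal sketch of the per-stage auxiliary (base) classifier p_l(.):
# one convolutional layer, global average pooling, and a fully-connected layer.
# Channel width, kernel size, and the ReLU placement are assumptions.
import torch
import torch.nn as nn

class AuxiliaryClassifier(nn.Module):
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: intermediate feature map F_l of shape (B, C, H, W)
        x = torch.relu(self.conv(feat))
        x = self.pool(x).flatten(1)                  # (B, C)
        return self.fc(x)                            # per-stage logits p_l(x)
```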

Fig. 1

The proposed Variational AdaBoost Knowledge Distillation (VAdaKD) framework. VAdaKD empowers the student to actively mine the teacher's learning representation in skin lesion classification using AdaBoost. The weights of the dermatology input samples, denoted as \(\left\{ {w_{l}^{i} } \right\}_{i = 1}^{B}\) for each stage, represent the learning difficulty of each sample \(x_{i}\) at the \(l\) th stage. Based on these weights, the three forms of knowledge are selectively transferred to the student. \(F_{l}^{S}\) and \(F_{l}^{T}\) represent the intermediate features of the student and teacher at the \(l\) th stage, and \(f_{i}^{s}\) and \(f_{i}^{t}\) represent the outputs of the \(i\) th dermatology sample for the student and teacher, respectively. \(C\) represents the correlation between two specific dermatology samples. Finally, the final distillation loss is obtained by linearly weighting each form of knowledge according to the base classifier weights \(\alpha_{l}\). We introduce a VDMS that eliminates noise interference from nuanced difficulties by maximizing the mutual information between the teacher and student

The distillation loss comprises three types of knowledge: logits-based, intermediate feature-based, and relationship-based knowledge. However, it becomes increasingly challenging to identify learning difficulties due to noise interference from the teacher. VDMS first incorporates the GCN to form a nearest-neighbor relationship matrix \(A\), which is used to calculate the information of the current node's \(l\) th hop and relay it to the student. Next, VDMS eliminates noise interference from these nuanced difficulties by maximizing the mutual information between the teacher and student. The general framework of VAdaKD proposed in this paper is depicted in Fig. 1.

Pre-training Teacher Model

The teacher model for skin lesion classification consists of a backbone network, denoted as \(f^{t} \left( \cdot \right)\), with \(L\) stages. The auxiliary (i.e., base) classifiers, represented by \(\left\{ {p_{l}^{t} \left( \cdot \right)} \right\}_{l = 1}^{L}\), are contained within the backbone. The pre-training process of the teacher model is divided into two phases. In the first phase, the teacher model with \(L\) stages is trained, resulting in the trained backbone \(f^{t} \left( \cdot \right)\). In the second phase, the weights of the backbone \(f^{t} \left( \cdot \right)\) are frozen, and the parameters of the auxiliary classifiers \(\left\{ {p_{l}^{t} \left( \cdot \right)} \right\}_{l = 1}^{L}\) are updated. According to the AdaBoost boosting theory, the weights of the input dermatology samples for the \(l\) th base classifier are represented as \(\left\{ {w_{l - 1}^{i} } \right\}_{i = 1}^{B}\), where \(B\) represents the number of samples in the input batch. Then, based on the error rate of each base classifier, we calculate the weight of the base classifier \(\alpha_{l}\). Finally, we obtain the AdaBoost loss by linearly weighting all base classifiers by the weights \(\left\{ {\alpha_{l} } \right\}_{l = 1}^{L}\). Note that we use the cross-entropy loss in both training phases to update the parameters for true-label supervision. At the \(l\) th stage, the error rate of the auxiliary classifier is given by Eq. (1),

$$ err_{l} = \mathop \sum \limits_{i = 1}^{B} w_{l - 1}^{i} {\mathbb{I}}\left( {y_{i} \ne p_{l}^{t} \left( {x_{i} } \right)} \right)/\mathop \sum \limits_{i = 1}^{B} w_{l - 1}^{i} , $$
(1)

where \(y_{i}\) represents the true label (i.e., ground truth) of the dermatology input sample \(x_{i}\), and the batch size is denoted as \(B\). The weight \(\alpha_{l}\) of the auxiliary classifier for the \(l\) th stage is shown in Eq. (2),

$$ \alpha_{l} = \log \frac{{1 - err_{l} }}{{err_{l} }} + \log \left( {K - 1} \right), $$
(2)

where \(K\) represents the number of categories. In order to ensure that \(\alpha_{l}\) is positive, the condition \(\left( {1 - err_{l} } \right) > 1/K\) needs to be satisfied. We then update the weights of the input samples \(x_{i}\), giving higher weights to misclassified samples. Finally, \(w_{l}^{i}\) represents the weight of the dermatology input sample for the (\(l\)+1)th base classifier, as shown in Eq. (3). This process is iterative, and the initial weight of the sample \(x_{i}\) is set to \(w_{0}^{i} = 1/B\).

$$ w_{l}^{i} \leftarrow w_{l - 1}^{i} \cdot \exp \left( {\alpha_{l} \cdot {\mathbb{I}}\left( {y_{i} \ne p_{l}^{t} \left( {x_{i} } \right)} \right)} \right). $$
(3)
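For illustration, the following is a minimal PyTorch sketch of the per-stage AdaBoost update in Eqs. (1)–(3): the weighted error of the \(l\) th auxiliary classifier, its weight \(\alpha_{l}\), and the updated sample weights for the next stage. The final renormalization of the weights is an assumption added for numerical stability.

```python
# A minimal sketch of the multi-class AdaBoost update in Eqs. (1)-(3).
# Variable names follow the text; the renormalization step is an assumption.
import torch

def adaboost_update(pred_labels, true_labels, w, num_classes):
    # pred_labels, true_labels: (B,) predicted / ground-truth class indices
    # w: (B,) current sample weights {w_{l-1}^i}
    miss = (pred_labels != true_labels).float()               # indicator I(y_i != p_l(x_i))
    err = (w * miss).sum() / w.sum()                          # Eq. (1)
    err = err.clamp(1e-8, 1 - 1e-8)                           # avoid log(0)
    alpha = torch.log((1 - err) / err) \
        + torch.log(torch.tensor(num_classes - 1.0))          # Eq. (2)
    w_next = w * torch.exp(alpha * miss)                      # Eq. (3)
    w_next = w_next / w_next.sum()                            # keep weights normalized (assumption)
    return alpha, w_next

# Example: B = 4 samples, K = 7 classes, initial weights w_0^i = 1/B
w0 = torch.full((4,), 0.25)
alpha_1, w1 = adaboost_update(torch.tensor([0, 2, 1, 3]),
                              torch.tensor([0, 2, 2, 3]), w0, num_classes=7)
```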

As depicted in Fig. 2, the initial training phase uses the true labels of skin lesion classification to supervise the training of the teacher model's backbone. In the subsequent training phase, the true labels are still used for supervision, but the weights of the backbone are frozen, and AdaBoost is employed to train the auxiliary classifiers introduced at all stages. Finally, the outputs of all classifiers are integrated into a final prediction using the corresponding weights.

Fig. 2

Two-phase pre-training of a teacher model with \(L\) stages for skin lesion classification. In the first phase, the backbone \(f^{t} \left( \cdot \right)\) is trained. In the second phase, the weights of the backbone \(f^{t} \left( \cdot \right)\) are frozen, and the parameters of the auxiliary classifiers \(\left\{ {p_{l}^{t} \left( \cdot \right)} \right\}_{l = 1}^{L}\) are updated using the AdaBoost boosting theory, where the input sample weights of the \(l\) th base classifier are denoted as \(\left\{ {w_{l - 1}^{i} } \right\}_{i = 1}^{B}\). The weight \(\alpha_{l}\) of the base classifier is then calculated based on its error rate, and the dermatology input sample weights \(\left\{ {w_{l}^{i} } \right\}_{i = 1}^{B}\) of the (\(l\)+1)th auxiliary classifier are updated and adjusted by the \(l\) th auxiliary classifier

Training student model

We refer to the backbone of the student model for skin lesion classification as \(f^{s} \left( \cdot \right)\). Additionally, we have auxiliary classifiers within this backbone, denoted as \(\left\{ {p_{l}^{s} \left( \cdot \right)} \right\}_{l = 1}^{L}\), where \(p_{l}^{s} \left( \cdot \right)\) represents the auxiliary classifier of the \(l\) th stage in the student. The student is trained under the guidance of the pre-trained teacher. The overall loss function includes the task loss of true-label supervision, the AdaBoost loss of true-label supervision, and the weighted distillation loss with sample-level weights between the student and teacher.


Task loss of true-label supervision in skin lesion classification \( {\varvec{L}}_{{{\varvec{ce}}}}\): The task loss in the student is determined by the cross-entropy loss of true-label supervision. The objective is to enable \(f^{s} \left( \cdot \right)\) to accurately classify one-hot labeled data. The task loss \(L_{ce}\) is obtained by computing the cross-entropy of the final outputs against the true labels.


AdaBoost loss of true-label supervision in skin lesion classification \({\varvec{L}}_{{{\varvec{ada}}\_{\varvec{ce}}}}\): AdaBoost employs a sequential training approach, training the student model with \(L\) auxiliary classifiers whose prediction functions are \(\left\{ {p_{l}^{s} \left( \cdot \right)} \right\}_{l = 1}^{L}\). The input sample weights of the \(l\) th base classifier are denoted as \(\left\{ {w_{l - 1}^{i} } \right\}_{i = 1}^{B}\), and the input sample weights \(\left\{ {w_{l}^{i} } \right\}_{i = 1}^{B}\) of the (\(l\)+1)th base classifier are updated and adjusted by the \(l\) th base classifier. The weight of the \(l\) th base classifier \(\alpha_{l}\) is then computed based on its error rate. Finally, the AdaBoost loss is obtained by linearly combining the individual base classifiers with the weights \(\left\{ {\alpha_{l} } \right\}_{l = 1}^{L}\). In addition, as the learning process of skin lesion classification progresses, it becomes increasingly challenging for the later learning stages to mine nuanced classification difficulties. To address this problem, the information from the current node's \(l\) th hop in the teacher is relayed to the student. Specifically, the nearest-neighbor relationship matrix \(\left\{ {A_{l}^{t} } \right\}_{l = 1}^{L} \in R^{B \times B}\) of each stage in the teacher model is copied to the student model, where \(B\) represents the batch size, and the features of the \(l\) th stage in the student are transformed into \(F_{l}^{S} = A_{l}^{t} F_{l}^{S} \in R^{B \times C \times H \times W}\). The final AdaBoost loss of true-label supervision is shown in Eq. (4),

$$ L_{ada\_ce} = - \mathop \sum \limits_{i = 1}^{B} \mathop \sum \limits_{l = 1}^{L} y_{i} \log \alpha_{l} p_{l}^{s} \left( {x_{i} } \right) , $$
(4)

where \(y_{i}\) represents the true label of the dermatology input sample \(x_{i}\), and \(p_{l}^{s} \left( {x_{i} } \right)\) is formulated as \(p_{l}^{s} \left( {x_{i} } \right) = C_{l} \left( {w_{l - 1}^{i} \cdot F_{l}^{S} \left( {x_{i} } \right)} \right)\): the intermediate feature of the input sample \(x_{i}\) is extracted by \(F_{l}^{S} \left( {x_{i} } \right)\) and classified by the \(l\) th auxiliary classifier module \(C_{l} \left( \cdot \right)\) with the weight \(w_{l - 1}^{i}\). The value of \(w_{l - 1}^{i}\) represents the difficulty of classification learning for \(x_{i}\) in the \(l\) th base classifier, with a larger value indicating higher difficulty.
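As an illustration of how the teacher's nearest-neighbor relationship is relayed to the student, the sketch below builds a matrix \(A_{l}^{t}\) from the teacher's stage features and applies it to the student's stage features as \(F_{l}^{S} = A_{l}^{t} F_{l}^{S}\). The cosine-similarity k-NN construction and the row normalization are assumptions, since the text only specifies that \(A_{l}^{t}\) is derived from the teacher.

```python
# A minimal sketch of relaying the teacher's nearest-neighbor relationship matrix
# A_l^t to the student. The k-NN construction and row normalization are assumptions.
import torch
import torch.nn.functional as F

def neighbor_matrix(teacher_feat: torch.Tensor, k: int = 4) -> torch.Tensor:
    # teacher_feat: (B, C, H, W) -> cosine similarity over globally pooled features
    z = F.normalize(teacher_feat.mean(dim=(2, 3)), dim=1)     # (B, C)
    sim = z @ z.t()                                           # (B, B)
    topk = sim.topk(k, dim=1).indices
    A = torch.zeros_like(sim).scatter_(1, topk, 1.0)          # keep k nearest neighbors
    return A / A.sum(dim=1, keepdim=True)                     # row-normalize (assumption)

def relay_to_student(A: torch.Tensor, student_feat: torch.Tensor) -> torch.Tensor:
    # student_feat: (B, C, H, W); mix samples along the batch dimension, F <- A F
    B, C, H, W = student_feat.shape
    mixed = A @ student_feat.reshape(B, -1)                   # (B, C*H*W)
    return mixed.reshape(B, C, H, W)
```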


Weighted distillation loss with sample-level weights \(L_{mimic}\): The weights of the input samples obtained by AdaBoost are represented as \(\left\{ {w_{l - 1}^{i} } \right\}_{i = 1}^{B}\). Applying these sample-level weights in the distillation module allows us to effectively engage three forms of knowledge. The distillation loss encompasses logits-based, intermediate feature-based, and relationship-based distillation losses. The logits-based distillation loss takes two forms, the first being the KL loss of the temperature-softened outputs of the last layer of the student and teacher (as shown in Eq. (5)),

$$ L_{kd} = - \tau^{2} \mathop \sum \limits_{i = 1}^{B} p^{t} \left( {x_{i} ;\tau } \right)\log \left( {p^{s} \left( {x_{i} ;\tau } \right)} \right), $$
(5)

where \(\tau\) represents the temperature, which we set to 3 in this paper, and \(B\) denotes the size of the input batch. The second form of logits-based distillation loss is the distillation loss of the temperature-softened outputs of the corresponding auxiliary classifiers of the student and teacher, as shown in Eq. (6),

$$ L_{ada\_kd} = - \tau^{2} \mathop \sum \limits_{i = 1}^{B} \mathop \sum \limits_{l = 1}^{L} w_{l - 1}^{i} \cdot p_{l}^{t} \left( {x_{i} ;\tau } \right)\log \left( {p_{l}^{s} \left( {x_{i} ;\tau } \right)} \right). $$
(6)
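A minimal sketch of the two logits-based terms in Eqs. (5) and (6) is given below; summation over the batch (rather than averaging) follows the equations, and the per-stage sample weights \(w_{l-1}^{i}\) enter only the second term. This is an illustrative sketch, not the exact implementation.

```python
# A minimal sketch of the logits-based distillation terms in Eqs. (5)-(6).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, tau=3.0):
    # Eq. (5): -tau^2 * sum_i p^t(x_i; tau) log p^s(x_i; tau)
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    return -(tau ** 2) * (p_t * log_p_s).sum()

def ada_kd_loss(stage_student_logits, stage_teacher_logits, stage_weights, tau=3.0):
    # Eq. (6): per-stage terms weighted by the sample weights {w_{l-1}^i}
    loss = 0.0
    for s_logit, t_logit, w in zip(stage_student_logits, stage_teacher_logits, stage_weights):
        p_t = F.softmax(t_logit / tau, dim=1)
        log_p_s = F.log_softmax(s_logit / tau, dim=1)
        loss = loss - (tau ** 2) * (w.unsqueeze(1) * p_t * log_p_s).sum()
    return loss
```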

In order to effectively extract the representation information from the intermediate features \(\left\{ {F_{l}^{T} } \right\}_{l = 1}^{L}\), this paper constructs the distillation loss using Attention Transfer [10] as shown in Eq. (7),

$$ L_{att} = \mathop \sum \limits_{i = 1}^{B} \mathop \sum \limits_{l = 1}^{L} w_{l - 1}^{i} \cdot L_{AT} \left( {{\text{F}}_{l}^{T} \left( {x_{i} } \right),{\text{ F}}_{l}^{S} \left( {x_{i} } \right)} \right), $$
(7)

however, as the student's learning process progresses, it becomes increasingly challenging to identify learning difficulties due to noise interference from the teacher. Therefore, we introduce the VDMS to minimize the impact of noise, and the AT-based distillation loss with VDMS is illustrated in Eq. (8),

$$L_{att\_vdms} = \mathop \sum \limits_{i = 1}^{B} \mathop \sum \limits_{l = 1}^{L} w_{l - 1}^{i} \cdot \left( {\frac{{\left( {t_{l}^{T} \left( {x_{i} } \right) - \mu_{l}^{S} \left( {x_{i} } \right)} \right)^{2} }}{{2\sigma_{l}^{2} \left( {x_{i} } \right)}} + \log \sigma_{l}^{2} \left( {x_{i} } \right)} \right), $$
(8)

where \(t_{l}^{T} \left( \cdot \right)\) represents the teacher's attention map obtained by applying the attention mapping of Attention Transfer [10] to the intermediate feature \({\text{F}}_{l}^{T} \left( \cdot \right)\), \(\mu_{l}^{S} \left( \cdot \right)\) represents the student's attention map obtained by applying the attention mapping to the intermediate feature \({\text{F}}_{l}^{S} \left( \cdot \right)\), and \(\sigma_{l}^{2} \left( \cdot \right)\) represents the variance of the attention map.
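The following sketch illustrates one stage's AT-based loss with VDMS from Eq. (8), assuming the student branch outputs both an attention-map mean \(\mu_{l}^{S}\) and a log-variance; predicting the log-variance with a separate head is an assumption made for numerical stability.

```python
# A minimal sketch of the per-stage AT-based VDMS term in Eq. (8).
# The log-variance head and the averaging over spatial positions are assumptions.
import torch
import torch.nn.functional as F

def attention_map(feat: torch.Tensor) -> torch.Tensor:
    # Attention Transfer [10]: channel-wise squared mean, flattened and normalized
    a = feat.pow(2).mean(dim=1).flatten(1)          # (B, H*W)
    return F.normalize(a, dim=1)

def att_vdms_loss(teacher_feat, student_mu_feat, student_logvar, w):
    # teacher_feat, student_mu_feat: (B, C, H, W); student_logvar: (B, H*W); w: (B,)
    t = attention_map(teacher_feat)                 # t_l^T(x_i)
    mu = attention_map(student_mu_feat)             # mu_l^S(x_i)
    var = student_logvar.exp()                      # sigma_l^2(x_i)
    per_sample = ((t - mu) ** 2 / (2 * var) + student_logvar).mean(dim=1)
    return (w * per_sample).sum()                   # weighted by w_{l-1}^i
```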

The above two kinds of distillation losses focus solely on individual-instance knowledge distillation. However, to explore the relationship between different individuals in classifying dermatology images, we present a relationship-based distillation loss. Firstly, we extract the logits features of \(B\) dermatology samples from each auxiliary classifier of the teacher model and construct the correlation matrix \(\left\{ {R_{l}^{t} \left( \cdot \right)} \right\}_{l = 1}^{L}\). Similarly, we construct the correlation matrix \(\left\{ {R_{l}^{s} \left( \cdot \right)} \right\}_{l = 1}^{L}\) for the logits features of \(B\) dermatology samples from each auxiliary classifier of the student model. Then, we calculate the Mean Squared Error (MSE) loss between the correlation matrices of the student and teacher to obtain the relationship-based distillation loss, as shown in Eq. (9),

$$ L_{rkd} = \mathop \sum \limits_{l = 1}^{L} \alpha_{l} \cdot L_{mse} \left( {R_{l}^{t} ,{ }R_{l}^{s} } \right) = \mathop \sum \limits_{l = 1}^{L} \alpha_{l} \cdot \left( {R_{l}^{t} - R_{l}^{s} } \right)^{2} , $$
(9)

the relationship-based distillation loss with VDMS is expressed in Eq. (10), where \(\sigma_{l}^{2}\) represents the variance of the correlation matrix,

$$ L_{rkd\_vdms} = \mathop \sum \limits_{l = 1}^{L} \alpha_{l} \cdot \left( {\frac{{\left( {R_{l}^{t} - R_{l}^{s} } \right)^{2} }}{{2\sigma_{l}^{2} }} + \log \sigma_{l}^{2} } \right). $$
(10)
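A minimal sketch of the relationship-based terms in Eqs. (9) and (10) is shown below; the use of cosine similarity over the auxiliary-classifier logits to build \(R_{l}\), and the per-stage log-variance input, are illustrative assumptions.

```python
# A minimal sketch of the relationship-based losses in Eqs. (9)-(10).
# Correlation via cosine similarity of logits is an assumption.
import torch
import torch.nn.functional as F

def correlation_matrix(logits: torch.Tensor) -> torch.Tensor:
    z = F.normalize(logits, dim=1)                  # (B, K)
    return z @ z.t()                                # (B, B) pairwise correlations R_l

def rkd_vdms_loss(stage_t_logits, stage_s_logits, alphas, stage_logvar):
    # Eq. (10): sum_l alpha_l * ((R_l^t - R_l^s)^2 / (2 sigma_l^2) + log sigma_l^2)
    # stage_logvar: per-stage log-variance (scalar tensor or (B, B) tensor)
    loss = 0.0
    for t_logit, s_logit, alpha, logvar in zip(stage_t_logits, stage_s_logits,
                                               alphas, stage_logvar):
        r_t = correlation_matrix(t_logit).detach()  # teacher matrix, no gradient
        r_s = correlation_matrix(s_logit)
        var = logvar.exp()
        loss = loss + alpha * ((r_t - r_s) ** 2 / (2 * var) + logvar).mean()
    return loss
```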

Total Loss: The total loss of the student model includes the task loss for true-label supervision, the AdaBoost loss for true-label supervision, and the weighted distillation loss with sample-level weights between the student and the teacher, as shown in Eq. (11),

$$ L_{ total} = L_{ce} + \lambda_{1} L_{ada\_ce} + \lambda_{2} L_{mimic} , $$
(11)

where the weighted distillation loss with sample-level weights in skin lesion classification, \(L_{mimic}\), is given by Eq. (12); \(\lambda_{1}\), \(\lambda_{2}\), \(\eta\), \(\beta\), and \(\gamma\) are balance factors. In the distillation loss, we set the temperature hyperparameter \(\tau\) to 3.

$$\begin{aligned} L_{mimic} = L_{kd} + \eta L_{ada\_kd} + \gamma L_{att\_vdms} { } + \beta L_{rkd\_vdms}.\end{aligned}$$
(12)
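Putting the pieces together, a minimal sketch of the total objective in Eqs. (11) and (12) is given below; the default balance factors mirror the values used in the experiments (\(\lambda_{1}=0.5\), \(\lambda_{2}=0.3\), \(\eta=0.25\), \(\gamma=5e{-}7\), \(\beta=0.3\)).

```python
# A minimal sketch assembling the total training objective of Eqs. (11)-(12)
# from the individual loss terms defined above.
def total_loss(L_ce, L_ada_ce, L_kd, L_ada_kd, L_att_vdms, L_rkd_vdms,
               lambda_1=0.5, lambda_2=0.3, eta=0.25, gamma=5e-7, beta=0.3):
    L_mimic = L_kd + eta * L_ada_kd + gamma * L_att_vdms + beta * L_rkd_vdms   # Eq. (12)
    return L_ce + lambda_1 * L_ada_ce + lambda_2 * L_mimic                     # Eq. (11)
```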

Experiments

The experimental setup of this paper consists of three parts. Firstly, in Sect. "Comparison experiment", we compare VAdaKD with related knowledge distillation frameworks to validate its effectiveness. It should be noted that AdaKD denotes our proposed AdaBoost-based KD with GCN, while VAdaKD denotes our proposed AdaBoost-based KD with VDMS. Secondly, in Sect. "Ablation Study", ablation experiments are designed to observe the impact of the number of auxiliary classifiers and to explore the validity of each component of the distillation loss. Lastly, in Sect. "Visualization", we explore the performance of VAdaKD in mining the learning difficulties at each stage by visualizing the correlation matrix for each stage of the student model, and we validate the multi-classification performance achieved by VAdaKD through t-distributed stochastic neighbor embedding (t-SNE) visualization.

Our experiments assess the model's performance using four evaluation metrics: Accuracy (Acc), Recall, F1-score, and AUC. Accuracy is a widely used metric in classification tasks that measures the overall classification performance of the model; however, high accuracy on highly imbalanced data may not be meaningful. AUC compensates for this shortcoming and better reflects the classifier's performance. Recall is crucial in the medical field, as it indicates the correct classification rate among all diseased samples, which is essential for timely patient treatment. F1-score is a comprehensive metric that takes both Recall and Precision into account.

Comparison experiment

Extensive experiments are conducted on three benchmark datasets: the Dermnet, ISIC2019, and HAM10000 datasets. The values of \(\lambda_{1}\) and \(\lambda_{2}\) in Eq. (11) are set to 0.5 and 0.3, respectively, and the value of \(\eta\) in Eq. (12) is set to 0.25. Furthermore, following the experimental setups of AT [10] and SP [16], the values of \(\gamma\) and \(\beta\) are set to 5e-7 and 0.3, respectively. The framework is implemented in PyCharm using the PyTorch library, with training performed on a single NVIDIA 3090 GPU.

Experimental results on the Dermnet dataset

The Dermnet dataset [36], available on the largest online dermatology resource site, Dermnet, consists of 23 categories of skin lesion diseases. The dataset includes a training set with 15,552 images and a test set with 3,968 images. Table 1 presents a comparison of Accuracy (Acc), Recall, F1-score, and AUC for different distillation methods on the Dermnet dataset using various teacher-student backbones. We reproduce each distillation framework, repeat the experiment 5 times, and report the standard deviation on all metrics. The teacher model is obtained by the pre-training technique described in Sect. "Pre-training Teacher Model", and for a fair comparison, all experimental results are obtained under the guidance of this teacher model. Besides, we introduce the VDMS into the AT-based and relationship-based distillation losses to minimize the noise impact from the teacher. Specifically, we first impose a random mask with a mask rate of 0.6 on the student's intermediate feature, pass it through two 3 × 3 convolutional layers and two 1 × 1 convolutional layers, and finally obtain the prediction \({\text{F}}_{l}^{S} \left( \cdot \right)\). Then, according to Eqs. (8) and (10), we obtain \(L_{att\_vdms}\) and \(L_{rkd\_vdms}\), respectively. The results indicate that our proposed VAdaKD achieves superior performance across all three teacher-student network backbones, and our framework performs very close to the teacher model when the teacher is ResNet34 and the student is ResNet18.
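For reference, a minimal PyTorch sketch of the VDMS student branch described above (a random mask with rate 0.6 followed by two 3 × 3 and two 1 × 1 convolutional layers) is given below; the channel widths and the placement of the non-linearities are assumptions.

```python
# A minimal sketch of the VDMS student branch: random masking (rate 0.6) of the
# student's intermediate feature, then two 3x3 and two 1x1 convolutions.
# Channel widths and ReLU placement are assumptions.
import torch
import torch.nn as nn

class VDMSHead(nn.Module):
    def __init__(self, channels: int, mask_rate: float = 0.6):
        super().__init__()
        self.mask_rate = mask_rate
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        if self.training:
            mask = (torch.rand_like(feat[:, :1]) > self.mask_rate).float()
            feat = feat * mask                       # randomly mask spatial positions
        return self.net(feat)                        # prediction used in Eqs. (8) and (10)
```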

Table 1 Comparison of Accuracy (Acc), Recall, F1-score, and AUC of various distillation frameworks on the Dermnet dataset when using different teacher-student backbones

Experimental results on the ISIC2019 dataset

The ISIC2019 Challenge dataset [38, 39] consists of 8 categories of dermatology images. This experiment divides the dataset into training and test sets in an 8:2 ratio; the training set contains 20,224 images, while the test set has 5,056 images. Table 2 shows a comparison of the Accuracy (Acc), Recall, F1-score, and AUC of different distillation frameworks on the ISIC2019 dataset using various teacher-student backbones. Similar to the setup on the Dermnet dataset, we introduce the VDMS into the AT-based and relationship-based distillation losses to minimize the noise impact from the teacher. It can be observed from Table 2 that our proposed VAdaKD achieves superior results in almost all settings across the three teacher-student backbones, thereby validating the effectiveness of our proposed distillation framework. Furthermore, we compare the Recall performance of our proposed VAdaKD against logits-based (e.g., DKD), intermediate feature-based (e.g., CAT-KD), relationship-based (e.g., SP), and active (e.g., WSLD) knowledge distillation methods on the ISIC2019 dataset, with ResNet34 as the teacher and ResNet18 as the student. The results show that our proposed VAdaKD performs better than the other comparison methods on hard classes like DF and SCC, which highlights its effectiveness in actively identifying learning difficulties against noise interference from the teacher.

Table 2 Comparison of Accuracy (Acc), Recall, F1-score, and AUC of various distillation frameworks on the ISIC2019 dataset when using different teacher-student backbones

Experimental results on the HAM10000 dataset

To further validate the robustness of our proposed method, we also performed comparison experiments on the HAM10000 dataset [37], which comprises seven classes of dermatology images. The training set includes 9,187 images, while the testing set has 828 images. Table 3 presents the comparison of Accuracy (Acc), Recall, F1-score, and AUC of various distillation frameworks applied to the HAM10000 dataset with different backbone architectures. We incorporate VDMS into AT-based and relationship-based distillation loss to mitigate the impact of noise from the teacher. The results in Table 3 demonstrate that our proposed VAdaKD outperforms other state-of-the-art knowledge distillation methods across various teacher-student backbones, highlighting the effectiveness of our distillation framework.

Table 3 Comparison of Accuracy (Acc), Recall, F1-score, and AUC of various distillation frameworks on the HAM10000 dataset when using different teacher-student backbones

Number of student model parameters and FLOPs

The numbers of parameters and FLOPs of all comparison methods and our proposed model during training and testing are detailed in Table 4. Auxiliary classifiers are incorporated during the training phase to form the base classifiers and train the student using the AdaBoost method, which increases the number of parameters during training compared to testing. However, these auxiliary classifiers are not used during inference, so the model deployed at test time remains compact. Deploying such a lightweight dermatologic diagnostic model on mobile devices has significant potential for clinical applications.

Table 4 Number of parameters and FLOPs during the training and testing phases of all comparison methods and our proposed method

Variational difficulty mining strategy in skin lesion classification at different stages

In order to identify nuanced learning difficulties against noise interference from the teacher, we introduce a variational difficulty mining strategy (VDMS). VDMS first utilizes the GCN to capture nuanced higher-order information: it constructs a nearest-neighbor relationship matrix \(A_{l}\) to calculate the information of the current node's \(l\) th hop and then relays it to the student. The nearest-neighbor relationship matrix \(A_{l}\) is derived from the teacher model to ensure consistency in the relationship between the student and teacher. Next, VDMS eliminates noise interference from these nuanced difficulties by maximizing the mutual information between the teacher and student. Figure 3 demonstrates the efficacy of our proposed method in capturing nuanced higher-order information, which leads to better classification performance on difficult categories; the results indicate that our proposed method outperforms others in acquiring challenging knowledge and enhancing classification performance on complex samples. To validate the effectiveness of the VDMS strategy, we conducted a corresponding experimental study, with detailed results presented in Table 5. We employ BoostResNet as a baseline on the Dermnet dataset. 'Stu + AdaBoost' denotes that the student model is trained using AdaBoost but without distillation; 'Stu + AdaBoost + GCN' denotes that the student model is trained using AdaBoost along with GCN but without distillation. Meanwhile, AdaKD denotes our proposed AdaBoost-based KD with GCN, and VAdaKD denotes our proposed AdaBoost-based KD with VDMS. Table 5 shows that our proposed VAdaKD achieves satisfactory results and performs better than AdaKD, indicating the effectiveness of the VDMS strategy in identifying the increasing learning difficulties at different stages.

Fig. 3

Recall performance of our proposed VAdaKD against the comparison methods on the ISIC2019 dataset. The comparison methods include logits-based (e.g., DKD), intermediate feature-based (e.g., CAT-KD), relationship-based (e.g., SP), and active (e.g., WSLD) knowledge distillation. ResNet34 acts as the teacher, and ResNet18 acts as the student. It is worth noting that our VAdaKD performs better on hard classes like DF and SCC, which highlights the effectiveness of actively identifying learning difficulties against noise interference from the teacher

Table 5 Comparison of the effectiveness of the VDMS strategy in skin lesion classification at different stages

Ablation Study


Effect of the number of auxiliary classifiers. We investigate the impact of appending auxiliary classifiers on the performance of the distillation framework in skin lesion classification. We gradually increase the number of auxiliary classifiers and evaluate their influence on the Dermnet dataset, using the student model with vanilla knowledge distillation (KD) as the benchmark. From Fig. 4, it is evident that the best results are obtained when auxiliary classifiers are added at all stages. This demonstrates that the AdaBoost boosting theory can effectively train auxiliary classifiers across different stages sequentially, thereby improving the performance of the student model in skin lesion classification.

Fig. 4

Ablation experiments on the Dermnet dataset. We progressively increase the number of auxiliary classifiers for the student model ResNet18, with vanilla knowledge distillation (KD) as the baseline. The best results are obtained when auxiliary classifiers are added across all stages

Effect of each component of the distillation loss. We conduct an ablation study on the Dermnet dataset to investigate the contribution of each component of the weighted distillation loss with sample-level weights, \(L_{mimic}\), during the distillation process. The student model (without KD) is utilized as the baseline while maintaining the same task loss of true-label supervision \(L_{ce}\) and AdaBoost loss of true-label supervision \(L_{ada\_ce}\). The experimental results are presented in the left panel of Fig. 5. Taking the student baseline as the benchmark, we gradually append the logits-based distillation loss, which includes two forms: the KL loss \(L_{kd}\) between the temperature-softened outputs of the last layer of the student and teacher, and the distillation loss \(L_{ada\_kd}\) of the temperature-softened outputs of the corresponding auxiliary classifiers of the student and teacher. Additionally, we introduce the intermediate feature-based distillation loss \(L_{att}\) and the relationship-based distillation loss \(L_{rkd}\). Our experiments show that each component contributes positively to the final recognition performance, highlighting their effectiveness in leveraging the different forms of knowledge embedded in the teacher model.

Fig. 5

Ablation experiments on the Dermnet dataset. The left panel shows ablation experiments of weighted distillation loss with sample-level weights \(L_{mimic}\), which examines each component's role in the distillation process based on the student ResNet18. The right panel illustrates ablation experiments of the AT-based and relationship-based distillation loss with the VDMS strategy, which examines VDMS's role in the distillation process based on the student ResNet18


Effect of the variational difficulty mining strategy. We introduce the VDMS into the AT-based distillation loss to obtain the AT-based distillation loss with VDMS, \(L_{att\_vdms}\), according to Eq. (8). Similarly, we introduce the VDMS into the relationship-based distillation loss to obtain \(L_{rkd\_vdms}\) according to Eq. (10). Our experimental results (right panel of Fig. 5) show that each component contributes positively to the final recognition performance, indicating that VDMS can identify nuanced learning difficulties against noise interference from the teacher.


Effect of different hyperparameters. The effect of various hyperparameters is investigated through experiments on the loss balancing factors in Eqs. (11) and (12) using the HAM10000 dataset. The results for different factor values are detailed in Tables 6 and 7. It should be mentioned that the hyperparameters \(\gamma\) and \(\beta\) in Eq. (12) are set empirically with reference to the prior works AT [10] and SP [16] and are thus not included in the experimental analysis. Ultimately, the values of \(\lambda_{1}\) and \(\lambda_{2}\) are set to 0.5 and 0.3, and the value of \(\eta\) is set to 0.25 in our proposed method.

Table 6 Effect of different balancing factors on student model accuracy (%) on the HAM10000 dataset
Table 7 Effect of different balancing factors on the accuracy (%) of the ResNet18 student model on the HAM10000 dataset

Visualization

The visualization performance of our proposed VAdaKD is presented in this subsection. We examine VAdaKD's ability to identify nuanced learning difficulties in the presence of noise interference from the teacher across different stages. Correlation visualization results are shown using the same batch of dermatology samples as input, with vanilla KD and AdaKD serving as benchmarks (Fig. 6). The results reveal that VAdaKD mines more relevant information at stages 1 and 2 than vanilla KD, suggesting that VAdaKD can effectively utilize inter-sample information in these two stages. At stage 3, VAdaKD shows a weaker inter-sample correlation than AdaKD, but at stage 4 it perceives richer inter-sample details, suggesting that at stage 4 VAdaKD identifies more nuanced classification difficulties by mining multi-hop information between dermatology samples against noise interference from the teacher.

Fig. 6

Comparison of the visualization results of the correlation between KD, AdaKD, and VAdaKD at different stages using the same batch of dermatology samples as input. The visualization results of the correlation from stage 1 to stage 4 are represented in columns 1 to 4, respectively

To further demonstrate the discriminative representation of VAdaKD, we utilize t-SNE to visualize the multi-classification effect of our proposed VAdaKD in skin lesion classification. We illustrate the effectiveness of VDMS in skin lesion classification alongside logits-based, intermediate feature-based, and relationship-based distillation. In addition, we also visualize the results of BoostResNet, Stu w/o GCN, and Stu w GCN, as presented in Fig. 7. From the figure, it is evident that VAdaKD achieves a more pronounced 'high cohesion, low coupling' classification effect on the Dermnet dataset; for example, VAdaKD significantly outperforms the other distillation frameworks on the 14th and 22nd categories.

Fig. 7

Comparison of our proposed VDMS strategy in VAdaKD with the GCN strategy and three knowledge distillation methods in skin lesion classification. VAdaKD achieves a more pronounced 'high cohesion, low coupling' classification effect on the Dermnet dataset, whereas the dispersion between multiple categories is slightly weaker in the other methods. VAdaKD outperforms the other distillation frameworks for categories 14 and 22, indicating its superior performance. 'Stu w/o GCN' denotes that the student model is trained using AdaBoost but without distillation and without the GCN strategy; 'Stu w GCN' denotes that the student model is trained using AdaBoost along with GCN but without distillation

Conclusion

This research proposes a Variational AdaBoost Knowledge Distillation framework, VAdaKD, which adopts an active knowledge distillation paradigm that enables the student to actively acquire and extract the knowledge most valuable for classifying dermatology samples at the current stage. The VDMS strategy helps reduce noise interference from the teacher by maximizing the mutual information between the teacher and student. Our proposed VAdaKD framework demonstrates superior performance on the Dermnet, ISIC2019, and HAM10000 datasets compared to other knowledge distillation methods, and the experimental results underscore the efficacy of the proposed active knowledge distillation framework in the classification of dermatology images.

Globally, 1.9 billion people are impacted by skin diseases. With a scarcity of dermatologists, many turn to general practitioners for dermatologic care, leading to less precise diagnoses. This research focuses on developing lightweight dermatology diagnostic models for mobile devices. Implementing these models could significantly enhance the practicality of dermatologic diagnostic systems, facilitating early screening and enhancing diagnosis and treatment in rural regions, ultimately leading to substantial economic advantages.