1 Introduction

Deep learning algorithms exhibit remarkable capabilities in computer-aided detection and diagnosis (CAD) across diverse applications [1], including disease classification [2,3,4], segmentation [5, 6], and medical object detection, e.g., of pulmonary nodules [7] or lymphocytes [8], among others. In particular, the emergence of annotated X-ray imaging datasets [9,10,11] has enabled research on many applications based on deep neural networks, greatly benefiting pathology diagnosis and prognosis.

Nevertheless, the performance of models trained on medical images depends heavily on several factors that can notably degrade the results. Key challenges include the scarcity of annotated data and the substantial cost associated with expert labeling [12]. Compared to general-purpose computer vision datasets, a medical image dataset usually contains relatively few images, and in some cases, only a small percentage of them are annotated by experts [1]. In addition, there is commonly a considerable imbalance between negative (healthy) and positive (pathological) samples. Moreover, the resulting models strongly depend on the specific data domain for which they were trained. All these challenges collectively hinder the development of effective, robust, and generalizable methods for processing medical images [13], and only a few approaches based on deep learning techniques are eventually certified for clinical usage [14].

A standard solution to deal with the scarcity of annotated medical imaging data due to its associated high costs is data augmentation [15, 16]. This technique generates synthetic samples from existing images, expanding the training dataset. However, the distinctive characteristics of medical images, such as their high dimensionality, intricate structures, and substantial inter- and intra-class variability, present challenges when applying traditional data augmentation techniques [17]. Therefore, designing effective augmentation strategies for medical imaging often requires domain expertise involving radiologists or medical professionals who can provide guidance and validation.

Another widely adopted solution for addressing the limited availability of annotated data is using transfer learning [18]. This technique involves leveraging knowledge acquired from a domain with sufficient labeled data and applying it to another domain by fine-tuning the model. In this process, the weights of a pre-trained model are used as an initialization for a new model. Transfer learning has gained significant interest for medical imaging [19,20,21]. Notably, it helps reduce the amount of labeled data required for training, accelerates convergence, and yields models with better generalization capabilities. These generalized models can be effectively transferred to other domains, enabling inter-domain use.

The issue of highly imbalanced data is another common challenge in medical imaging, where the number of positive samples is often significantly lower than that of negative ones [22]. Machine learning models trained on imbalanced data tend to exhibit bias towards the majority class while neglecting samples from the minority class [23]. Consequently, this leads to suboptimal performance for the underrepresented samples, which can have severe consequences in detecting specific pathologies and could pose a risk to patients in critical scenarios.

A small dataset is also more prone to overfitting, causing the model to lose generalization capability when the training data is insufficient. Few-shot learning (FSL) algorithms address this issue. These methods can be categorized [24] into metric-based, optimization-based, and transfer learning-based approaches. Metric-based FSL learns a representation by comparing training examples through Siamese networks [25], matching networks [26], prototypical networks [27], or relation networks [28]. Optimization-based FSL [29] learns the parameters of any standard model via meta-learning in such a way as to prepare that model for fast adaptation. These techniques include Model-Agnostic Meta-Learning (MAML) [29], LSTM-based meta-learner models [30], and Proto-MAML [31]. Finally, transfer learning-based approaches include fine-tuning [32] and linear models learned on top of a pre-trained embedding [33], such as k-Nearest Neighbor (kNN) [34], Support Vector Machine (SVM) [35], or Random Forest (RF) [36].

Although FSL has been studied extensively, only a few of these techniques [37] have been investigated for medical imaging. In [38], a MAML algorithm is adopted for a few-shot problem with medical images, and the Dice loss function is used to mitigate class imbalance. Different FSL methods are compared in [24] for the skin condition recognition problem in which class imbalance exists, showing that when combined with conventional imbalance techniques, they lead to better performance, especially for the rare classes.

Some previous works evaluate few-shot approaches using COVID-19 X-ray images. In [2], the effects of different k-way, n-shot configurations and loss functions are examined using the dataset from Dr. Cohen [39] to identify positive COVID-19 images. In [40], Siamese networks are also explored using a dataset with 226 positive images from [41]. In [42], a Siamese network with triplet loss is used to classify CT scan images into Normal, COVID-19, and Community-Acquired Pneumonia.

While most previous COVID-19 X-ray image classification approaches focus on FSL architectures, in this work we evaluate how a series of techniques affects these architectures. The main objective is to investigate the accuracy of learning-based models in the medical imaging domain, focusing on their behavior in imbalanced few-shot scenarios. In [43], we studied the effect of different techniques to deal with imbalanced data, but for scenarios with sufficient samples. The evaluation was performed on different chest X-ray datasets labeled with COVID-19 positive and negative diagnoses. Here, we extend this previous work by proposing and evaluating similar techniques, adapted to the few-shot learning paradigm with imbalanced data. In particular, we use a metric-based FSL method based on Siamese networks [44] into which a series of proposals are integrated to mitigate the effects of scarce and imbalanced data, including different initialization methods, transfer learning, data augmentation, four proposals adapted to Siamese neural networks to deal with imbalanced data, and four alternative classifiers to carry out the final prediction.

To carry out the evaluation, four publicly available chest X-ray image datasets [9,10,11, 45] are considered. Three corpus pairs are created from these, each containing positive and negative samples of COVID-19 patients. The performance of these techniques is evaluated in both intra-domain (within the same domain) and inter-domain (across different domains) use cases, as well as for four levels of data imbalance. The results of the experiments show that the low number of parameters resulting from the shared weights of the two Siamese branches, together with the included proposals, improves the results and reduces both the tendency to overfit and the amount of data required for training.

The remainder of the paper is organized as follows: Sect. 2 outlines the proposed approach to address the challenges discussed earlier; Sect. 3 presents the experimental setting used to evaluate the approach, including details about the datasets used for experimentation; Sect. 4 presents and analyzes the evaluation results obtained from applying the proposed techniques; and Sect. 5 finally concludes the paper by summarizing the key findings and contributions of the study. Additionally, it outlines potential directions for future research in the medical imaging domain and the challenges that remain to be addressed.

2 Methodology

This section describes the methodological proposal to address the challenges that learning-based methods commonly face when dealing with medical image datasets. As mentioned, these challenges are mainly data scarcity and the intrinsic class imbalance of the data distribution.

Fig. 1 Diagram of the pipeline of the process. The proposed techniques to be studied are highlighted in yellow

Figure 1 illustrates the pipeline steps followed during the training and inference stages. Formally, let \(\mathcal {T} = \left\{ \left( I_{i}, c_{i}\right) : I_{i}\in \mathcal {I}, c_{i}\in \mathcal {C}\right\} _{i=1}^{\left| \mathcal {I}\right| }\) represent a set of labeled images where \(\mathcal {I}\) denotes the input data space and \(\mathcal {C}\) the set of possible categories. Let also \(\zeta : \mathcal {I}\rightarrow \mathcal {C}\) be the function that relates the input image \(I_{i}\) with its associated class \(c_{i}\), i.e., \(\zeta \left( I_{i}\right) = c_{i}\).

During the training phase, the aim is to learn an approximation of \(\zeta \), denoted by \(\hat{h}_w\), implemented through a learning-based network parameterized with a set of weights w. To learn \(\hat{h}_w\), the training set \(\mathcal {T}\) is used to minimize the network error according to a given loss function \(\mathcal {L}\). This work analyzes the improvement brought to this learning process by different techniques that address the challenges posed.

In the proposed pipeline, input data is first processed to balance the sampling and adjust the data distribution of \(\mathcal {T}\). A data augmentation process is also considered to generate more training samples artificially. This preprocessed data is then used to learn the function \(\hat{h}_w\), for which a Siamese architecture is considered, as it is specially devised for few-shot scenarios. Different initialization techniques, including transfer learning, are also studied in this step. Besides, a weighted loss function \(\mathcal {L}_w\) is introduced to address the imbalance and improve the model training further.

Once the training is completed, the inference stage is carried out. Given a set of query data \(\mathcal {Q} = \left\{ I_{q}\right\} \subset \mathcal {I}\), inference is performed by considering the estimated function \(\hat{h}_w\) to calculate the final prediction \(\hat{c}_q\), i.e., \(\hat{h}_w\left( I_{q}\right) = \hat{c}_q\). For this, a new model \(\hat{h}_w\) is generated from the weights w of one of the parallel networks of the Siamese architecture. The network then processes the query sample \(I_q\) to extract its embedding representation, which is compared with the embeddings (also called Neural Codes or NC) obtained for the training set \(\mathcal {T}\) to compute the final prediction.

The following sections provide a detailed explanation of each step of this process, starting with the definition of the Siamese architecture.

2.1 Siamese architecture

The Siamese architecture consists of two identical parallel networks with shared weights, which process two input images to determine whether they are similar, i.e., whether they belong to the same class. This configuration is especially suitable for few-shot learning scenarios for two main reasons. On the one hand, it simplifies the task, as it only aims to determine the similarity of the images and not their class. On the other hand, the pair-wise arrangement of the set \(\mathcal {T}\) increases the number of samples used to train the model (in practice, \(M = \left( {\begin{array}{c}|\mathcal {T}|\\ 2\end{array}}\right) \) possible pairs may be generated). Therefore, this arrangement results in greater input data variability, favoring convergence.

Let \(\mathcal {P} = \left\{ \left( \left\{ I_{a}, I_{b}\right\} , y_i\right) : \left\{ I_{a}, I_{b}\right\} \in \mathcal {I}, y_{i}\in \mathcal {Y}\right\} _{i=1}^{M}\) represent the set with all possible pairs of images \(\left\{ I_{a}, I_{b}\right\} \) drawn from the defined input space \(\mathcal {I}\) and \(y_{i}\in \mathcal {Y}\) be a binary indicator depicting whether the input pair is similar or different. The Siamese architecture initially maps the input pair \(\left\{ I_{a}, I_{b}\right\} \) using the networks \(h_w\) to a new N-dimensional space \(\mathcal {X} \subseteq \mathbb {R}^{N}\), obtaining the feature vectors \(\textbf{x}_{a}\) and \(\textbf{x}_{b}\), respectively. In this new space, given a dissimilarity metric \(d: \mathcal {X} \times \mathcal {X} \rightarrow \mathbb {R}^{+}_{0}\), a dissimilarity score \(D_{w}\) between \(\textbf{x}_{a}\) and \(\textbf{x}_{b}\) is calculated. This value is meant to be zero when the images are identical and to grow proportionally with their degree of dissimilarity. Note that \(D_{w}\) should be thresholded (either heuristically or with a learning-based method) to establish whether the inputs are similar. The block labeled “Siamese” in Fig. 1 shows a graphical scheme of this architecture.
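As a rough illustration of this two-branch arrangement, a minimal PyTorch-style sketch could look as follows (the backbone is left abstract; this is our own schematic, not the exact implementation used in this work):

```python
import torch
import torch.nn.functional as F

class SiameseNetwork(torch.nn.Module):
    """Two identical branches with shared weights: a single module h_w applied twice."""

    def __init__(self, backbone: torch.nn.Module):
        super().__init__()
        self.h_w = backbone  # maps an image to an N-dimensional embedding in X

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
        x_a = self.h_w(img_a)  # feature vector x_a
        x_b = self.h_w(img_b)  # feature vector x_b
        return F.pairwise_distance(x_a, x_b)  # dissimilarity score D_w
```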

The Siamese networks are trained using the so-called contrastive loss which, for a single pair of data \(\left( I_{a}, I_{b}\right) \), is defined as:

$$\begin{aligned} \mathcal {L}\left( w,\left( y,I_{a},I_{b}\right) \right) = \left( 1-y\right) \cdot D^{2}_{w} + y \cdot \max \left( 0, m - D_{w} \right) ^2 \end{aligned}$$
(1)

where \(D_{w}\) stands for the dissimilarity value between input elements, i.e., \(D_{w} = d\left( x_{a},x_{b}\right) \), y for the binary class-matching indicator, and m represents a separation margin following the proposal by Hadsell et al. [46] to define a hinge or maximum margin loss.

From this, the total loss \(\mathcal {L}_{S}\) can be calculated as the sum of the partial losses for each pair in \(\mathcal {P}\), i.e., \(\mathcal {L}_{S} = \sum _{i=1}^{|\mathcal {P}|}\mathcal {L}\left( w,\left( y, I_{a}, I_{b}\right) ^{(i)}\right) \).
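For concreteness, Eq. 1 and the total loss \(\mathcal {L}_{S}\) admit a direct transcription (a sketch under the conventions above, with y = 1 for dissimilar pairs):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(d_w: torch.Tensor, y: torch.Tensor, m: float = 1.0) -> torch.Tensor:
    """Eq. 1: y = 0 for same-class (similar) pairs, y = 1 for different-class pairs."""
    loss = (1 - y) * d_w.pow(2) + y * F.relu(m - d_w).pow(2)  # relu(x) = max(0, x)
    return loss.sum()  # total loss L_S over a batch of pairs
```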

In this context, this work studies the performance of this scheme in imbalanced few-shot scenarios and the improvement that different additional mechanisms bring to this process, such as initialization techniques, transfer learning, data augmentation, and proposals to balance the data distribution, as introduced in the following sections.

2.2 Siamese initialization

In a few-shot learning scenario, the initialization of the neural network weights plays a crucial role since it can influence both the final result and the number of samples needed for training [47]. To assess its effect on the task at hand, three initialization strategies are studied:

  • Training from scratch: The network is initialized with random weights, leading to a learning process that begins from scratch. This approach typically requires a larger set of labeled data for the model training to converge.

  • Initializing the network with ImageNet pre-trained weights: Although ImageNet constitutes a very different domain, leveraging knowledge from this large-scale dataset reduces the training time and data requirements, potentially accelerating the learning process and improving the results obtained.

  • Transfer learning: This approach initializes the network using the weights obtained with a similar X-ray dataset for which there is a larger availability of labeled data and then applies a fine-tuning process to the target distribution. In this way, training starts from a good initialization and can benefit from the knowledge extracted from a closer domain while adapting to the particularities of the new data. Note that, in this case, due to the larger quantity of data, the initial training may be carried out on the \(h_w\) backbone used in the Siamese (without pairwise training) and then construct the Siamese architecture from this.
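For illustration only, the three strategies could be selected along the following lines, with torchvision's ResNet-50 standing in for the ResNet-50 v2 backbone actually used (Sect. 3.2) and a hypothetical checkpoint path for the X-ray pre-trained weights:

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

def build_backbone(init: str, embedding_dim: int = 128) -> torch.nn.Module:
    if init == "scratch":
        net = resnet50(weights=None)                      # random initialization
    else:
        net = resnet50(weights=ResNet50_Weights.DEFAULT)  # ImageNet pre-training
    # Replace the classification head with an embedding projection
    net.fc = torch.nn.Linear(net.fc.in_features, embedding_dim)
    if init == "transfer":
        # Weights of a backbone pre-trained on a larger, similar X-ray corpus
        net.load_state_dict(torch.load("xray_pretrained.pt"))  # hypothetical path
    return net
```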

2.3 Data augmentation

Due to its good results, data augmentation has become a de facto standard in training learning-based methods. This technique increases the size and diversity of a training dataset by applying transformations to the existing samples, which may include rotations, skew, scaling, cropping, flipping, and contrast or color adjustments, among others. The introduced variability improves the trained models’ robustness and generalizability and reduces overfitting, making it a valuable tool for small training sets.

However, the effectiveness of each transformation largely depends on the specific task to be solved. In the context of medical imaging, its unique properties require a more cautious approach when applying data augmentation [15, 48]. Some inappropriate transformations can hide or alter certain findings that could be key to diagnosing a pathology (for example, a flip operation would change the heart’s position). Consequently, we have considered a limited set of transformations that do not alter the shape or invert the position of elements in the image. Specifically, the effect of the following set of transformations is studied as the value of the \(\alpha \) parameter increases:

  • Horizontal and vertical shifts (in the range of \([-\alpha , \alpha ]\)% of the image size).

  • Scaling (in the range of \([-\alpha , \alpha ]\)% of the original image size).

  • Rotations (in the range of \([-\alpha ^\circ , \alpha ^\circ ]\)).
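Under these restrictions, the transformation set could be realized, for instance, with torchvision (the library choice and exact parameter mapping are our assumption):

```python
from torchvision import transforms

def shape_preserving_augmentation(alpha: float) -> transforms.RandomAffine:
    """Shifts, scaling, and rotations bounded by alpha; deliberately no flips,
    which would invert anatomical positions (e.g., the heart's side)."""
    return transforms.RandomAffine(
        degrees=alpha,                             # rotations in [-alpha, alpha] degrees
        translate=(alpha / 100, alpha / 100),      # shifts up to alpha% of the image size
        scale=(1 - alpha / 100, 1 + alpha / 100),  # scaling within [-alpha, alpha]%
    )
```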

2.4 Imbalanced data

While previous sections have focused on solutions for small training sets, this section describes the techniques aimed at dealing with data imbalance. For this, four proposals are assessed: balancing the sample distribution, weighting the loss function, combining balancing with the loss, and modifying the ratio of positive and negative pairs. Note that when we talk about positive and negative pairs in the Siamese network, we mean pairs of images belonging to the same class and pairs of images of different classes, respectively, regardless of whether they represent sick or healthy cases.

As previously indicated, the total number of possible training pairs is calculated as \(M = \left( {\begin{array}{c}|\mathcal {T}|\\ 2\end{array}}\right) \), encompassing both positive and negative pairs. Accordingly, this total can be split into \(M = M_P + M_N\), the sum of possible positive pairs (denoted as \(M_P\)) and negative pairs (\(M_N\)). From this, we define the imbalance ratio between positive and negative pairs as \(r={M_P}\big /{M_N}\), which is perfectly balanced when \(r=1\). The balanced sampling proposal aims to obtain a ratio of \(r=1\) by equalizing the number of samples from each class (healthy or with COVID-19) so that the possible positive and negative pairs are balanced. This approach is analogous to the Oversampling technique explored in our previous work [43], which involved duplicating the samples of the minority class; in this case, however, it is adapted to the requirements of Siamese networks. Undersampling is not considered since it has been proven to yield poor results, which would be even worse in this scenario with few data.
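A minimal sketch of this balanced sampling step (our own formulation in plain Python) could be:

```python
import random

def oversample_minority(samples: list, labels: list) -> tuple[list, list]:
    """Duplicate minority-class samples until both classes have the same size,
    so that the positive/negative pair ratio approaches r = 1."""
    pos = [s for s, c in zip(samples, labels) if c == 1]
    neg = [s for s, c in zip(samples, labels) if c == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = random.choices(minority, k=len(majority) - len(minority))  # with replacement
    min_label = 1 if minority is pos else 0
    balanced = majority + minority + extra
    balanced_labels = ([1 - min_label] * len(majority)
                       + [min_label] * (len(minority) + len(extra)))
    return balanced, balanced_labels
```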

A second proposal to deal with imbalanced distributions is to weight the loss function during the training stage. Specifically, this technique increases the value of the error committed for the minority classes to balance their contribution to the overall error. This forces the training process to treat all classes equally and prevents creating a bias towards the majority class. As far as we know, there are no proposals to weight the contrastive loss used by Siamese networks. For this reason, we propose to modify Eq. 1 by introducing the following weighting factor:

$$\begin{aligned} \mathcal {L}_w = \frac{\lambda _{\zeta (I_a)} + \lambda _{\zeta (I_b)}}{2} \left( \left( 1-y\right) \cdot (D_{w}(I_a, I_b))^{2} + y \cdot \max \left( 0, m - D_{w}(I_a, I_b) \right) ^2 \right) \end{aligned}$$
(2)

where the parameters \(\lambda _{c_i}\) represent the factors used to weight the classes \(c_i\) of the samples \(I_a\) and \(I_b\), recovered as \(c_a = \zeta (I_a)\) and \(c_b = \zeta (I_b)\), respectively. \(\lambda _{c_i}\) is calculated as the total number of training samples \(|\mathcal {T}|\) divided by the product of the number of classes \(|\mathcal {C}|\) and the number of samples of class \(c_i\), denoted \(|\mathcal {T}|_{c_i}\). This weighting factor can be expressed as:

$$\begin{aligned} \lambda _{c_i} = \frac{|\mathcal {T}|}{|\mathcal {C}| \cdot |\mathcal {T}|_{c_i}} \end{aligned}$$
(3)
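Extending the earlier contrastive-loss sketch, Eqs. 2 and 3 could be transcribed as follows (assuming integer class labels in \(0, \ldots , |\mathcal {C}|-1\)):

```python
import torch
import torch.nn.functional as F

def class_weight_factors(labels: torch.Tensor, n_classes: int) -> torch.Tensor:
    """Eq. 3: lambda_c = |T| / (|C| * |T|_c)."""
    counts = torch.bincount(labels, minlength=n_classes).float()
    return labels.numel() / (n_classes * counts)

def weighted_contrastive_loss(d_w, y, c_a, c_b, lam, m: float = 1.0) -> torch.Tensor:
    """Eq. 2: each pair's loss is scaled by the mean weight of its two classes."""
    pair_weight = (lam[c_a] + lam[c_b]) / 2
    loss = (1 - y) * d_w.pow(2) + y * F.relu(m - d_w).pow(2)
    return (pair_weight * loss).sum()
```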

As a third proposal, we will study the combined effect of applying balanced sampling and the weighted loss function.

Finally, we also propose modifying the proportion of positive and negative pairs used during network training. Instead of generating a set \(\mathcal {P}\) with the same number of positive and negative pairs, this proportion is changed so that the network, for example, sees many more negative pairs than positive ones (or vice versa). This technique also modifies the data distribution since it requires drawing a sample from each class to create negative pairs. Consequently, the instances from the minority class will be repeated.
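A simple pair generator with a configurable positive/negative proportion (again, an illustrative sketch of ours; sampling with replacement is what causes the minority instances to recur) could be:

```python
import random

def make_pairs(samples, labels, n_pairs: int, pos_fraction: float = 0.5):
    """Yield ((img_a, img_b), y) pairs: y = 0 for same class, y = 1 for different."""
    by_class = {}
    for s, c in zip(samples, labels):
        by_class.setdefault(c, []).append(s)
    classes = list(by_class)
    pairs = []
    for _ in range(n_pairs):
        if random.random() < pos_fraction:           # positive pair (same class)
            c = random.choice(classes)
            a, b = random.choices(by_class[c], k=2)  # with replacement
            pairs.append(((a, b), 0))
        else:                                        # negative pair (different classes)
            c1, c2 = random.sample(classes, 2)
            pairs.append(((random.choice(by_class[c1]),
                           random.choice(by_class[c2])), 1))
    return pairs
```

For instance, `pos_fraction = 5/6` would correspond to the 5/1 proportion studied later in Sect. 4.3, and `pos_fraction = 1/6` to the 1/5 proportion.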

2.5 Inference stage

The Siamese architecture is designed to determine a similarity score that correlates the embedded representations of input elements rather than directly retrieving class labels for classification tasks. Therefore, the following procedure is usually considered to adapt Siamese schemes for classification purposes: given a query sample denoted as \(I_{q}\), the distances between this item and the entire training set \(\mathcal {T}\) are computed in the embedded representation space \(\mathcal {X}\). The query \(I_{q}\) is eventually assigned with the label \(\hat{c}_{q}\), which corresponds to the label of the element that exhibits the minimum distance value. This process can be expressed as follows:

$$\begin{aligned} \hat{c}_{q} = \zeta \left( {{\,\mathrm{arg\,min}\,}}_{I_{i}\in \mathcal {T}} d\left( h_w(I_{q}), h_w(I_{i})\right) \right) \end{aligned}$$
(4)
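Eq. 4 amounts to a nearest-neighbor assignment in the embedded space, e.g. (NumPy sketch):

```python
import numpy as np

def min_distance_classify(query_emb: np.ndarray,
                          train_emb: np.ndarray,
                          train_labels: np.ndarray):
    """Assign the query the label of its nearest training embedding (Eq. 4)."""
    distances = np.linalg.norm(train_emb - query_emb, axis=1)  # Euclidean d(., .)
    return train_labels[np.argmin(distances)]
```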

In addition to this approach (which we will refer to as Histogram), it is proposed to study the improvement provided by a model learned on the embeddings generated by the Siamese network, which could be considered a transfer learning approach according to the literature [33]. Specifically, the trained \(h_w\) network is used to transform the inputs into the embedded representation space \(\mathcal {X}\), on which three alternative methods are applied to calculate the final prediction:

  • k-Nearest Neighbor (kNN) [34]: This algorithm categorizes the given query \(I_{q}\) by identifying the prevailing class among the k nearest elements to it. For this, a dissimilarity metric is used to compare the embedding of the query with those of the training set (NC in Fig. 1).

  • Support Vector Machine (SVM) [35]: This approach transforms the original data into a higher-dimensional space using a specified kernel function. Subsequently, it learns a hyperplane to distinguish between the classes.

  • Random Forest (RF) [36]: This method constructs an ensemble classifier from individual decision trees, each trained on random data subsets. The final output amalgamates the decisions from each tree to calculate the class of the input query.
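All three can be trained directly on the Neural Codes; a minimal scikit-learn sketch (with illustrative hyperparameters; Sect. 4.4 describes the ranges actually tuned) is:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# nc_train / nc_query: embeddings (Neural Codes) of the training and query images,
# y_train: the corresponding training labels (placeholders for this sketch)
classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
    "SVM": SVC(kernel="rbf", C=1.0),
    "RF": RandomForestClassifier(n_estimators=100),
}
for name, clf in classifiers.items():
    clf.fit(nc_train, y_train)
    predictions = clf.predict(nc_query)
```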

3 Experimental setup

This section details the experimental setup, including the selection of datasets, the network architecture and the parameters chosen, the training process details, and the evaluation metrics employed.

3.1 Datasets

The methodology was assessed using four distinct datasets. An overview of these datasets is presented in Table 1, indicating the types of samples they contain (negative (−) or positive (\(+\)) COVID-19 samples), along with the original sizes of the training and evaluation sets. Example images from these datasets are shown in Fig. 2.

Table 1 Initial configuration of the datasets under consideration, showing the type of samples (positive \(+\) and negative − COVID-19 patients), the number of samples per class, and their total (\(\sum \)). Additionally, the size of the training and test sets is provided, along with the percentage of each set with respect to the total size
Fig. 2 Illustrative samples from the evaluated datasets

As seen in Table 1, two of the datasets exclusively contain negative samples of COVID-19 patients. The other two include both classes, although they exhibit notable class imbalance. Three combinations were made from these data to evaluate the proposed methodology, creating new datasets with positive and negative samples, as presented in Table 2. This table introduces an acronym for each combination (to be used in the experimentation section) and specifies the number of positive and negative samples in each newly generated set. As in the previous work [43], the number of samples added from the original datasets was limited to 10,000 to ease the experiments. Additionally, the “mean imbalance ratio” (MeanIR) index is provided to indicate the imbalance level of the corpus [49]. The MeanIR value lies in the range \(\left[ 1, \infty \right) \) and denotes a higher imbalance as the value increases. For this two-class (binary) task, it is defined as \(\text {MeanIR} = (1+|\mathcal {T}_-|/|\mathcal {T}_+|)/2\), where \(|\mathcal {T}_-|\) represents the number of samples in the majority class (in this case, the healthy or COVID-negative patients) and \(|\mathcal {T}_+|\) the number of samples in the minority class (COVID-19 positive patients).
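As an illustration of the formula, a distribution with 100 negative and 10 positive samples (the Medium imbalance configuration used below) yields:

$$\begin{aligned} \text {MeanIR} = \frac{1 + 100/10}{2} = \frac{11}{2} = 5.5 \end{aligned}$$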

Table 2 Description of the new combined datasets derived from Table 1. They include the acronym, partition sizes, the count of positive (\(+\)) and negative (−) COVID-19 samples, and their respective percentages. The MeanIR, an indicator of dataset balance, is also provided

Note that since the size of the training partitions in these corpora does not meet the requirements of a few-shot learning scenario, we artificially reduce their size while leaving the test sets unaltered. Specifically, for the experimentation, 10-fold cross-validation was carried out, selecting for each fold 100 random samples without repetition from the majority class (healthy patients) and n random samples from the minority class (COVID-19+ patients). For the value of n, four possible imbalance scenarios were considered: High imbalance with \(n=1\), Medium imbalance with \(n=10\), Low imbalance with \(n=50\), and No imbalance with \(n=100\). In addition, the effect of the proposed techniques is also studied when the number of samples is increased to 200 and 300 while maintaining the level of imbalance. Note that, in all cases, the evaluation was carried out with the complete test set as indicated in Table 2.
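The per-fold sampling just described could be sketched as follows (NumPy, with hypothetical per-class index arrays):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
IMBALANCE_LEVELS = {"High": 1, "Medium": 10, "Low": 50, "None": 100}

def sample_fold(neg_indices: np.ndarray, pos_indices: np.ndarray, n: int) -> np.ndarray:
    """100 majority (healthy) samples plus n minority (COVID-19+) samples, no repetition."""
    neg = rng.choice(neg_indices, size=100, replace=False)
    pos = rng.choice(pos_indices, size=n, replace=False)
    return np.concatenate([neg, pos])
```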

3.2 Network architecture

The proposed methodology was assessed using ResNet-50 v2 [50]—the same architecture as the baseline method [43], to allow a fair comparison—as the backbone for the \(h_w\) Siamese parallel networks. This standard architecture for image classification is known for its state-of-the-art results in various benchmarks and applications [51]. This updated version of ResNet-50 incorporates identity shortcuts and pre-activation units, enhancing performance and reducing overfitting.

Regarding the rest of the configuration details of the Siamese architecture, the Euclidean distance was considered as dissimilarity function d (i.e., \(D_w = \left\Vert h_w(I_a)-h_w(I_b)\right\Vert _2\)) and the \(\ell _2\) normalization [52] for the regularization of the embedded representations.

For the margin parameter m of the loss function (see Eq. 1), initial experimentation was carried out considering values in the range \(m \in [0, 8]\), obtaining low results for the extremes of this range. The value of \(m=1\) was eventually selected for the rest of the experimentation, as it reported the best results overall.

Throughout all the experiments, the Siamese networks were trained for 200 epochs with a batch size of 32 images. Stochastic Gradient Descent [53] was employed for parameter optimization with a Nesterov momentum of 0.9, a learning rate of \(10^{-2}\), and a decay factor of \(10^{-6}\). The images were scaled to 224\(\times \)224 pixels, and their values were normalized within the range [0, 1] to aid model convergence. The values in this setting match the baseline to ensure a fair comparison.
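In PyTorch terms, this configuration could read roughly as follows; note that mapping the reported decay factor to `weight_decay` is our assumption (in Keras-style setups it would instead be a learning-rate decay):

```python
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # scale images to 224x224 pixels
    transforms.ToTensor(),          # converts pixel values into the range [0, 1]
])

optimizer = torch.optim.SGD(
    model.parameters(),  # `model`: the Siamese network being trained (placeholder)
    lr=1e-2,             # learning rate of 10^-2
    momentum=0.9,        # Nesterov momentum of 0.9
    nesterov=True,
    weight_decay=1e-6,   # stand-in for the reported decay factor of 10^-6
)
# Training loop: 200 epochs with batches of 32 image pairs (not shown).
```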

3.3 Metrics

For the quantitative evaluation, we used the F-measure (\(\text{ F}_{1}\)) as the figure of merit to mitigate potential biases caused by significant label imbalances in the considered datasets. In a binary classification scenario, \(\text{ F}_{1}\) is calculated as the harmonic mean of Precision (P) and Recall (R). The definitions of these metrics are as follows:

$$\begin{aligned} \text{ P }&= \frac{\text{ TP }}{\text{ TP } + \text{ FP }} \end{aligned}$$
(5)
$$\begin{aligned} \text{ R }&= \frac{\text{ TP }}{\text{ TP } + \text{ FN }}\end{aligned}$$
(6)
$$\begin{aligned} \text{ F}_{1}&= \frac{2 \cdot \text{ P } \cdot \text{ R }}{\text{ P } + \text{ R }} = \frac{2\cdot \text{ TP }}{2\cdot \text{ TP } + \text{ FP } + \text{ FN }} \end{aligned}$$
(7)

where TP, FP, and FN denote the number of true positives, false positives, and false negatives, respectively.

The evaluation involved binary-class experiments, so the results are reported in terms of macro-\(\text{ F}_{1}\) for a comprehensive assessment. Macro-\(\text{ F}_{1}\) is computed as the average of the \(\text{ F}_{1}\) scores obtained for each class.
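In practice, this is a one-liner with scikit-learn:

```python
from sklearn.metrics import f1_score

# y_true, y_pred: ground-truth and predicted labels over the test set
macro_f1 = f1_score(y_true, y_pred, average="macro")
```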

4 Results

This section evaluates the proposed methodology using the datasets, network configuration, and metrics described previously. The results of each of the techniques presented earlier, applied to the network of Fig. 1, are analyzed individually to provide a comprehensive assessment. The section starts with the effects of the initialization process, then delves into the data augmentation analysis, contrasts techniques for data imbalance, compares inference classifiers, and examines the influence of training set size. Finally, the section includes a discussion with concluding remarks, comparing the few-shot learning scenario with the results from the prior study that explored these techniques without labeled-data constraints.

In all cases, results are analyzed at intra- and inter-domain levels and for four imbalanced data distributions. These distributions are referred to by the initials H, M, L, and N, where H \(\rightarrow \) High imbalance (100/1), M \(\rightarrow \) Medium (100/10), L \(\rightarrow \) Low (100/50), and N \(\rightarrow \) None (100/100).

4.1 Initialization

As a recap of the pipeline presented in the methodology, one way to cope with small datasets is to use a good initialization of the network weights before starting the training process. In this section, we will focus on studying the effects of different initialization techniques. First, a baseline result is obtained by training the Siamese ResNet-50 v2 backbones from scratch, i.e., using random values as initialization parameters. It is compared to a pre-initialized model whose weights are obtained from a generic dataset—in this case, ImageNet [54]—that is then fine-tuned with our datasets. For the sake of simplicity, we will refer to them as scratch and pre-initialized models.

Table 3 shows the macro-\(\text{ F}_{1}\) results of both approaches, training from scratch and ImageNet weight initialization, for the four levels of data imbalance considered. This table also includes detailed results for each possible training-to-evaluation dataset combination. The “From” column indicates the training source, whereas “To” refers to the evaluation set. Hence, we evaluate cases within the same domain (intra-domain), which are underlined, and also inter-domain cases in which the model is assessed on domains different from its training source. The best result per experiment and imbalance level is marked in bold, i.e., the best figure obtained according to the initialization method, either from scratch or pre-initialized. For instance, the value 45.1 appears in bold in the first column (corresponding to the BIMCV-COVID test set) because the training from scratch approach is better than the weight initialization (which obtains a 44.7 in this case). On the contrary, the pre-initialized model achieves higher performance in the high imbalance cases for the Chest-Git (with 42.2) and Pad-BIM (with 46.0) test sets.

From a global perspective, the results show that, in most cases, the performance of the pre-initialized network achieves better results, especially for the cases with High, Medium, and Low imbalance. Regarding the None imbalanced experiments, the results obtained are quite similar for both initialization approaches. This makes it clear that the architecture presented can learn efficiently regardless of initialization, even for this low-data scenario. The high variability generated by possible combinations of training pairs makes it not so dependent on initialization. However, in the case of High imbalance, this architecture appears to struggle with convergence during training, as a single example from the minority class may be insufficient. These results improve progressively as the level of imbalance decreases. It is also noteworthy that the average intra-domain results are promising starting from a Medium imbalance, especially considering that it is a few-shot scenario.

Table 3 \(\text{ F}_{1}\) results achieved by training the model from scratch and initializing with ImageNet weights. In each scenario, the intra-domain cases are underlined for clarity. Each case is analyzed considering four levels of imbalance: High, Medium, Low, and None

To further analyze the effect of initialization, we will now examine the impact of transfer learning by pre-training with an alternative X-ray dataset, which may be considered another technique to address the data scarcity issue. The results of this experiment are shown in Table 4. Based on the weights obtained with ImageNet, a pre-training is performed with a dataset from a similar domain (“Pre-trained” column), for which a larger amount of labeled data is available (in this case, considering 1700 training instances). Then, a fine-tuning process is carried out to the source dataset (“From” column) and evaluated for the target set (“To” column). As before, four imbalance levels are assessed, from High to None. Similarly to the previous table, bold values refer to the best performance, but in this case, they are compared to the best initialization method reported in Table 3. For example, the value 48.0 in the first row and column High of Table 4 appears in bold because the best initialization value for this same case in Table 3 is lower (42.2). However, the first value in the second column, 47.9, is not marked because the corresponding one in Table 3 obtains a better result (61.7) for weight initialization training.

Knowing this, we can see that transfer learning improves, in general terms, on the previous results of the pre-initialized models. This suggests that parameters from a network trained with data of a similar typology help to find good features for the task at hand. If we pay attention to the Medium column, most values are not better than the previous ones. This may happen because the network has to re-learn features from the training set (“From” dataset) with very little positive data, i.e., the scarcity of the minority class makes it harder to differentiate the classes. Under High imbalance, however, a single positive sample has little effect on the re-training process. Even though these results are somewhat better, as before, the model also seems to have convergence issues in the High imbalance case due to having only one minority sample.

On the other hand, the figures reported for the Low and None imbalance levels outperform almost every result in the previous experiment, especially in the intra-domain scenarios. Clearly, in the case of a balanced or nearly balanced dataset, pre-training with data of a similar typology improves the results, as it initializes the network with better parameters that lead to better classification. Interestingly, even in inter-domain scenarios, the results, while slightly subdued, remain promising. This suggests that even with domain shifts, transfer learning can provide foundational knowledge that outpaces starting afresh or leveraging broader, less task-specific initializations like ImageNet.

Table 4 \(\text{ F}_{1}\) results obtained through the transfer learning technique. The initial column specifies the dataset used for model pre-training, the “From” column signifies the dataset used for fine-tuning, and the “To” column represents the dataset considered for evaluation. The intra-domain cases are underlined in each scenario. Each case is analyzed for four levels of imbalance: High, Medium, Low, and None

4.2 Data augmentation

Another approach to address the scarcity of labeled data is to apply transformations to generate synthetic images from the available samples. In this section, the results of this process are analyzed by applying the transformations described in Sect. 2.3, which include horizontal and vertical shifts, scaling, and rotations. The result obtained by increasing the \(\alpha \) factor with which they are applied is analyzed for each of these transformations. Specifically, the following set of values is considered: \(\alpha \in \{0, 1, 5, 10, 15\}\).

The graphs depicted in Fig. 3 show the results of these experiments for the four different imbalanced data distributions. In this case, the general trend is that data augmentation does not improve classification. In some cases, mainly intra-domain and under high imbalance, it even degrades performance. This might be caused by the distinctiveness of medical X-ray images: data augmentation introduces unrealistic characteristics into the training data, hindering the classification process.

Fig. 3 Data augmentation results. Five levels of augmentation are shown, from \(0\%\) to \(15\%\), for the four different levels of data imbalance, High to None

4.3 Dealing with imbalanced data

This set of experiments addresses the data imbalance problem and analyzes the results obtained by applying the techniques proposed in Sect. 2.4. Table 5 shows these results, arranged similarly to the previous experiments, with the training set in the “From” column and the evaluation set in the “To” column. Three cases are evaluated: the weighted loss function (which gives more importance to the minority class, i.e., COVID-19+ cases), the balanced sampling technique (oversampling the minority class so that the Siamese pairing sees an equal number of samples from each class during training), and the combination of both (column “Bal. + W.Loss”).

The data in bold refer to the best performance per row and imbalance level. Focusing on the average values at the bottom of the table, we can see that the oversampling technique achieves the best results in High imbalanced cases. This makes sense as it compensates for the high imbalance by feeding the network with more minority samples. Nevertheless, for the rest of the cases, the average results of combining oversampling with the proposed weighted loss function provide the best classification.

Table 5 Comparison of the \(\text{ F}_{1}\) results obtained through the balancing techniques: weighted loss function, oversampling minority data, and the combination of weighted loss and oversampling. Results for the four data distributions considered, from High to None. The best results per line are marked in bold

Next, the fourth proposal to deal with imbalanced data is evaluated: the ratio of positive/negative pairing (referring to pairs from the same or different classes) during the training process of the Siamese network. Figure 4 presents the \(\text{ F}_{1}\) results for five pairing ratios and for the four data distributions, from High to None. In particular, the Siamese network is trained with pairing proportions from five positives for every negative (5/1), through three positives for every two negatives (3/2), up to one positive for every five negatives (1/5). Note that in the case of High imbalance, where there is only one positive sample (a COVID-19 infected patient) along with 100 negative (healthy) samples, the 5/1 pairing will only have this same image for the negative pairs. Consequently, this image will be presented to the network in every batch, leading to overfitting. Therefore, the classification accuracy will drop when evaluated with varied positive data (other COVID cases). This phenomenon is further analyzed in the following paragraph.

From Fig. 4, we can observe that in the high imbalance scenario, the pairing hardly affects the performance, with \(\text{ F}_{1}\) results around 50 at all the pairing levels studied, which denotes the previously mentioned problem: the sparse positive data (COVID cases) in the dataset leads the network to overfit and underperform on the test set. However, in Low and None imbalance, the intra-domain \(\text{ F}_{1}\) is notably higher and improves as more negative pairs are presented to the network. This is because when more negative (i.e., different-class) pairs are fed to the Siamese network, it learns better features to distinguish the classes and, hence, classifies better.

In summary, for this approach to handling imbalanced data, we can conclude that adjusting the pairing ratio has no effect in situations with high imbalance, whereas in the remaining, more balanced distributions, pairing with a greater number of negative pairs seems beneficial for the Siamese training process.

Fig. 4 Pairing experimentation results. Five different ratios of positive/negative pairs, for the High to None data distribution cases

4.4 Inference classifier

This section focuses on analyzing the improvement provided by the final classifier used in the proposed pipeline. We have previously reported the results using the histogram method (see Sect. 2.5)—simply choosing the class that minimizes the distance—as it represents the commonly used approach. These results are now compared with those obtained using three alternative classifiers trained on the embeddings generated by the Siamese network, namely kNN, RF, and SVM, using the same distance metric as before, that is, the Euclidean distance. For each of these methods, the baseline parametrization of [44] was initially considered, although these hyperparameters were also studied and tuned for this scenario, eventually selecting the best configurations, which include k values within the range \(k \in [1, 15]\), a number of tree estimators \(t \in [10, 500]\) for RF, and Linear, Polynomial, and Radial Basis kernel functions for SVM with a learning cost \(c \in [1, 9]\).

Table 6 shows the outcomes of these experiments, comparing the performance of the four classifiers across inter- and intra-domain levels and for the four imbalanced distributions considered. A general analysis of these results shows that the SVM classifier reports an improvement in all scenarios except for high levels of imbalance, for which the use of the histogram-based or kNN-based approaches seems more advisable. If we analyze the results at the inter- and intra-domain levels, it is observed that SVM generates a model that generalizes better to other domains, while the solutions based on histogram and kNN are more effective within the same domain.

Table 6 Comparison of the \(\text{ F}_{1}\) results obtained by the different inference classifiers considered: Histogram, kNN, RF, and SVM. The best result for each imbalanced scenario is marked in bold

4.5 Analysis of the training set size

In this section, the performance of the proposal is evaluated as the training set size increases. These results are also compared with those obtained by training a single backbone (the CNN ResNet-50 v2 architecture, which is also the one analyzed in the previous work [43]). Regarding the size of the training set, in addition to the data distributions with 100 samples for the majority class used in the previous experiments, the amount of data is increased to 200 and 300 samples following the same imbalanced distributions: High \(\rightarrow \{100/1, ~200/2, ~300/3\}\), Medium \(\rightarrow \{100/10, ~200/20, ~300/30\}\), Low \(\rightarrow \{100/50, ~200/100, ~300/150\}\), and None \(\rightarrow \{100/100, ~200/200, ~300/300\}\).

The results of these experiments are depicted in Fig. 5 for both the Siamese network and the CNN at the inter- and intra-domain levels. The first aspect to highlight is that in the case of High imbalance, the error is quite similar for both models, with both achieving a low \(\text{ F}_{1}\) performance and the CNN being the lowest in most cases. This shows that the two architectures have problems learning this highly imbalanced distribution.

Generally, the lower the imbalance, the better the results for the intra-domain scenarios. When studying the Low and None cases, we can see that intra-domain models are remarkably better, with the CNN slightly ahead in both cases. An additional observation from the graphs is that the Siamese network stabilizes earlier than the CNN.

From the information in the charts of Fig. 5, we can conclude that the Siamese network works better for High and Medium-imbalanced datasets. In contrast, using this network is not necessary in cases of balanced data. On the other hand, the fact that the inter-domain training processes maintain a low \(\text{ F}_{1}\) score regardless of the imbalance level demonstrates that the networks do not generalize properly.

Fig. 5 Comparison of the Siamese and CNN network architectures. The evaluation is carried out for three training set sizes (100, 200, and 300 samples for the majority class) and High to None imbalance data distributions

4.6 Discussion

This last section summarizes the improvements provided by each technique studied for a few-shot learning scenario with imbalanced datasets. These results are compared with those obtained in the previous study [43] using equivalent techniques for imbalanced datasets but applied to a CNN when there is no limitation of labeled data. This comparison aims to shed light on whether these techniques are consistent in their results or, on the contrary, performance depends on the amount of information available or the network architecture.

Table 7 shows a summary of results for all the previous approaches and both inter- and intra-domain cases, indicating the percentage of improvement relative to the base case, which is the model trained from scratch as described in Sect. 4.1. For the sake of fair comparison, the improvement percentages shown for the Siamese network correspond to Medium imbalance (100/10), since this represents the data distribution most similar to the one studied in the previous work using a CNN. In the table, the CNN cases without results are marked with “–”, either because they were not considered in the previous study or because they do not apply to a CNN, such as the pairing ratio.

From this general analysis, it can be observed that the various techniques studied offer promising improvements over the baseline in almost all cases. However, the most advisable technique may depend on whether the learning problem involves limited data or there are no restrictions on labeled data. In the few-shot learning case, it seems more advisable to use a better initialization and a classifier learned from the embeddings of the Siamese network during inference. With no data restrictions, however, using oversampling and a weighted loss proves much more beneficial. This may be because, in a few-shot scenario with a single sample, repeating that sample many times or giving it a high weight generates overfitting towards the minority class, limiting generalization capabilities.

The best technique to select also depends on the application domain. For instance, in few-shot scenarios, if the goal is better inter-domain generalization, the use of transfer learning and SVM is recommended. On the other hand, if the aim is to be more effective within the source domain, a general initialization—with ImageNet, which does not create a bias towards other distributions—is more appropriate, employing the proposed oversampling combined with a weighted loss function. When there are no data restrictions, these conclusions change slightly. For example, in addition to the weighted loss and oversampling—which in this case are recommended to be used separately since they provide a more notable improvement—it is always advisable to initialize using transfer learning. This may be because having more data available for fine-tuning eliminates the risk of creating bias.

Table 7 Summary of the improvements obtained by each of the techniques proposed for the Siamese architecture (in the case of the inference classifier, only the two best are included). These results are compared to those obtained using equivalent techniques on a CNN in our previous work [43]

5 Conclusions

This study delves into the performance of various techniques in the challenging context of few-shot learning with imbalanced medical datasets. The results shed light on the intricate dynamics between the amount of data, distribution imbalance, and model architecture. While some of the studied techniques are well-established in the literature, others are not, such as the adaptation proposals to deal with imbalanced data. Besides, this work focuses on evaluating their effectiveness in the context of medical imaging and examining their performance when used in combination with Siamese architectures.

First, we focus on the initialization of network parameters for few-shot scenarios. The main conclusion is that pre-training the model using transfer learning, either with general data or with data from a similar domain, helps improve the generalization capabilities of the model in this challenging data-sparse scenario. Several data augmentation techniques have also been studied, concluding that applying standard transformations with medical imaging for few-shot scenarios is not a good practice due to the peculiarities of these data.

Furthermore, four approaches have been proposed to address data imbalance, including a weighted loss biased to the minority class, balancing the samples, and modifying the pairing ratio of positive and negative samples. The conclusions are that, in cases of high imbalance, balancing the samples by repeating the minority data helps improve the results. However, when the dataset is not highly imbalanced, combining a weighted loss with balanced data allows the network to learn better features. Different pairing ratios between the same and other classes in the Siamese training were also studied. In this case, when the datasets are balanced and the pairing ratio shifts towards more negative (different) pairs than positive, intra-domain results improve since this helps to learn features that distinguish between classes.

Regarding classification, the evaluation included four approaches: Histogram, kNN, RF, and SVM. The SVM classifier is more accurate in all inter- and intra-domain scenarios except for high levels of imbalance, where the histogram-based (for intra-domain) or kNN-based (for inter-domain) approaches report better results.

Finally, we compared the Siamese network (with the different techniques introduced for dealing with few-shot and imbalanced datasets) against a standard CNN from previous works. We first studied the impact of the training set size with different data distributions. The main conclusion of this experiment is that, in highly imbalanced situations, the performance of both the Siamese and the standard CNN is low, with the former slightly better. However, in balanced cases, the intra-domain results improve with the dataset size, whereas the inter-domain ones do not, showing limited generalization capabilities. Afterward, we compared the different initializations, data augmentation, and imbalance solutions for the CNN and the Siamese network. The general observation from this study validates the intuition that the specific technique to be applied depends on the amount of data available and the application domain.

For future work, these techniques could also be adapted and studied for matching, prototypical, and relation networks to compare them to the Siamese approach. In addition, alternative network architectures other than ResNet-50 could be evaluated. Data augmentation guided by experts for the medical domain could also be included, as well as additional datasets. Regarding initialization, alternative techniques, such as Self-Supervised Learning, could be evaluated for scenarios with data scarcity.