1 Introduction

Image Recognition and Classification constitutes the subfield of Computer Vision that aims at assigning images a category out of a discrete set of labels [44]. While this task has historically been addressed with classic Pattern Recognition and Signal Processing techniques, the development of the Deep Learning (DL) paradigm led to a considerable renewal of the field with a significant improvement in performance rates [36]. Examples of the success of DL in image-related tasks can be found in areas as disparate as Optical Music Recognition [9], Medical Image Analysis [24], or Scene Recognition [53], among others.

Among the different advantages DL models offer, one of their most remarkable features is their inherent feature learning capability [4]: out of the raw input data, these neural architectures infer a suitable set of features, typically referred to as Embedded Spaces or Neural Codes [2], for tackling the task at hand [25]. For instance, in classification tasks, this mid-level space has been shown to provide higher discriminative capabilities than hand-crafted features [57]. This capability allows DL architectures to be used as feature extractors for training alternative shallow learning models in a process known as transfer learning [38], which may bring additional advantages such as improved classification rates or support for search and indexing tasks, among others [11, 18].

Nevertheless, the main drawback of the supervised DL framework is the high amount of data required for training the model [30, 55]. Thus, when this constraint cannot be fulfilled, these neural architectures are not capable of obtaining a representative embedded representation for the recognition task. In this regard, regularization strategies [12] and data augmentation processes [40] constitute two of the most common proposals to alleviate this issue [31].

While the aforementioned techniques somewhat palliate this limitation, there exist cases of remarkable data scarcity in which the different classes of the corpus are represented by a considerably limited number of examples. These scenarios are known as Few-Shot Learning (FSL) [50] and, in extreme situations, they may even require the prediction of labels not present in the training data. To tackle them, there exist particular neural architectures trained to estimate the similarity of the data [41], with the so-called Siamese Neural Networks (SNNs) [27], Matching Networks [49], Prototypical Networks [42], and Relation Networks [43] being some of the most typical examples. This work focuses on the SNN scheme, which is thoroughly described in the rest of the work. For an exhaustive review of alternative strategies, the reader is referred to the work by Jadon [22].

Originally proposed by Bromley et al. [7], SNNs consist of two identical Deep Neural Networks with shared weights that work in tandem on two different inputs to obtain their corresponding embedded representations, as in a feature learning task. Once the inputs are mapped to the target space, their similarity is measured using a certain distance metric. Given their particular training mechanism, the intermediate feature-based representation maps elements deemed similar to close positions while pushing dissimilar input objects apart [2]. While these architectures comprise several neural layers for performing this space learning, the fact that the weights are shared across the networks results in fewer parameters to train, which in turn requires less data and reduces the tendency to overfit. In addition, it must be highlighted that the pair-wise arrangement of the input data results, in practice, in a larger set with higher variability compared to the initial training set, which also favors the convergence of the scheme. Also note that recent SNN proposals consider additional strategies to further reduce the number of parameters, such as the use of attention mechanisms [51].

Since these SNN architectures were devised to measure the similarity between elements, they have traditionally been applied to verification tasks such as signature recognition [1, 23] or face authentication [45]. In contrast, recent works have considered the possibility of directly applying these schemes to classification by incorporating an additional layer once the initial datum is mapped to its embedded representation. Some examples of works that develop this concept are that by Pan et al. [37] for text classification and that by Nanni et al. [35] for animal sound identification.

Within this context, this work revisits and further expands the use of Siamese neural schemes for few-shot image recognition tasks. For that, we initially study the representation capabilities of existing base Siamese architectures and then propose additional mechanisms based on data augmentation—both for train and test elements—and transfer learning for improving the recognition results. Note that, while none of the individual methods contemplated constitutes a novelty by itself, to the best of our knowledge no previous work has considered them in an integrated manner in Siamese schemes for addressing image-related FSL tasks. Let us also emphasize the relevance of the proposal within the wider image recognition field since, in some cases such as medical imaging, it may not only be difficult to gather samples but there may also exist very scarce examples of a certain class (e.g., a rare pathology or an uncommon disease) to work with [33]. Conversely, when tackling larger data collections, we acknowledge that the presented proposal may not be competitive against other alternatives that, for their part, may not converge when addressing an FSL scenario.

Considering the above, the precise contributions of this work are: (i) a thorough study and comparison of the capabilities of Siamese architectures for obtaining suitable embedded representations in scenarios with severe data scarcity; (ii) an empirical assessment of the benefits of train data augmentation for artificially generating new examples with the aim of improving the feature learning process; (iii) modeling the classification task as a transfer learning problem and studying the relevance of considering an adequate shallow classification method for performing this process; and (iv) the use of test data augmentation to mimic ensemble learning processes for improving the overall performance of the scheme. These studies are thoroughly assessed with a series of experiments comprising several image corpora, neural configurations, and classification methods.

The rest of the paper is structured as follows: Section 2 provides the theoretical background of the work; Section 3 describes the gist of Siamese Neural models; Section 4 presents the methodology devised for the study; Section 5 shows the corpora, metrics, and neural topologies considered for assessing the proposal; Section 6 presents and discusses the results obtained in the experimentation; finally, Section 7 concludes the work and poses future ideas to address.

2 Background

The high amount of data required for the convergence of DL-based neural models constitutes one of their major limitations [6]. This limitation becomes especially challenging when addressing Few-Shot Learning scenarios due to their inherent scarcity of samples [55]. To mitigate this issue, the literature comprises a wide range of functional solutions, namely regularization methods, which aim at increasing the generalization capabilities of these schemes directly at the model level. Shorten and Khoshgoftaar [40] summarize the most common procedures into four groups: (i) Dropout strategies, which consist of randomly deactivating neurons in the model during the training phase; (ii) Batch normalization approaches, which apply a standardization step to the input data of the different layers to stabilize the learning phase; (iii) Pretraining methods, which stand for the process of training a model on a sufficiently large corpus and then fine-tuning the network on a new target corpus with the same label space; and (iv) Transfer Learning approaches, which train a network on a large corpus, then copy part of the weights to a novel architecture, and eventually carry out a fine-tuning process using a new target corpus with a different label space and a shallow recognition method.

Besides these functional solutions, some works in the scientific literature have proposed neural schemes particularly devised for tackling the scarcity of training data, with Siamese Neural Networks (SNNs) [27] being one of the most representative examples. These architectures, which are thoroughly described in Section 3 due to their relevance in this work, are capable of obtaining a suitable embedded representation out of the initial data for estimating the similarity degree between pairs of input data using a very reduced number of parameters.

Along with these architecture-related solutions, data augmentation is also considered one of the main approaches for addressing the data scarcity issue in neural-based learning schemes [46]. This procedure is based on the premise of creating additional artificial samples by performing controlled distortions on the initial training data to increase the data variability and, hence, obtain more robust and accurate models. It must be highlighted that, while no single distortion has been proved to improve robustness for every image corpus and recognition task [32], some of the most common procedures when working with this type of data are rotation, translation, color inversion, scaling, and cropping, among others [13].

As commented, data augmentation aims at increasing the variability of the data during the training phase of the neural model. Due to its reported benefits in classification schemes, this concept has recently been extended to the inference phase of the model under the name of test data augmentation [8]. This paradigm states that, given a certain query image to be categorized, it is possible to derive and classify different versions of that element obtained with standard augmentation procedures and then take the mode of the individual predictions as the label of the initial query. This procedure imitates an ensemble-based learning method, as the class decision is taken according to the predictions of a set of individual non-independent classifiers. To the best of our knowledge, this paradigm has not yet been applied to image-based Few-Shot Learning works given its recency.

The presented strategies allow tackling the limitations of deep neural schemes when addressing scenarios with a shortage of labeled data. However, to the best of our knowledge, no work has specifically addressed and analyzed the possible combination of such different mechanisms for improving the representation capabilities of Siamese-based classification schemes in the context of Few-Shot Learning for image recognition scenarios. In this regard, the present work initially studies different SNN architectures—described in the next section—as well as the influence of their hyperparameters. Besides, based on the good results reported in the image recognition literature, the relation between the goodness of the embedded representation and the train and test data augmentation procedures is assessed. Finally, inspired by the work of Das and Lee [14], our proposal also examines the applicability of transfer learning for improving the classification results by copying the weights obtained by the base Siamese schemes to an alternative architecture in which the neural models act as feature extractors and shallow classifiers perform the recognition task.

3 Siamese neural networks for few-shot learning

This section formally presents the gist of Siamese Neural Networks, as they constitute the basis of this work. Besides, different existing extensions to the base architecture are introduced so that their performance can be comparatively assessed in the experimentation. Note that, as aforementioned, the key part of these architectures is the embedded representation obtained by the inner neural models, whose weights are adjusted during the training phase of the Siamese scheme.

For the sake of clarity, let \(\mathcal {T} = \left \{\left (I_{i}, c_{i}\right ): I_{i}\in \mathcal {I}, c_{i}\in \mathcal {C}\right \}_{i=1}^{\left |\mathcal {I}\right |}\) denote a set of labeled elements where \(\mathcal {I}\) represents a source data space and \(\mathcal {C}\) the set of possible categories or classes. Let also \(\zeta : \mathcal {I}\rightarrow \mathcal {C}\) be the function which relates datum Ii with its associated class ci, i.e., \(\zeta \left (I_{i}\right ) = c_{i}\).

Considering two elements Ia,Ib from the defined input space \(\mathcal {I}\), the Siamese architecture initially maps them, using two identical Deep Neural Networks with shared weights w, to a real-valued N-dimensional space \(\mathcal {X} \subseteq \mathbb {R}^{N}\) as feature vectors xa and xb, respectively. In this new space, given a dissimilarity metric \(d : \mathcal {X} \times \mathcal {X} \rightarrow \mathbb {R}^{+}_{0}\), a similitude score Dw between the elements may be retrieved. Ideally, this figure is meant to be zero when the input data are equivalent, drifting away from this value proportionally to the dissimilarity degree of the input elements. Note that this Dw value may be thresholded (either heuristically or with a learning-based method) for categorically establishing whether the inputs are similar or not. Fig. 1 shows a graphical example of this neural model.
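As an illustrative sketch (not taken from the reference implementation), the following TensorFlow/Keras snippet builds a single CNN encoder with shared weights w and connects it to two input branches whose embeddings are compared with the Euclidean distance; layer sizes and names are assumptions made for the example:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_encoder(input_shape, embedding_dim=256):
    """CNN mapping an image to an N-dimensional embedding (no classification head)."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(embedding_dim)(x)
    return Model(inp, out, name="encoder")

def build_siamese(input_shape, embedding_dim=256):
    encoder = build_encoder(input_shape, embedding_dim)  # shared weights w
    i_a = layers.Input(shape=input_shape)
    i_b = layers.Input(shape=input_shape)
    x_a, x_b = encoder(i_a), encoder(i_b)                # same encoder on both branches
    # Euclidean distance between the two embeddings, i.e., the similitude score D_w
    d_w = layers.Lambda(
        lambda t: tf.sqrt(tf.reduce_sum(tf.square(t[0] - t[1]), axis=-1, keepdims=True) + 1e-12)
    )([x_a, x_b])
    return Model([i_a, i_b], d_w, name="siamese")
```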

Fig. 1 Representation of a Siamese Neural Network architecture. Input elements Ia and Ib are mapped to vectors xa and xb, respectively, with the Deep Neural Networks of shared weights w. The similitude score Dw is obtained using the dissimilarity metric d, which is eventually used for computing the contrastive loss

This architecture is trained using the so-called contrastive loss [19] which, for a single pair of data \(\left (I_{a}, I_{b}\right )\), is defined as:

$$ L\left( w,\left( y,I_{a},I_{b}\right)\right) = (1-y)\cdot L_{S}\left( D_{w}\right) + y\cdot L_{D}\left( D_{w}\right) $$
(1)

where \(y\in \left \{0,1\right \}\) is a binary indicator depicting whether the input pair \(\left (I_{a}, I_{b}\right )\) is deemed similar or different, and \(L_{S}\left (\cdot \right )\) and \(L_{D}\left (\cdot \right )\) represent the partial losses for similar and dissimilar objects, respectively. Note that the partial loss \(L_{D}\left (\cdot \right )\) actually represents a hinge or maximum-margin loss in which a margin parameter m must be optimized during the training stage. Generalized to a set \(\mathcal {P}\) comprising pairs of both similar and dissimilar data, the overall loss \({\mathscr{L}}_{S}\left (w\right )\) is obtained as the sum of the partial losses for each pair in \(\mathcal {P}\), i.e., \({\mathscr{L}}_{S}\left (w\right )={\sum }_{i=1}^{|\mathcal {P}|}L\left (w,\left (y,I_{a},I_{b}\right )^{(i)}\right )\).

It must also be pointed out that this pair-wise arrangement of the set \(\mathcal {T}\) increases, in practice, the number of samples used for training the model to \(\binom {|\mathcal {T}|}{2}\) elements. Moreover, this disposition also results in a higher variability of the data at the input, which favors the convergence of the neural scheme.

Hoffer and Ailon [20] extended the Siamese architecture by increasing the number of input branches, obtaining the so-called Triplet Network, which is shown in Fig. 2. This model is fed with three data samples: a reference element Ir, known as the anchor, and two other data Ip and In, which respectively represent elements similar and dissimilar to the anchor Ir. Note that, as in the Siamese model, these elements are embedded into the N-dimensional space \(\mathcal {X}\) as feature vectors xr, xp, and xn with a stack of Deep Neural Network models with shared weights. Furthermore, considering a set \(\mathcal {J}\) of data arranged in triplets as previously described, the architecture is trained using the triplet loss [39], which is defined as:

$$ \mathcal{L}_{T}\left( w\right) = \sum\limits_{i=1}^{|\mathcal{J}|}{\max(0, D_{p} - D_{n} + m)^{(i)}} $$
(2)

where \(D_{p} = d\left (\mathbf {x}_{r},\mathbf {x}_{p}\right )\) and \(D_{n} = d\left (\mathbf {x}_{r},\mathbf {x}_{n}\right )\) denote the anchor-positive and anchor-negative distances of the i-th triplet, respectively, and m stands for the separation margin.
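A minimal sketch of this triplet loss, assuming the embeddings x_r, x_p, and x_n are provided as batched tensors of shape (batch, N), could read as follows (illustrative code):

```python
import tensorflow as tf

def triplet_loss(x_r, x_p, x_n, m=1.0):
    """Triplet loss of Eq. (2) summed over a batch of embedded triplets."""
    d_p = tf.norm(x_r - x_p, axis=-1)  # anchor-positive distance D_p
    d_n = tf.norm(x_r - x_n, axis=-1)  # anchor-negative distance D_n
    return tf.reduce_sum(tf.maximum(0.0, d_p - d_n + m))
```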

Fig. 2 Representation of a Triplet Neural Network architecture. Elements Ip, Ir, and In are mapped to vectors xp, xr, and xn, respectively, with the Deep Neural Network of shared weights w. Similitude scores Dp and Dn are obtained using the dissimilarity metric d for eventually computing the triplet loss

Bell and Bala [3] extended these architectures by addressing the process as a multi-task problem [10]. Their proposal states that including the categorical predictions of the N-dimensional feature vectors in the general loss may further improve the mapping capabilities of the models. In this regard, the embedded representations obtained by the Deep Neural Network models are additionally connected to a fully-connected network of \(|\mathcal {C}|\) neurons with a softmax activation, thus contributing with a certain weight 𝜃 to the overall loss. Fig. 3 shows a graphical example of this extension applied to a base Siamese architecture.

Fig. 3 Representation of a Siamese Neural Network architecture including categorical predictions. Elements Ia and Ib are mapped to vectors xa and xb, respectively, with the Deep Neural Network of shared weights w. These vectors are classified using a fully-connected network of \(|\mathcal {C}|\) units with a softmax activation, obtaining predicted classes \(\hat {c}_{a}\) and \(\hat {c}_{b}\), which are used for computing the overall loss together with the similitude score Dw

Mathematically, this extension of the base Siamese architecture is trained using the following loss function:

$$ \mathcal{L}_{SC}\left( w\right)=\sum\limits_{i=1}^{|\mathcal{P}|}{L\left( w,\left( y,I_{a},I_{b}\right)^{(i)}\right) + \theta\left[L_{C}\left( \hat{c}_{a}, \zeta\left( I_{a}\right)\right)^{(i)} + L_{C}\left( \hat{c}_{b}, \zeta\left( I_{b}\right)\right)^{(i)}\right]} $$
(3)

where \(L_{C}\left (\cdot , \cdot \right )\) represents the categorical cross-entropy loss of the input elements weighted by the value \(\theta \in \mathbb {R}\), and \(\hat {c}_{a}\) and \(\hat {c}_{b}\) denote the categories estimated in each branch.
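For illustration purposes, the following sketch implements the loss of Eq. (3), instantiating the contrastive term with the squared-margin formulation used later in the experiments (Eq. (5)); tensor shapes and argument names are assumptions of the example:

```python
import tensorflow as tf

def siamese_categorical_loss(d_w, y, logits_a, logits_b, c_a, c_b, m=1.0, theta=0.25):
    """Multi-task loss: contrastive term plus theta-weighted categorical cross-entropy.

    d_w: (batch,) distances; y: (batch,) pair indicator in {0, 1};
    logits_a/logits_b: (batch, |C|) class scores; c_a/c_b: (batch,) integer labels.
    """
    contrastive = (1.0 - y) * tf.square(d_w) + y * tf.square(tf.maximum(0.0, m - d_w))
    cce = tf.keras.losses.sparse_categorical_crossentropy
    class_term = cce(c_a, logits_a, from_logits=True) + cce(c_b, logits_b, from_logits=True)
    return tf.reduce_sum(contrastive + theta * class_term)
```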

Note that, since these schemes are not meant to retrieve a class but a similitude score relating the embedded representations of the input elements, they are not directly applicable to classification tasks. Thus, as a base case, the following procedure is considered to adapt Siamese schemes for classification purposes: for a given query Iq, the distances between this element and every element of the train set \(\mathcal {T}\) are computed in the embedded representation; eventually, query Iq is assigned the label \(\hat {c}_{q}\) of the training element that minimizes this distance. Mathematically, this may be expressed as:

$$ \hat{c}_{q} = \zeta\left( \underset{\forall I_{i}\in\mathcal{T}}{\arg\min} d\left( \text{DNN}(I_{q}), \text{DNN}(I_{i})\right)\right) $$
(4)

where \(\text {DNN} : \mathcal {I} \rightarrow \mathcal {X}\) denotes the function which obtains the embedded representation of the argument provided and \(d:\mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}_{0}^{+}\) stands for a given dissimilarity metric.
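A minimal NumPy sketch of this classification rule, assuming the training embeddings and labels have already been precomputed, is given below:

```python
import numpy as np

def classify_query(query_embedding, train_embeddings, train_labels):
    """Nearest-neighbor rule of Eq. (4) in the embedded space.

    train_embeddings: (|T|, N) array of DNN(I_i); train_labels: (|T|,) array of c_i.
    """
    distances = np.linalg.norm(train_embeddings - query_embedding, axis=1)
    return train_labels[np.argmin(distances)]
```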

Within this context, this work assesses the representation capabilities of the introduced Siamese architectures and their extensions when hybridized with transfer learning and data augmentation for few-shot image classification. More precisely, we study the extent to which the embedded spaces generated by Siamese architectures manage the inherent data scarcity of an FSL scenario and how the posed additional mechanisms may improve the overall recognition rates.

4 Methodology

Figure 4 shows the proposed experimental scheme. As aforementioned, this work assesses the representation capabilities of the embedded spaces obtained by Siamese neural schemes, aided by data augmentation processes, and their application to image classification tasks using transfer learning strategies and ensemble-based recognition. For that, the pipeline comprises several processing blocks which implement these processes individually so that their relevance in the overall scheme can be assessed.

Fig. 4 Graphical representation of the proposal. During the training phase, the data arrang. block arranges the reference augmented data \(\mathcal {T}'\) according to whether a Siamese or a Triplet architecture is used in SNN for then, upon convergence, retrieving the NC representations. During the inference phase, query Iq is initially augmented for obtaining set \(I_{\mathcal {Q}}\) of size M + 1; after that, the CNN feature extraction retrieved from the SNN block is used for mapping \(I_{\mathcal {Q}}\) to \(x_{\mathcal {Q}}\); the Classifier stage retrieves set \(C_{\mathcal {Q}}\) with a single class estimation for each input element for eventually obtaining prediction \(\hat {c}_{q}\) using the mode operator

During the training phase, the set of images \(\mathcal {T}\) undergoes a data augmentation process for increasing the variability at the input, retrieving collection \(\mathcal {T}^{\prime }\), whose size depends on the number of image alterations considered. This set is arranged in the data arrang. block into either pairs or triplets, depending on whether a Siamese or a Triplet Neural Network architecture is later considered, as described in Section 3. These arrangements are used in the SNN block for training either the Siamese or the Triplet Neural Network architecture considered. Finally, once the network converges, the embedded representation of the data, here denoted as NC from the acronym of Neural Codes, is obtained using the feature extraction Deep Neural Network of the Siamese/Triplet scheme. Note that, since this work focuses on image data, the rest of the manuscript leaves out the generic Deep Neural Network term in favor of the Convolutional Neural Network (CNN) one. Based on the same consideration, the space of our input data may be defined as \(\mathcal {I}\subseteq \mathbb {R}^{f\times w\times h}\), where f, w, and h stand for the number of channels, width, and height of the images, respectively.

At inference, the query image \(I_{q}\in \mathcal {I}\) is artificially augmented with the same processes as in the training stage, retrieving the set \(I_{\mathcal {Q}} = [I_{1}, \ldots , I_{M+1}]\) of augmented image samples, where M stands for the number of generated images and \(I_{i}\in \mathcal {I}\) with \(1 \leq i \leq M + 1\). This set is then mapped to the embedded representation using the same CNN as in the training phase, which results in the collection of embedded representations \(x_{\mathcal {Q}} = [x_{1}, \ldots , x_{M+1}]\), where \(x_{i}\in \mathcal {X}\). At this stage, considering a particular classifier, the set \(x_{\mathcal {Q}}\) is processed to obtain the group \(\hat {C}_{\mathcal {Q}} = [\hat {c}_{1}, \ldots , \hat {c}_{M+1}]\), with \(\hat {c}_{i}\in \mathcal {C}\), of individual labels. The mode operator is eventually applied to this set \(\hat {C}_{\mathcal {Q}}\) of predictions to estimate the class \(\hat {c}_{q}\). Note that different alternative classifiers will be assessed in the experimental section of the work to quantify the influence of this procedure.
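The inference stage may be sketched as follows, where `augment`, `encoder`, and `classifier` are assumed to be the augmentation function, the Siamese-trained CNN, and the shallow classifier of Fig. 4, respectively (illustrative code, not the reference implementation):

```python
import numpy as np
from collections import Counter

def predict_with_test_augmentation(i_q, augment, encoder, classifier, M=10):
    """Augment the query M times, embed all M + 1 images, classify, and take the mode."""
    images = np.stack([i_q] + [augment(i_q) for _ in range(M)])  # set I_Q of size M + 1
    x_q = encoder.predict(images, verbose=0)                     # embedded representations x_Q
    c_q = classifier.predict(x_q)                                # individual labels C_Q
    return Counter(c_q.tolist()).most_common(1)[0][0]            # mode -> final prediction c_q
```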

Finally, it must be remarked that the particular neural architectures, as well as the parameters considered in each of the blocks, are discussed later, as they are rather related to the experimental part of the work.

5 Experimentation setup

This section presents the experimental details of the proposed study. More precisely, the details concerning the image corpora examined, the data augmentation procedures, the evaluation protocol, the neural topologies, and the loss functions corresponding to the neural Siamese architectures are introduced.

5.1 Corpora

Four image corpora were considered for this study: the MNIST (Modified National Institute of Standards and Technology) [29] and USPS (United States Postal Service) [21] collections of grayscale images of isolated handwritten symbols, the Fashion MNIST [52] collection of grayscale images of Zalando articles, and the CIFAR-10 corpus [28] of color images by the Canadian Institute For Advanced Research, devised for object recognition tasks. For all these corpora we considered the original train and test splits defined in the respective reference works. Table 1 provides a detailed description of these corpora in terms of the number of instances per data partition, classes, and image characteristics.

Table 1 Summary of the corpora considered

Note that, since the size of the training partitions in these corpora does not match the requirements of a Few-Shot Learning scenario, we artificially reduce their size while leaving the test divisions unaltered. This reduction is performed by randomly selecting T samples (an experimentation parameter) of each class, i.e., \( \mathcal {T}_{R} = \cup _{j=1}^{|\mathcal {C}|}\mathcal {S}_{j} \) where \( \mathcal {S}_{j} \in _{R} \left \{(I_{i}, c_{i})\in \mathcal {T} : c_{i} = c_{j}\right \} \) and \(|\mathcal {S}_{j}| = T\). Besides, all images were scaled to the range [0,1] by dividing by 255 to facilitate the convergence of the neural models.
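As an illustration, the following NumPy sketch performs this per-class subsampling and rescaling under the stated assumptions (integer labels and a fixed random seed):

```python
import numpy as np

def subsample_per_class(images, labels, T, seed=0):
    """Keep T random samples per class and rescale pixel values to [0, 1]."""
    rng = np.random.default_rng(seed)
    keep = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=T, replace=False)
        for c in np.unique(labels)
    ])
    return images[keep].astype("float32") / 255.0, labels[keep]
```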

Regarding the evaluation protocol, this work resorts to classification accuracy as the figure of merit since the corpora considered for the experimentation depict balanced distributions regarding class representation.

5.2 Data augmentation procedures

As commonly considered in the DL literature, we have contemplated different data augmentation procedures to improve the overall performance of the scheme. However, as previously stated, this work considers such augmentation strategies both for improving the generalization capabilities of the model by increasing the variability of the train data and for mimicking an ensemble-based approach by creating different test samples from the query one.

For both cases, the specific image-based augmentation processes considered are based on those used in the work by Calvo-Zaragoza et al. [8]. The actual procedures were: (i) rotations in the range \(\left [-20^{\circ },20^{\circ }\right ]\); (ii) zoom in the range \(\left [0.5, 1.5\right ]\) times the size of the image; and (iii) horizontal and/or vertical translations in the range \(\left [-10, 10\right ]\%\) with respect to each dimension of the image. The precise parameter values used in our work were selected by performing an initial exploration considering the corpora contemplated in this study.
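For reference, these ranges may be expressed with a Keras ImageDataGenerator configuration such as the following sketch; the mapping of the reported ranges to the exact generator parameters is an assumption of the example:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,        # rotations in [-20, 20] degrees
    zoom_range=[0.5, 1.5],    # zoom between 0.5x and 1.5x the image size
    width_shift_range=0.1,    # horizontal translations of +/- 10%
    height_shift_range=0.1,   # vertical translations of +/- 10%
)
```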

It must be noted that, as will be shown in the experimentation, the considered image augmentation procedures prove to be useful for our task. However, future research efforts contemplate the exploration of alternative augmentation procedures, since this point has not been thoroughly assessed in this work.

5.3 Siamese Neural architectures

In terms of the neural models, Table 2 describes the particular CNN architectures considered for obtaining an appropriate embedded representation—no final dense layer for classification purposes is included—out of the input data for each of the considered corpora. The design rationale behind these schemes was that of devising a set of simple architectures capable of adequately addressing the corpora at hand without necessarily achieving state-of-the-art results. Also note that the relative shallowness of these models favors their convergence with a reduced amount of data compared to other deeper alternatives typically considered in the general image processing field.

Table 2 CNN network configurations considered

These schemes have been proved to retrieve suitable embedded representations of the different corpora considered in cases with plenty of training data. In this regard, since this work aims at studying and tackling Few-Shot Learning scenarios, the use of Siamese-based architectures is considered for training those CNN models in cases of data scarcity.

Regarding the experimental details of the Siamese architectures, which correspond to the SNN block in Fig. 4, the Euclidean distance is considered as the dissimilarity function d and the \(\ell_{2}\) normalization [56] is used for the regularization of the embedded representations. In addition, this work proposes and compares three particular implementations of the contrastive loss function introduced in Eq. (1) for training the schemes, which are:

$$ \begin{array}{@{}rcl@{}} L_{a} &=& \left( 1-y\right) \cdot {D^{2}_{w}} + y \cdot \max\left( 0,m-D_{w}\right)^{2} \end{array} $$
(5)
$$ \begin{array}{@{}rcl@{}} L_{b} &=& \left( 1-y\right) \cdot {D^{2}_{w}} + y \cdot \max\left( 0,m-{D_{w}^{2}}\right) \end{array} $$
(6)
$$ \begin{array}{@{}rcl@{}} L_{c} &=& - \left[\left( 1-y\right) \cdot \log\left( 1-\frac{D_{w}}{N}\right) + y \cdot \log\left( 1-\frac{N - D_{w}}{N}\right) \right] \end{array} $$
(7)

where N represents the dimensionality of the embedded space, Dw stands for the dissimilarity value between input elements, i.e., \(D_{w} = d\left (x_{a},x_{b}\right )\), m for the separation margin, and y for the binary class-matching indicator. Note that all these loss functions follow the premises posed by Hadsell et al. [19] about the design of such methods.
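The three variants may be sketched as follows, assuming d_w holds the batch of distances, y the binary pair indicator, m the margin, and N the embedding dimensionality (illustrative TensorFlow code):

```python
import tensorflow as tf

def loss_a(d_w, y, m=1.0):
    # Eq. (5): squared distance for similar pairs, squared hinge for dissimilar ones
    return (1 - y) * tf.square(d_w) + y * tf.square(tf.maximum(0.0, m - d_w))

def loss_b(d_w, y, m=1.0):
    # Eq. (6): hinge applied to the squared distance for dissimilar pairs
    return (1 - y) * tf.square(d_w) + y * tf.maximum(0.0, m - tf.square(d_w))

def loss_c(d_w, y, N=512):
    # Eq. (7): logarithmic formulation normalized by the embedding dimensionality N
    eps = 1e-7  # numerical stability, avoids log(0)
    return -((1 - y) * tf.math.log(1 - d_w / N + eps)
             + y * tf.math.log(1 - (N - d_w) / N + eps))
```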

Finally, all models are trained for 200 epochs with mini-batch gradient-based optimization using a batch size of 32 samples. The specific optimization algorithm is not fixed at this point since it constitutes one of the parameters to be evaluated during the experimentation phase of the work.

6 Results

This section presents the results obtained for the proposed methodology with the considered experimental setup. For an easier understanding, considering a base Siamese architecture, initial experimentation is carried out to optimize the different hyperparameters of the scheme. After that, the different extensions to the standard Siamese scheme introduced in Section 3 are evaluated with the aim of optimizing the neural architecture. Then, the capabilities of transfer learning are assessed and compared to the different baseline cases. Later, the use of test data augmentations for mimicking an ensemble classification as well as the influence of train data augmentation for obtaining a suitable embedded representation are evaluated. Next, the influence of the initial training set size on the overall success of the recognition process is studied. Subsequently, the improvement of the proposal against the base cases is analyzed in terms of statistical significance. Finally, a summary of the most remarkable insights gained from the experiments performed is provided.

6.1 Siamese network optimization

This part of the work describes the experimental process followed for obtaining the proper architecture and parameters of the Siamese scheme. Note that, since the target metric at this point is the classification rate, the procedure presented at the end of Section 3 which allows using these schemes in classification tasks is considered.

6.1.1 Hyperparameters evaluation

An initial point to tackle is the adjustment of the different hyperparameters of the architecture. For that, a standard Siamese neural scheme is considered with the corresponding CNN architecture for each corpus (see Table 2) under the following initial conditions: 100 training samples per class, no data augmentation, the Stochastic Gradient Descent (SGD) method [34] as the optimizer for the training phase, an embedded space size of N = 256, a positive-to-negative pairing proportion of 1:4, a cumulative loss \({\mathscr{L}}\left (w\right )\) based on Eq. (1) with the La formulation (Eq. (5)), and a margin m = 1 for the \(L_{D}\left (\cdot \right )\) hinge loss. Starting with these initial conditions, each hyperparameter is individually adjusted while fixing the others.

For the sake of compactness, the results of the optimization of all the hyperparameters are shown in Fig. 5 while the related analysis and discussion are later introduced.

Fig. 5 Hyperparameter optimization of the proposed Few-Shot Learning classification scheme considering the base Siamese Neural Network architecture

The influence of the optimization algorithm on the training stage is analyzed by comparing the following strategies: the Adaptive Learning Rate Method (Adadelta) [54], the Adam method [26], the Adam method extended with Nesterov momentum (Nadam) [16], the Root Mean Square Propagation (RMSProp) [47], and the classic Stochastic Gradient Descent (SGD) strategy [34]. The results shown in Fig. 5(a) report that the Adam approach achieves the best overall classification rate, improving the results obtained with the Nadam and RMSProp methods by approximately 0.2%. Also note that the SGD optimizer depicts the least competitive performance, as it roughly achieves 57% in the considered figure of merit when averaging the individual results for all corpora.

Regarding the embedded space size N, Fig. 5(b) shows that small sizes (up to 128) report poor classification rates. As this value is increased, the overall accuracy improves, reaching a plateau around size 512. Thus, the rest of the experiments consider this latter value.

With respect to the input data pairing proportion, Fig. 5(c) shows the influence of this parameter, where the x-axis represents the ratio between the number of same-class and different-class pairs. As may be checked, results generally improve when more negative than positive pairs are considered, with a plateau being reached at a proportion of 5 negative pairs per positive one. This proportion is selected for the rest of the experiments.

Figure 5(d) shows the results obtained for each of the contrastive loss functions described in Section 5.3. As may be checked, functions La and Lb, which achieve classification rates around 80% on average for all corpora, clearly outperform Lc, which depicts an average performance figure of roughly 48%. In this regard, the rest of the work considers the La loss since, while only marginally improving over Lb, it shows the best overall results.

Having fixed the contrastive loss function, the margin m of the hinge loss component LD must be adjusted. For that, the margin parameter is varied in the range \(m\in \left [0.5,8\right ]\) to study its relevance to the classification performance. The results obtained, which are graphically shown in Fig. 5(e), report that performance is penalized at both ends of the considered range of values, with the best performance obtained when \(m\in \left [0.5,2\right ]\). Hence, the value m = 1 was selected for the rest of the experimentation.

As the last point to address, the impact of train data augmentation on the classification performance of the scheme is studied. Figure 5(f) shows the results obtained when these transformation processes are randomly applied to the training data batches. As shown, the average performance improves by approximately 2.5% when considering an augmentation factor of 6, with no other remarkable benefits observed when this value is further increased.

Regarding the individual performance rates achieved for the different corpora considered, it may be observed that the MNIST and USPS collections generally report similar classification figures, whereas for the Fashion set these results typically decrease by around 10%. The CIFAR-10 corpus, as shown, achieves the lowest classification rates, suggesting an inherently higher difficulty of this collection compared to the other three.

Finally, the result of this hyperparameter optimization process is an average classification rate of 80.86% for the base Siamese architecture and corpora studied, which constitutes a competitive figure considering the limited amount of data used in each corpus. The relevance of this stage must be pointed out since, when the neural scheme is not adequately configured, results may decrease to remarkably low performance values such as 48% when not selecting an adequate loss function (cf. Fig. 5(d)) or 57% when not using the proper optimizer (cf. Fig. 5(a)).

6.1.2 Neural topologies

Having optimized the hyperparameters of the scheme, the problem is now formulated as a multi-task scenario by incorporating the class predictions into each of the branches of the architecture. As commented in Section 3, this architecture combines the contrastive loss obtained by the Siamese scheme with the individual losses obtained by directly classifying the embedded representation of the input elements, with the latter weighted by a certain value 𝜃 called the class loss contribution. In this regard, Table 3 shows the classification results obtained with the multi-task Siamese architecture for the different corpora considered when varying the class weight 𝜃.

Table 3 Influence analysis of the class contribution to the overall contrastive loss \({\mathscr{L}}(w)\) in terms of accuracy

As it can be observed, the inclusion of this loss within the scheme reports an improvement with respect to the base figure of 80.86% obtained in the previous section given that all weight cases improve that result. However, it can be also noticed that the precise tuning of this parameter has a limited influence on the overall performance of the scheme since the difference between the worst and best cases, in absolute accuracy terms, is around 1%. This slight improvement is clearly due to the fact that, while the CIFAR-10 and USPS corpora increase their recognition rate, the performance with the other two datasets is not remarkably altered. For the rest of the experiments, this weight parameter is set to 𝜃 = 0.25 as it constitutes the global optimum observed in the experiments, i.e., an average classification rate of 82.12%.

Finally, the performance of the Triplet schemes presented in Section 3 is also assessed and compared to the Siamese approach. For a fair comparison, the same set of optimized hyperparameters from Section 6.1.1 is considered for all configurations. Note that the use of the multi-task learning approach is also evaluated by including the categorical classification as part of the loss. The results obtained are shown in Table 4, where Plain and Categorical loss depict the cases in which the multi-task loss is omitted and used, respectively.

Table 4 Accuracy comparison of the considered Siamese and Triplet architectures

As it may be checked, the results obtained by the Siamese methods are consistently better than the ones by the Triplet schemes for all the configurations and corpora considered. In this regard note that, despite the lower number of data combinations available to train the Siamese scheme compared to the Triplet one, the relative simplicity of the former architecture—Siamese methods only require comparing two elements at a time in contrast to the case of three elements of Triplet schemes—as well as their easier configuration process are most likely responsible for their superior performance.

Focusing on the Categorical loss case, it may be observed that only the Siamese scheme benefits from this additional term in the loss compared to the Plain one, whereas the Triplet architecture remains practically unaltered in both scenarios. A possible reason for this is that the strategy was originally devised for Siamese configurations, hence suggesting that further research is required to adequately integrate and exploit this additional information in the loss computation.

A significance analysis has been performed to statistically evaluate the results obtained. For that, we have considered the Wilcoxon signed-rank test [15] to assess whether the improvements observed in the previous analyses may be deemed significant. The individual samples of the test are retrieved by running the previous experiments 10 times for each corpus, model, and scenario. Note that the reduced training set \(\mathcal {T}_{R}\) in each of these 10 iterations is obtained by randomly sampling the initial \(\mathcal {T}\) partition and is then evaluated on the same static test set of the corresponding corpus. Figure 6 shows the results of the analyses related to the class loss weights (Fig. 6(a)) and the neural architecture (Fig. 6(b)) experiments.

Fig. 6 Statistical significance analysis of the improvement obtained for the class loss weights (Fig. 6a) and the neural architecture (Fig. 6b). Yellow and green dots denote whether the case in the row significantly outperforms that in the column with significance thresholds of ρ < 0.01 and ρ < 0.05, respectively

As may be checked, the individual statistical analyses confirm the previous statements: on the one hand, the class weight 𝜃 = 0.25 outperforms the rest of the studied weights, with 𝜃 = 0.5 being the second-best option as it outperforms all but the former case; on the other hand, it is also confirmed that the Siamese architecture with the Categorical loss significantly improves over the rest of the cases. Hence, out of this experimental comparison, it can be concluded that the best overall result is achieved by the Siamese scheme with the categorical loss contribution weighted with 𝜃 = 0.25, this particular configuration being the one used for the rest of the experiments.

6.2 Transfer learning

This section studies the possible improvement that the use of transfer learning may report to the overall classification rate. The premise, in this case, is to assess the performance of the embedded representation obtained with the CNN models trained with the Siamese configurations using alternative shallow classifiers.

In this regard, three different classification methods are considered for performing this transfer learning evaluation (a brief usage sketch is provided after the list):

  • k-Nearest Neighbor (k NN) [17]: This algorithm classifies a given query Iq by retrieving the most common label among the k closest elements to it given a certain dissimilarity metric. For these experiments, the Euclidean distance is used as dissimilarity metric and the influence of parameter k is assessed in the range \(k\in \left [1,25\right ]\).

  • Support Vector Machine (SVM) [48]: This method maps the initial data to a higher-dimensional space with a given kernel function and then learns a hyperplane that separates the different classes at issue. Three different kernel functions are considered in the experimentation: a Linear function, a Polynomial one, and a Radial Basis Function (RBF), with a learning cost c ∈ [1,9] (estimated in preliminary experiments).

  • Random Forest (RaF) [5]: This approach builds an ensemble classifier based on single decision trees trained with random subsets of the input data and takes as final output the combination of the individual decisions of each tree. The number of trees was evaluated in the range t ∈ [10,500].
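As a usage sketch of this transfer learning stage, the following scikit-learn snippet fits the three classifiers on the Neural Codes extracted by the Siamese-trained CNN; the variables `encoder`, `train_images`, `train_labels`, `test_images`, and `test_labels` are assumptions of the example, and the shown hyperparameter values are merely indicative:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Neural Codes of the reduced train set and of the test set (extracted by the trained CNN)
train_nc = encoder.predict(train_images)
test_nc = encoder.predict(test_images)

classifiers = {
    "kNN (k=15)": KNeighborsClassifier(n_neighbors=15, metric="euclidean"),
    "SVM (RBF)": SVC(kernel="rbf", C=1.0),
    "RaF (t=100)": RandomForestClassifier(n_estimators=100),
}
for name, clf in classifiers.items():
    clf.fit(train_nc, train_labels)
    print(name, clf.score(test_nc, test_labels))  # classification accuracy
```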

In addition to these transfer learning approaches, two baseline schemes are considered to comparatively assess the results: the Siamese scheme with the categorical loss contribution obtained in the previous section, and the use of the neural convolutional architectures described in Table 2 trained as stand-alone classifiers. In the latter case, a fully-connected layer is added with \(|\mathcal {C}|\) neurons corresponding to the number of categories and a softmax activation, without considering any Siamese or Triplet architecture for their convergence. The results obtained are reported in Table 5.

Table 5 Classification accuracy of the different transfer learning proposals

A first point to remark is that the stand-alone CNN achieves the lowest classification rate with an average value of 77.56%. This constitutes a rather expected result since such an approach is not meant to work in Few-Shot Learning scenarios as it requires a large amount of data to properly converge.

The SNN case remarkably improves upon the stand-alone CNN architecture, reaching an average classification rate of 82.12%. Note that, since the only difference between this case and the previous one is the use of the Siamese architecture, the figures obtained support the fact that this particular training procedure palliates the limitations of the stand-alone CNN approach when considering scarce amounts of data.

Regarding the transfer learning schemes, it can be checked that, with the sole exception of the Fashion corpus, all proposed configurations improve upon the baseline models. More precisely, the k NN classifier is the scheme that achieves the best individual scores in three of the studied corpora, only tying with the SVM in the CIFAR-10 set. Moreover, the best average results are achieved by two particular configurations: the k NN classifier with k = 15 and the SVM with the RBF kernel.

As a last note, the Wilcoxon signed-rank test has been contemplated to assess the statistical significance of the results obtained. As in the previous section, the individual samples used in the test are retrieved by running the presented experiments 10 times for each corpus, model, and scenario—with a distinct random subsampling of the initial training corpus in each iteration—with the results shown in Fig. 7.

Fig. 7 Statistical significance analysis of the improvement achieved considering the use of transfer learning procedures. Yellow and green dots denote whether the case in the row significantly outperforms that in the column with significance thresholds of ρ < 0.01 and ρ < 0.05, respectively

According to this analysis, it may be checked that all transfer learning approaches significantly outperform both baselines considered, i.e., the CNN and SNN approaches. Furthermore, it is confirmed that the k NN—when set to k = 15—together with the SVM—with the RBF kernel—stand as the most competitive alternatives as they outperform the rest of the techniques.

For all the above, the presented results prove the validity and usefulness of the transfer learning approach in terms of improving the classification rate of the base schemes considered. Besides, it must also be pointed out that an additional advantage of this paradigm is that it extends the use of these neural schemes to tasks other than classification such as, for instance, search and indexing with the k NN algorithm [11, 18].

6.3 Test data augmentation for ensemble-based learning

Having assessed the validity of the transfer learning stage in the overall classification, this section evaluates the use of test data augmentation as a means of mimicking ensemble-based learning with the goal of improving the recognition rate. The idea, in this case, is to obtain different versions of a given query element Iq by applying standard image data augmentation procedures, classifying them individually, and then retrieving the estimated class \(\hat {c}_{q}\) as the mode of the individual predictions.

Figure 8 shows the results of the test augmentation method for the best performing transfer learning configurations from the previous section—denoted as k NN, RaF, and SVM—as well as the SNN architecture. Additionally, the case in which k NN is directly applied to the embedded representations without any initial train data augmentation procedure is included for comparative purposes. The stand-alone CNN configuration is omitted in this comparison due to its limited performance compared to the rest of the alternatives. For all cases, the test data augmentation procedures are the same as the ones described in Section 6.1.1 when studying the effect of train data augmentation.

Fig. 8 Evaluation of test data augmentation. The x-axis represents the number of generated images from an initial one

As it may be checked, the ensemble learning process by the test data augmentation generally leads to an improvement in the classification results. The sole exception to this assertion is the case when going from the initial case of no test data augmentation (x = 1) to the case of only adding a single additional image (x = 2). This effect is most likely due to the fact that the distortions introduced severely alter the input image, thus confusing the recognition model. The further inclusion of test elements not only palliates this issue but also, as commented, reports an increase in the accuracy until a certain point in which this improvement stagnates.

In general, the best results are obtained in the transfer learning scenarios. Particularly, the k NN classifier reports the best performance, with the SVM and RaF strategies achieving slightly worse recognition rates. In addition to this boost in performance, and as aforementioned, transfer learning extends the use of these embedded representations to other tasks besides classification as, for instance, search and indexing.

Regarding the case of the Siamese topology without transfer learning (the SNN case), it can be observed that the test augmentation strategy also reports a slight improvement in the overall performance of the scheme, which in this case is around 0.5%. This fact proves that, while test augmentation generally improves the recognition rates, the best performance is achieved when combining this mechanism with the transfer learning one.

In the case of the k NN-based transfer learning strategy with no train data augmentation, it can be observed that this ensemble learning strategy reports a remarkable improvement. However, the overall performance achieved is considerably lower than that of the schemes which consider train data augmentation. In this regard, it may be concluded that neither of these data augmentation strategies can substitute the other, as they prove to be complementary.

Overall, this ensemble learning strategy based on test data augmentation achieves an average improvement of 0.87% in terms of classification accuracy. The best overall result is obtained when considering the transfer learning approach with the k NN classifier and 10 test data augmentations, which achieves an accuracy of 84.07%. This value improves by 2.13% upon the k NN case with no train data augmentation and by 6.51% upon the stand-alone CNN network (see Table 5).

In addition to this experimentation, the Wilcoxon signed-rank test has been considered to assess the statistical significance of the results. As in the previous cases, the individual samples used in the test are retrieved by running the presented experiments 10 times for each corpus, model, and scenario. The results obtained for the case of 10 augmentations—the one in which the best improvement is achieved—are shown in Fig. 9.

Fig. 9 Statistical significance analysis of the improvement achieved considering the test data augmentation procedure with 10 augmentations. Green dots denote whether the case in the row significantly outperforms that in the column with a significance threshold of ρ < 0.05

The figures obtained state that the use of the test augmentation procedure significantly improves over the cases in which this process is not considered. The sole exception to this assertion is the standard SNN case, for which no statistical difference in performance is observed. In addition, it is also confirmed that the k NN method with test augmentation reports the best overall performance, as it significantly outperforms all other alternatives, both when considering and when disregarding the test augmentation procedure.

6.4 Analysis of the training set size

This section studies the influence of the training set size on the overall performance of the scheme. More precisely, the experiment compares the performance of the stand-alone CNN approach against the Siamese architectures by varying the number of train elements. Note that, in the case of Siamese schemes, both transfer learning strategies as well as the architecture directly meant for classification tasks are contemplated. The results of this experiment considering a train set size ranging from 1 to 500 samples per class are reported in Table 6.

Table 6 Evaluation of the influence of the train set size (samples per class) on the overall performance for each classification case considered

As can be observed, when considering rather small training sizes, the CNN scheme is not capable of converging. In these same situations, any of the cases based on a Siamese architecture proves to be quite competitive, reaching accuracy figures ranging from 69% to 78% with just 5 to 20 prototypes per class. After that point, all schemes converge properly, but the Siamese-based proposals still outperform the CNN ones.

Furthermore, it may be checked that the best overall results are always achieved by a transfer learning scheme, most commonly the k NN classifier and, occasionally, the SVM one. Note that these approaches achieve an improvement that ranges from 4% to 9% with respect to the CNN case, considering only the cases when the latter approach converges.

Finally, note that transfer learning also consistently improves the results achieved by the direct estimation of the class using the Siamese network, denoted as SNN in the table.

6.5 Discussion

Once the different experiments have been performed and individually analyzed, this section provides an additional discussion that summarizes the general insights obtained.

The first point to comment on is the clear advantage of the Siamese architectures when addressing Few-Shot Learning scenarios with respect to standard neural approaches. While this claim is not a novelty of our work, the presented experimentation validates, in terms of statistical significance, this statement for a variety of corpora and configurations. Furthermore, it is also shown that the use of the categorical loss as an additional input to the contrastive loss of the Siamese schemes significantly boosts the classification rate of the scheme.

In relation to this point, it must be highlighted that the optimization of the different hyperparameters of the Siamese scheme plays a key role in the success of the task. As it has been assessed, parameters such as the use of train data augmentation, the particular definition of the contrastive loss, the weights optimizer, or the ratio of positive and negative input pairs remarkably affect the performance of the system.

The use of transfer learning in this type of Siamese architecture, whose proposal and assessment constitutes one of the novelties of the work, has been validated as a means of improving the overall performance of the recognition process. According to our experiments, when properly configured, the alternative classifiers significantly improve the results obtained by the base Siamese architecture considered. In addition, it must be noted that this transfer learning proposal also allows the application of Siamese-based schemes to other types of tasks such as, for instance, search and indexing.

Regarding test augmentation as a means of performing ensemble learning, the results obtained in the experiments prove the validity of the proposal for improving the classification rates. Besides, this method reports its best performance when it is jointly used with train data augmentation since, as has been shown, the isolated use of either process does not report such a performance boost.

Finally, it may be highlighted that the introduced proposal improves the results for all individual corpora considered within their respective performance ranges. Among them, the CIFAR-10 set reports the lowest classification rates, possibly due to the inherent complexity of the collection compared to the other corpora: this set is the only one comprising color images and it has the largest image size among all of them.

7 Conclusions

Image Classification and Recognition, as one of the main tasks within the Computer Vision research field, is generally tackled with great success by resorting to the Deep Learning (DL) paradigm. Nevertheless, recognition systems based on DL generally require large amounts of data for properly converging. Hence, in cases in which available data is scarce, it is necessary to resort to Few-Shot Learning (FSL) methods, the research field devoted to addressing learning-based tasks in scenarios with limited amounts of data.

Siamese Neural Networks (SNNs) constitute one of the most representative solutions in the FSL field. These neural schemes consist of two identical Deep Neural Networks (DNNs) with shared weights which work on two different inputs for obtaining their corresponding embedded representations and eventually measuring their similarity degree. The fact that the internal DNNs share their weights, along with the pair-wise arrangement of the training data, allows the networks to converge with less data than without this Siamese scheme.

Within this context, this work assesses and expands the representation capabilities of SNNs for image recognition in FSL scenarios. More precisely, this study assesses the possibility of improving the classification results by considering the learned embedded representations together with alternative shallow learning algorithms in a transfer learning fashion. Besides, the inclusion of test data augmentation processes for addressing the recognition process as an ensemble-based classifier is also evaluated. The results obtained with different multi-class image corpora, Siamese topologies, neural configurations, and learning methods report that the different extensions studied further improve the classification rates of base SNNs. More precisely, the best configuration, which is an SNN architecture trained with a categorical loss together with a k NN classifier and the aforementioned ensemble learning, achieves classification rates ranging from 69% to 78% with just 5 to 20 prototypes per class, whereas the commonly contemplated Convolutional Neural Network (CNN) approach is unable to converge in such conditions. Once the CNN baseline considered converges with a sufficient amount of data, the proposed scheme still improves its classification accuracy by figures ranging from 4% to 9%.

Future work aims to tackle some of the inherent limitations of this proposal. A first point to address is extending this work to non-image FSL tasks such as, for instance, audio-based similarity and sound tagging. In this regard, a promising research avenue is that of integrating base neural schemes other than the purely Convolutional one, such as Recurrent or hybrid Convolutional-Recurrent Neural Networks for sequential data, in an SNN architecture. Furthermore, we consider that the proposal would also benefit from extending the study to other FSL-based schemes such as Matching Networks or Prototypical Networks and from exploring metric learning techniques for further exploiting the obtained embedded space. Finally, additional insights may be gathered by studying the possible improvement that this proposal would report in non-FSL scenarios.