1 Introduction

Fig. 1 Top: source network pre-trained on ImageNet. Bottom: proposed multi-phase fine-tuning approach. The layers to be fine-tuned in the target network are adapted over several phases starting from the top layer.

Different hand gestures, facial expressions, and body postures form grammatically complete, highly structured sign languages. Sign language recognition (SLR) can be categorized into three groups: recognition of alphabets [7, 24, 25], isolated words [13], or continuous sentences. In continuous SLR, the input is a video containing a sequence of gestures and the desired output is a sequence of words complying with the grammar of that sign language [6, 11, 12].

Research on SLR has shifted from hand-crafted features to deep learning methods; the most prominent works include [5, 6, 11, 12]. Recent methods even involve domain adaptation [19] and sign language transformers [3]. The majority of the aforementioned approaches rely on transfer learning by including some pre-trained convolutional neural network (CNN), either for classifying individual frames or as a feature extractor. Transfer learning uses the weights learned while solving a different, yet related, task to initialize the weights of the network solving a new task. We refer to the former network as the source network and to the latter as the target network.

Transfer learning involves making two important decisions: First, how many layers to transfer? Second, whether to freeze the transferred layers or to fine-tune them? [4, 5, 23, 26]. While no clear theory exists on how to make these choices, they depend on both the size of the target dataset and its similarity to the source data [26]. Answering these questions becomes even more complicated when the source and target tasks differ strongly.

Existing work on transfer learning for CNNs can be divided into two groups: (a) using a pre-trained source CNN merely as a feature extractor [16, 20]; (b) transferring some layers’ weights to a target network, randomly initializing the weights of the non-transferred layers, and fine-tuning the target network on the target domain [23, 26]. Fine-tuning can be done by learning either (1) the weights in the non-transferred layers only [14], or (2) also the weights in some transferred layers, where usually the top-k layers are trained at once. We refer to the second fine-tuning method as single-phase fine-tuning. Research shows a clear advantage for fine-tuning weights in transferred layers in contrast to freezing them [15, 26]. However, if the target data is inherently different from the source data, these fine-tuning methods do not always work [1, 2, 20, 26].

In this paper, we propose multi-phase fine-tuning for tuning deep networks from everyday object recognition to SLR. The concept is depicted in Fig. 1. It extends the successful idea of transfer learning by fine-tuning the network’s weights in several phases. In the first phase, only a few topmost layers are fine-tuned. In successive phases, more layers are added and jointly fine-tuned with the layers from previous phases. We evaluate our proposed approach using GoogLeNet [21] on frame-level classification in continuous sign language videos, a considerably different domain from ImageNet [18] (see Fig. 1). CNNs applied to individual frames are a key step in several SLR systems [5, 6, 11, 12, 17]. Our results show that multi-phase fine-tuning considerably improves task performance and that it converges faster with fewer learning epochs compared to the earlier single-phase fine-tuning.

Recent SLR methods rely on deep learning. Most employ some pre-trained network; however, little research has investigated how best to fine-tune these pre-trained networks to the new task, especially since it is usually quite different from the source task (e.g., object recognition when pre-trained on ImageNet). In this work, we focus on how to fine-tune a network from the task of object recognition to sign language recognition. We believe that fine-tuning a pre-trained network phase-wise allows the top layers to adapt to the new task while keeping the shallower layers unchanged, thereby improving the generalization capabilities of the network.

Our contribution is threefold. (1) We introduce a multi-phase fine-tuning strategy that improves accuracy and additionally allows faster training. We extend [26] by training the unfrozen layers step-wise as opposed to fine-tuning them all at once. (2) We demonstrate the success of multi-phase fine-tuning for transfer learning between two very different domains: from everyday object recognition to SLR. (3) We present a CNN-based approach for frame-based SLR which can be valuable for the sign language community.

The remainder of the paper is structured as follows: in Sect. 2, we explain our proposed fine-tuning approach in detail. In Sects. 3 and 4 we describe our experimental setup and present our results. Section 5 concludes the paper.

Fig. 2 Top: single-phase fine-tuning unlocks and trains weights in all of the top-k (here \(k = 3\)) layers of a CNN simultaneously. Our multi-phase fine-tuning (bottom) trains the weights in the top-k layers in several phases, successively adding more layers

2 Methods

In this section, we give an overview of our CNN training, standard fine-tuning methods, and illustrate our proposed multi-phase fine-tuning approach.

2.1 CNN Training

A CNN function maps the input x to a predicted label \(\hat{y}=f(x; w)\), given trainable weights w. In supervised learning, CNNs are trained using stochastic gradient descent (SGD), given a training data set \({D} = \{(x^i, y^i)\}^{N}_{i=1}\) with N inputs \(x^i\) and labels \(y^i\). SGD alternates between feedforward and backpropagation steps using mini-batches of m examples from the training set. A minibatch is a subset \(\{(x^i, y^i)\}_{i\in I} \subset D\), where \(I\subset \{1, \ldots , N\}\) such that \(|I|=m\).

In the feedforward step, a prediction \(\hat{y}^i\) is computed for each sample \(x^i\) in the mini-batch given the current weights w. A scalar loss between the true labels and predictions is calculated by

$$\begin{aligned} \mathcal {L}(w) = \frac{1}{m} \sum _{i\in I}\mathcal {L}_i(\hat{y}^i, y^i) = \frac{1}{m} \sum _{i\in I}\mathcal {L}_i(f(x^{i}; w), y^{i}), \end{aligned}$$
(1)

where \(\mathcal {L}_i\) is the per-sample loss function, e.g., cross entropy for classification.

For backpropagation, the gradient of \(\mathcal {L}\) with respect to the weights w is evaluated first. We apply SGD with momentum; the initial weights \(w_0\) are drawn randomly. The velocity \(\epsilon _0\), representing the past gradients, is initialized to zero. At training iteration \(t\ge 1\),

$$\begin{aligned} \begin{aligned} g_t&= \frac{1}{m} \nabla _w \sum _{i\in I} \mathcal {L}_i(f(x^{i}; w_{t-1}), y^{i})\\ \alpha _t&= (1-\psi ) \alpha _{t-1}\\ \epsilon _{t}&= \gamma \epsilon _{t-1} - \alpha _t g_t\\ w_{t}&= w_{t-1} + \epsilon _{t}, \end{aligned} \end{aligned}$$
(2)

where \(g_t\) is the current gradient estimate and \(\epsilon _t\) is the step for modifying the weights; it depends on the former gradients, weighted by the momentum \(\gamma\), and on the current gradient, weighted by the learning rate \(\alpha _t\), which decays at rate \(\psi\).
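For concreteness, Eq. (2) can be written as a short update function. The following is a minimal NumPy sketch, not the implementation used in our experiments; the mini-batch gradient \(g_t\) is assumed to be computed elsewhere, e.g., by a deep learning framework's automatic differentiation.

import numpy as np

def sgd_momentum_step(w, v, alpha, g, gamma, psi):
    # One iteration of Eq. (2). w: weights, v: velocity (epsilon in the text),
    # alpha: current learning rate, g: mini-batch gradient estimate g_t.
    alpha = (1.0 - psi) * alpha    # learning-rate decay: alpha_t = (1 - psi) * alpha_{t-1}
    v = gamma * v - alpha * g      # momentum-weighted past step minus scaled current gradient
    w = w + v                      # weight update: w_t = w_{t-1} + epsilon_t
    return w, v, alpha

The hyperparameter values used in our experiments are given in Sect. 3.2.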

Fig. 3 Sample image sequence from the RWTH-PHOENIX-Weather dataset [8]. It contains video sequences from German broadcast news along with their sentence annotations (in German). The authors of [12] have automatically aligned labels to each frame in the video sequence. Each word is further split into three word-part labels; an example is shown for the word “Temperatur” (English: temperature)

2.2 Single-Phase Fine-Tuning

The initial weights \(w_0\) for a target network, apart from the last classifying layer, are initialized to pre-trained values from a source network. The classifying layer is modified to have as many neurons as the number of classes in the target task and is initialized with random weights. Weights of the target network are then fine-tuned, via Eq. (2), using a training dataset from the target domain.

A key question is whether to freeze transferred weights or fine-tune them to the new task. Freezing weights is often referred to as “off-the-shelf” transfer learning [20]; only the weights in the last classifier layer are updated.

If fine-tuning is applied to other layers as well, typically the k topmost layers are fine-tuned while keeping the other layers’ weights at their source network values [23, 26]. We refer to this approach as single-phase fine-tuning. For a network with a total of L layers, we use the notation top-k layers to refer to updated weights in layers \((L-k+1, \ldots , L)\). Weights in layers \((1, \ldots , L-k)\) remain frozen. Single-phase fine-tuning of the top-3 layers is illustrated in Fig. 2 (top).
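To make the two variants concrete, the snippet below sketches them in Python using torchvision's Inception-v3 as a stand-in for the source network. It is purely illustrative (the paper does not prescribe a framework, and API details vary across torchvision versions); num_classes matches the target task of Sect. 3.1, while k is a placeholder.

import torch.nn as nn
from torchvision import models

num_classes, k = 3693, 3                                  # k = 3 is an illustrative choice

model = models.inception_v3(weights="IMAGENET1K_V1")      # source network pre-trained on ImageNet
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new, randomly initialized classifier

# "Off-the-shelf" transfer: freeze all transferred weights, train only the classifier.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc")

# Single-phase fine-tuning: additionally unfreeze the weights of the top-k inception modules.
top_modules = [n for n, _ in model.named_children() if n.startswith("Mixed")][-k:]
for name, p in model.named_parameters():
    if any(name.startswith(m + ".") for m in top_modules):
        p.requires_grad = True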

2.3 Multi-phase Fine-Tuning

We propose a multi-phase fine-tuning approach in which the top-k layers are fine-tuned sequentially with a step size s in \(k/s\) phases, until all of the k layers have been fine-tuned. In the first phase we fine-tune only the top-s layers. In each of the following phases, we add s more layers to be fine-tuned. At each phase, training continues until a pre-specified termination criterion is reached, e.g., the maximum number of training epochs or saturation of the validation loss.

For example, fine-tuning top-k layers with a step-size \(s=1\) for \(k=3\) has three phases: P1, P2, and P3 (see Fig. 2):

  1. P1

    Start by fine-tuning one layer, e.g., only the topmost layer of the network.

  2. P2

    Include one more layer, for a total of 2, and fine-tune the top-2 layers.

  3. P3

    Add one more layer, for a total of 3, and fine-tune the top-3 layers.

We remark that if \(s=k\), multi-phase fine-tuning is equivalent to single-phase fine-tuning of top-k layers.
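The phase schedule itself can be sketched in a few lines of Python. In the sketch below, which assumes that s divides k, unfreeze_top and fit_one_phase are hypothetical helpers: the former sets the trainable weights as in the snippet of Sect. 2.2, the latter trains the currently unfrozen weights until the termination criterion is met.

def multi_phase_finetune(model, k, s, unfreeze_top, fit_one_phase):
    # Phases fine-tune the top-s, top-2s, ..., top-k layers (assumes s divides k).
    for n in range(s, k + 1, s):
        unfreeze_top(model, n)     # layers unfrozen in earlier phases remain trainable
        fit_one_phase(model)       # train until early stopping or the epoch limit is reached
    return model

With s = k the loop runs a single phase and reduces to standard single-phase fine-tuning.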

3 Experimental Setup

In Sect. 3.1 we describe the dataset and the evaluation metrics applied in this work. Section 3.2 covers the implementation details.

3.1 Dataset and Metrics

We use RWTH-PHOENIX-Weather Multisigner 2014 [8, 10], one of the largest publicly available annotated datasets in the sign language domain. It consists of 6841 videos of continuous signing in German sign language, each video labelled with an output sentence as a sequence of words (Fig. 3). Note that the resulting sequence of words is not a translation into spoken language, but rather a literal translation of the signs.

We solve a classification problem in which the input is a single frame and the output label is taken from the frame-to-label alignments provided by [12]. Each word is split into three parts, each forming one label, as depicted in Fig. 3, resulting in 3693 classes for 500,000 frames. We reserve 10% of the images for validation. Throughout our experiments, we record the top-1 and top-5 classification accuracies.
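For reference, top-1 and top-5 accuracies over a batch of frame-level predictions can be computed from the class scores as in the following generic PyTorch sketch (not tied to our implementation):

import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    # logits: (batch, num_classes) class scores; labels: (batch,) integer class indices.
    top = logits.topk(max(ks), dim=1).indices     # highest-scoring classes per frame
    hits = top.eq(labels.unsqueeze(1))            # True where a top prediction matches the label
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}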

3.2 Implementation Details

CNN Architecture: We opt for GoogLeNet [21] with Inception V3 [22] pre-trained on ImageNet as the source network. It is the most commonly used network in recent SLR [6, 11, 12]. GoogLeNet consists of several (precisely eight) inception modules, so we investigate the effect of fine-tuning a varying number of such modules instead of layers. We will therefore refer to modules rather than layers in our notation (top-k modules instead of top-k layers).

Fine-Tuning Setup: We fine-tune the top-k modules of the network for \(k = 1, 2, 3, \ldots , 8\). We compare the accuracy of our proposed multi-phase fine-tuning to traditional single-phase fine-tuning. For multi-phase fine-tuning, we report results for step sizes \(s=1, 2, 3\) for all values of k. In all cases, the fully-connected layers are trained from scratch. We note that fine-tuning the top-8 inception modules is equivalent to fine-tuning the entire network.

Training Hyperparameters: We apply SGD with Nesterov momentum \(\gamma = 0.9\), a learning rate \(\alpha = 0.01\) that decays with rate \(\psi = e^{-6}\), and a batch size of \(m = 32\). We use a categorical cross-entropy loss. We adopt an early-stopping approach, where training is terminated if the validation loss does not improve for 3 consecutive epochs. Random weights are initialized using the Xavier normal initializer [9].
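A possible realization of this configuration in PyTorch is sketched below; it is illustrative only (the original experiments are not tied to this framework). Here, model is the partially frozen network of Sects. 2.2 and 2.3, and max_epochs, train_one_epoch, and validation_loss are hypothetical placeholders.

import torch
import torch.nn as nn

nn.init.xavier_normal_(model.fc.weight)                   # Xavier normal init of the random classifier weights [9]
nn.init.zeros_(model.fc.bias)

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),   # update only the unfrozen weights
    lr=0.01, momentum=0.9, nesterov=True)
loss_fn = nn.CrossEntropyLoss()                           # categorical cross-entropy
# The learning-rate decay of Eq. (2) could be added, e.g., via torch.optim.lr_scheduler.ExponentialLR.

best_loss, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(max_epochs):                           # max_epochs: placeholder
    train_one_epoch(model, optimizer, loss_fn)            # hypothetical training helper
    val_loss = validation_loss(model, loss_fn)            # hypothetical validation helper
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                        # early stopping after 3 stagnant epochs
            break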

4 Results and Analysis

In this section we report results of our baseline, single-phase, and multi-phase fine-tuning experiments, in addition to a hyperparameter exploration for multi-phase fine-tuning.

Table 1 Top-1 and top-5 classification accuracies of baseline methods
Table 2 Top-1 and top-5 accuracies when fine-tuning the top-k modules of GoogLeNet either in a single-phase or multiple phases with a step size \(s=1\)

4.1 Baseline Experiments

Frame-based recognition is a submodule of existing SLR systems [5, 6, 11, 12]; however, it is not addressed separately. Therefore, we assess the base difficulty of the task with three baseline methods and report top-1 and top-5 accuracies in Table 1.

To see how ImageNet features perform on the new task, we apply GoogLeNet pre-trained on ImageNet as a feature extractor and train a fully-connected classifying layer on top. We also try two non-deep-learning methods to assess the difficulty of the problem. (1) We extract SIFT image descriptors, normalize them, and vector-quantize them using k-means into an 800-dimensional feature vector; a random forest classifier with eight trees and a maximum depth of 30 is trained for classification. (2) We extract a HOG feature vector for each image and train a logistic regression classifier via SGD.
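As an illustration, the HOG baseline can be assembled from standard libraries roughly as follows. This is a sketch: the HOG parameters, the loss name ("log" in older scikit-learn versions), and the variable names train_frames, train_labels, val_frames, and val_labels are placeholders, not our exact settings.

import numpy as np
from skimage.feature import hog
from sklearn.linear_model import SGDClassifier

def hog_features(frames):
    # frames: iterable of grayscale images; HOG parameters are illustrative.
    return np.stack([hog(f, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for f in frames])

clf = SGDClassifier(loss="log_loss")                  # logistic regression trained with SGD
clf.fit(hog_features(train_frames), train_labels)
top1 = clf.score(hog_features(val_frames), val_labels)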

HOG with logistic regression performs best, reaching a top-1 accuracy of 16.9%. The fact that it outperforms GoogLeNet used as a feature extractor suggests that the learned ImageNet features do not transfer well to the new target domain.

4.2 Single-Phase vs. Multi-phase Fine-Tuning

We compare the proposed multi-phase fine-tuning with the standard single-phase fine-tuning. Table 2 shows the classification accuracies for both methods with step size \(s=1\). We note that fine-tuning only the topmost module (\(k=1\)) already outperforms our baseline results from Table 1. For every number k of fine-tuned modules, we observe that multi-phase fine-tuning consistently reaches a higher accuracy than fine-tuning the same modules in a single phase.

Fig. 4 Left: Top-1 accuracy as a function of the number of modules k fine-tuned for multi-phase fine-tuning with step size \(s=1\) and single-phase fine-tuning. Right: Number of training epochs. Note that for \(k=s=1\), the two fine-tuning approaches are equivalent

Figure 4 (left) visualizes the top-1 accuracy as a function of the number of modules fine-tuned. We note that with multi-phase fine-tuning the accuracy improves consistently as more modules are included. For single-phase fine-tuning, accuracy starts to degrade for \(k>4\).

Moreover, multi-phase fine-tuning requires fewer training epochs, see Fig. 4 (right). Training the network in multiple phases gives top layers the chance to adapt to the new task while lower layers remain unchanged. Our results show that this property of multi-phase fine-tuning improves the generalization capability of the network. Fine-tuning the weights of pre-trained layers should not begin while the random weights of the newly added fully-connected layers are yet to be trained; we hypothesize that the pre-trained layers’ weights may otherwise prematurely start to adapt to these random weights.

4.3 Different Step-Sizes

Fig. 5 Top: Top-1 accuracy (left) and total number of fine-tuning epochs (right) for single- and multi-phase fine-tuning with step-size \(s=2\). Bottom: Top-1 accuracy (left) and total number of fine-tuning epochs (right) for single- and multi-phase fine-tuning with step-size \(s=3\). Note: for \(k=s=2\) (top) and \(k=s=3\) (bottom), both approaches are equivalent

The step size s controls how many new modules are added for fine-tuning in each phase. We varied the step size to observe how it affects fine-tuning performance. Top-1 accuracies for \(k=1, \ldots , 8\) modules fine-tuned with step-sizes \(s=2\) and \(s=3\) are presented in Fig. 5 (left). We note that multi-phase fine-tuning (with \(s=2\) and \(s=3\)) still outperforms single-phase fine-tuning. The number of required training epochs, shown in Fig. 5 (right), indicates that multi-phase fine-tuning also converges faster in this case, although the difference is less pronounced than for step-size \(s=1\). Applying a larger step-size of \(s=2\) or \(s=3\) does not improve overall performance compared to \(s=1\). Since \(k=6\) is the only value that is comparable for step-sizes \(s=1, 2\), and 3, we compare the top-1 accuracies achieved by fine-tuning the top-6 modules using these step-sizes in Table 3. The smallest step-size achieves the best performance with the fewest training epochs.

Table 3 Effect of step-size s for fine-tuning top-6 modules by multi-phase fine-tuning
Fig. 6 Validation loss as a function of the number of training epochs when fine-tuning top-k modules of GoogLeNet using single-phase fine-tuning and multi-phase fine-tuning with step-size \(s=1\). Training was terminated using an early-stopping approach. Note that epoch 1 for all experiments is the first training epoch after training the classifying fully-connected layers

4.4 Comparison of Training Progress

We examined the training progress by recording the validation loss as a function of the number of training epochs for the best-performing multi-phase fine-tuning with step-size \(s=1\) and for single-phase fine-tuning. The results are shown in Fig. 6 for fine-tuning the top-k modules with \(k=3, 4, \ldots , 8\).

We observe that for most values of k, applying single-phase fine-tuning results in a sharp increase in the validation loss before it starts to decrease. In contrast, multi-phase fine-tuning results in a consistently decreasing validation loss for all values of k. Although the same parameters are eventually trained by both approaches, we believe that dividing the training into multiple phases is beneficial because it allows the layer weights to change more smoothly.

For example, consider the top-3 layers, indexed by \((L-2)\), \((L-1)\), and L, where L is the final layer of the network. By unfreezing all the layers at once, weights in layer \((L-2)\) can start to prematurely adapt to those in layers \((L-1)\) and L, which may still be far from the values they eventually converge to. Including more layers over several phases smooths abrupt changes in layer weights.

Results suggest that multi-phase fine-tuning can also provide an experimental way to decide how many layers should be fine-tuned. We can add more layers in phases and monitor the validation loss. As long as performance improvements are observed, we can continue fine-tuning more layers.
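This heuristic can be expressed as a small variation of the phase loop from Sect. 2.3; as before, unfreeze_top, fit_one_phase, and val_metric are hypothetical helpers, and the sketch is illustrative rather than a prescribed procedure.

def finetune_while_improving(model, s, max_layers, unfreeze_top, fit_one_phase, val_metric):
    # Keep adding s layers per phase as long as the validation metric improves.
    best = val_metric(model)
    for n in range(s, max_layers + 1, s):
        unfreeze_top(model, n)
        fit_one_phase(model)
        score = val_metric(model)
        if score <= best:          # no further improvement: stop unfreezing deeper layers
            break
        best = score
    return model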

5 Conclusion

A key question in transfer learning is how many layers to fine-tune to take advantage of the generality of lower layers’ features, while allowing the network to fit to the target task. We proposed multi-phase fine-tuning, starting by only fine-tuning the weights in the last fully-connected layer, and adding more layers in subsequent phases. We applied it to transfer learning from the domain of object recognition to SLR using one of the most commonly used network architectures, GoogLeNet. Results show that compared to earlier fine-tuning approaches, multi-phase fine-tuning has a higher classification accuracy and requires less training time for this pair of domains. In addition, it provides a constructive approach to decide how many layers’ weights to fine-tune. Future work includes extending the work presented here into a complete continuous sign language recognition system working on sequences of gestures. We also aim to investigate the applicability of multi-phase fine-tuning in other domains beyond sign language recognition.