1 Introduction

The challenges associated with acquiring large-scale datasets and corresponding labels have spurred research efforts in the field of few-shot learning. Drawing inspiration from the achievements in few-shot learning for image understanding tasks (Li et al., 2019; Chowdhury et al., 2021; Xu et al., 2021; Kang et al., 2019; Fan et al., 2020; Wang et al., 2019, 2020; Liu et al., 2020a, b), researchers have increasingly directed their attention towards the few-shot action recognition domain (Kliper-Gross et al., 2011). The development of effective techniques for this task holds the potential to reduce the expenses associated with video annotation substantially (Grauman et al., 2021) and facilitate real-world applications in scenarios where acquiring labels is particularly challenging (Huang et al., 2018, 2020).

Previous studies on few-shot action recognition usually determine the category of a query video by its similarity to the few labeled support videos. Following ProtoNet (Snell et al., 2017), most of these studies (Zhu & Yang, 2020; Zhu et al., 2021a; Fu et al., 2020; Wang et al., 2022, 2023a) first learn a prototype representation for each video and then compute video similarity based on the similarity of the prototypes. Recent advances in the field have emphasized the need to incorporate temporal dependencies within videos, e.g., temporal ordering and relations. Consequently, approaches have emerged that construct sub-sequence prototypes from different segments of the videos and compute video similarity by matching the prototypes derived from the support and query videos (Cao et al., 2020; Perrett et al., 2021; Wu et al., 2022; Nguyen et al., 2022; Zheng et al., 2022).

Previous methods, despite incorporating temporal dependencies, exhibit certain limitations. First, because they do not consider spatial information, these methods cannot fully leverage the spatiotemporal relations in videos to discriminate between actions such as “put A on B” and “put A beside B”. Since these actions differ primarily in the relative spatial positioning of objects, the absence of spatial information hinders accurate discrimination. Second, the use of fixed temporal locations for generating sub-sequence prototypes restricts the ability to handle actions that occur at different timestamps and with varying speeds. As a result, videos whose actions have different temporal dynamics cannot be adequately compared.

To overcome these limitations and enhance the robustness of few-shot action recognition, we aim to investigate how to: (1) Enhance the generation of prototypes such that they effectively capture and encode the spatiotemporal relations present in the videos. By developing improved methods for generating prototypes, we seek representations that robustly capture the intricate interplay between spatial and temporal information, enabling more accurate discrimination of actions such as “put A on B” versus “put A beside B” that rely on relative object positions. (2) Enable prototypes that encode the same aspect of an action to be compared. We pursue two directions to achieve this: generating prototypes that account for actions with different temporal dynamics, and matching prototypes flexibly across timestamps. Instead of exhaustively matching all possible combinations of the prototypes (Perrett et al., 2021), we explore approaches that effectively match prototypes between two videos.

Regarding the first point, it is possible to use object features extracted from object bounding boxes (He et al., 2017; Huang et al., 2022). While effective, the use of object features in the few-shot action recognition task introduces additional information and computation, which may not be practical when resources are limited (Huang et al., 2022). In this work, instead of object features, we propose a multi-level multi-relation encoder (MM-encoder) that takes as input image patches at different scales to encode the spatiotemporal relations. The use of image patches removes the need for additional object detectors and provides a more compact representation that can be efficiently processed. Furthermore, the pyramid structure allows the model to exploit different spatial resolutions, capturing information at multiple scales. This enables us to generate prototypes that describe the videos while considering the interactions and contextual dependencies between patches across frames.

To address the second point, using the features from the first step, we propose to generate three types of prototypes, each with its own matching strategy. (1) The action-centered prototypes consider all frames within the input video, achieved by leveraging the self-attention mechanism of Transformers (Vaswani et al., 2017). We employ fixed 1-to-1 matching and a diversity loss to encourage these prototypes to capture different aspects of the action. (2) We further propose timestamp-centered prototypes, where each prototype is encouraged to focus on specific timestamps of the video. We use bipartite matching to establish correspondences between the timestamp-centered prototypes of the support and query videos, together with an attention loss in the prototype learning. This matching scheme allows the comparison of actions that start at different timestamps and evolve with different durations and speeds, enabling a more comprehensive and robust similarity measurement. (3) Additionally, we propose summarized prototypes and introduce a shuffling mechanism that produces auxiliary summarized prototypes by rearranging the order of the timestamp-centered prototypes. By incorporating an inconsistency loss, we enhance the temporal specificity of the timestamp-centered prototypes while allowing the model to generate an additional summarization of the entire video for the computation of similarity.

Fig. 1

Concept of the proposed compound prototype matching. In this example, the two videos are of the same class “pretend to put something into something”. The action-centered prototypes (green) are encouraged to learn certain aspects of the action, such as the first occurrence of a hand, regardless of the timestamp. For example, the leftmost prototypes focus on the first and second frames of the support video, and the second and third frames of the query video. They do not attend to the same timestamps, but both represent the first occurrence of a hand. They are thus matched in a one-to-one manner. The timestamp-centered prototypes (blue) focus on specific timestamps of the video. To address the issue that actions happen at different temporal locations inside a video, the timestamp-centered prototypes can be flexibly matched via bipartite matching. As shown in the figure, the prototype that focuses on frames 6–7 of the support video is matched with the prototype of frames 5–6 of the query video. They both represent the frames where the person takes the object away from the container. The summarized prototypes (red) are generated from the timestamp-centered prototypes and are used to regularize the temporal consistency of the timestamp-centered prototypes while serving as an auxiliary summarization of the whole video (Color figure online)

Our method utilizes a compound of three groups of prototypes, which we refer to as compound prototypes, for the calculation of video similarity. By incorporating the action-centered, timestamp-centered, and summarized prototypes, our approach effectively captures various aspects of the action and accommodates variations in action duration and speed. Through extensive experimental evaluations on multiple benchmark datasets, our method consistently outperforms previous approaches, particularly in scenarios where only a single annotated example is accessible for each action (1-shot). This outcome highlights the effectiveness of utilizing compound prototypes for similarity measurement and further validates the strength of our approach in the context of few-shot action recognition.

A preliminary version of this paper was published in Huang et al. (2022). In this paper, we extend that work in three ways: (1) In addition to extracting object features, we propose to encode multi-level patch features with a multi-level multi-relation encoder. This eliminates the need for off-the-shelf object detectors, which contain a large number of additional parameters, and further improves the practicality of our approach, particularly when computational resources are limited. (2) We employ an additional summarized prototype in our compound prototype matching scheme. Together with a specifically designed inconsistency loss, this prototype enhances the temporal specificity of the timestamp-centered prototypes while enabling the model to generate more comprehensive prototypes for similarity computation. (3) We conduct more extensive experiments, including experiments on the newly proposed fine-grained dataset split of EPIC-Kitchens, new ablation studies, and additional analysis of models and losses. These enhancements and experimental extensions contribute to a more thorough investigation of our approach (Fig. 1).

To summarize, our key contributions include:

  • A novel method for few-shot action recognition based on generating and matching compound prototypes.

  • Our method achieves state-of-the-art performance on multiple benchmark datasets (Carreira & Zisserman, 2017; Goyal et al., 2017; Kuehne et al., 2011), outperforming previous methods by a large margin.

  • A detailed ablation study showing the usefulness of leveraging object information for few-shot action recognition and demonstrating how the different groups of prototypes encode the video from complementary perspectives.

2 Related Works

2.1 Few-Shot Image Classification

Broadly speaking, few-shot image classification methods fall into three categories. The first category comprises transfer-learning based methods (Dhillon et al., 2019; Qiao et al., 2018; Wang et al., 2020; Bateni et al., 2022), which leverage pre-training and fine-tuning techniques to enhance the performance of deep backbone networks in the context of few-shot learning. The second line of work focuses on rapidly learning an optimized classifier using limited training data (Andrychowicz et al., 2016; Gui et al., 2018; Ravi & Larochelle, 2017; Antoniou et al., 2019; Wei et al., 2019; Finn et al., 2017; Zhang et al., 2020a; Huang et al., 2020). These approaches aim to efficiently utilize the available labeled samples to achieve effective classification in a few-shot scenario. The third direction is centered on metric learning, whose primary objective is to learn feature embeddings that are more generalizable under a specific distance metric (Koch et al., 2015; Snell et al., 2017; Vinyals et al., 2016; Kang et al., 2021; Luo et al., 2022; Afrasiyabi et al., 2022; Yang et al., 2022). The key to metric-learning based methods is to generate robust representations of data under a certain metric, so that they generalize to novel categories with few labeled samples. In this work, we align with the research conducted in the metric learning domain, specifically addressing the more complex and challenging setting of few-shot video classification.

2.2 Few-Shot Action Recognition

Most methods for few-shot action recognition on videos (Thatipelli et al., 2021; Li et al., 2021; Patravali et al., 2021; Hong et al., 2021; Bishay et al., 2019; Zhu et al., 2021a, b; Mishra et al., 2018; Xu et al., 2018) fall into the metric-learning framework. Many works follow the scheme of ProtoNet (Snell et al., 2017) to compute video similarity based on generated prototypes. To learn prototypes that better generalize to novel classes with only limited labeled samples, ProtoGAN (Kumar Dwivedi et al., 2019) synthesizes additional features, CMN (Zhu & Yang, 2018, 2020) uses a multi-layer memory network, and ARN (Zhang et al., 2020b) uses jigsaws for self-supervision and enhances the video-level representation via spatial and temporal attention. There are also methods that perform pretraining with semantic labels (Wang et al., 2021; Xian et al., 2020, 2021) or use additional information such as depth (Fu et al., 2020) or optical flow (Wanyan et al., 2023) to augment video-level prototypes. Some contemporary works (Xing et al., 2023a; Wang et al., 2023b) use techniques such as reinforcement learning (Xia et al., 2023) and graph convolutional networks (Xing et al., 2023b) to improve few-shot performance, or consider the cross-domain scenario (Samarasinghe et al., 2023). The preliminary version of our method uses another form of additional information: object bounding boxes from pretrained object detectors (Huang et al., 2022). In this work, we apply a multi-level multi-relation encoder that eliminates the need for such additional information, while still achieving promising performance.

The temporal variability inherent in actions poses a significant challenge in the context of few-shot action recognition (Li et al., 2021). Unlike static images, videos introduce an additional temporal dimension, which complicates the comparison of video-level prototypes. Directly comparing such prototypes may result in matching misaligned actions and consequently lead to sub-optimal similarity measurements. To address this issue and effectively model temporal dependencies, recent research efforts have placed greater emphasis on the generation and matching of sub-video level prototypes. OTAM (Cao et al., 2020) uses a generalized dynamic time warping technique (Chang et al., 2019) to monotonically match the prototypes between query and support videos. ITA-Net (Zhang et al., 2021b) first implicitly aggregates information for each frame using other frames and then conducts a 1-to-1 matching of all prototypes. However, these two methods use frame-wise prototypes and thus cannot capture higher-level temporal relations spanning multiple frames. More recently, TRX (Perrett et al., 2021) constructs prototypes of different cardinalities for query and support videos, and calculates similarity by matching all prototype pairs. STRM (Thatipelli et al., 2021) incorporates a combination of local and global enrichment mechanisms to enhance spatio-temporal modeling based on TRX. These prototypes focus on reasoning over the static, relational aspects of actions. There are also works that leverage motion information (Wang et al., 2023a; Wu et al., 2022), task-specific features (Wang et al., 2022), or multi-level metrics (Zheng et al., 2022; Nguyen et al., 2022) for better similarity comparison.

We summarize three main differences compared with previous works: (1) We design a multi-level multi-relation encoder to jointly encode spatiotemporal information, forming more robust prototypes. Previous works either only focus on temporal relation modeling (Wang et al., 2023a; Perrett et al., 2021; Wu et al., 2022; Zheng et al., 2022), or separately match the spatial features and temporal features between videos (Nguyen et al., 2022). (2) We generate a compound of action-centered, timestamp-centered, and summarized prototypes to represent the actions from diverse perspectives. (3) The three groups of prototypes are efficiently matched to robustly compute video similarity.

2.3 Transformers

Transformers (Vaswani et al., 2017) have recently achieved remarkable success in computer vision (Carion et al., 2020; Liu et al., 2021; Deng et al., 2021; Doersch et al., 2020; Zhang et al., 2021a; Yang et al., 2021; Huang et al., 2023). FEAT (Ye et al., 2020) is representative of works that apply transformers to the few-shot learning task, and TRX (Perrett et al., 2021) first introduced Transformers (Doersch et al., 2020) into the few-shot action recognition task. Different from TRX, we apply a Transformer encoder-decoder to generate compound prototypes, and show that this is more effective in the few-shot action recognition scenario.

3 Method

3.1 Problem Setting

In the task of few-shot action recognition, the objective of a model is to classify an unlabeled video (referred to as the “query”) into one of the target categories, each of which is associated with a limited number of labeled examples (referred to as the “support set”) (Cao et al., 2020; Perrett et al., 2021). Following the common practice of previous works (Vinyals et al., 2016; Finn et al., 2017), we use episodic training, where in each episode a C-way K-shot problem is sampled: the support set \(\mathcal {S}=\{\varvec{X}^{j}_s\}_{j=1}^{C\times K}\) is composed of \(C\times K\) labeled videos from C different classes, where each class contains K samples. The query set contains N unlabeled videos \(\mathcal {Q} = \{\varvec{X}_q^i\}_{i=1}^N\). The goal is to classify each video in the query set as one of the C classes.
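
To make the episodic setup concrete, the following is a minimal sketch (not the authors' released code) of how a C-way K-shot episode with N query videos could be sampled from a pool of labeled videos; the function name and data layout are illustrative assumptions.

```python
import random

def sample_episode(videos_by_class, C=5, K=1, N=5):
    """Sample one C-way K-shot episode from a dict {class label: [video ids]}."""
    classes = random.sample(sorted(videos_by_class), C)    # C sampled classes
    support, query_pool = [], []
    for c in classes:
        vids = random.sample(videos_by_class[c], len(videos_by_class[c]))
        support += [(v, c) for v in vids[:K]]               # K labeled shots per class
        query_pool += [(v, c) for v in vids[K:]]            # remaining candidates
    query = random.sample(query_pool, N)                    # N unlabeled query videos
    return support, query
```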

3.2 Proposed Method

We present a novel approach for few-shot action recognition that involves the generation and matching of compound prototypes between query and support videos. As depicted in Fig. 2, our method starts by inputting the video to an embedding network, extracting frame-wise features \(\varvec{F}_a\) and patch-level features \(\varvec{F}_p\). \(\varvec{F}_a\) is given by global average pooling of the feature map produced by the embedding network, while \(\varvec{F}_p\) is acquired by l-level average pooling with kernel sizes \(A=\{a_1, \ldots , a_l\}\), where the pooling stride equals the kernel size. A multi-level multi-relation encoder (MM-Encoder) then uses \(\varvec{F}_a\) and \(\varvec{F}_p\) to output features \(\varvec{F}_m\) containing spatiotemporal global–local relations. Next, a compound prototype decoder generates action-centered prototypes \(\varvec{P}_a\) and timestamp-centered prototypes \(\varvec{P}_t\) for each video. From the timestamp-centered prototypes \(\varvec{P}_t\), a Transformer-based decoder called the S-former produces summarized prototypes \(\varvec{P}_s\). During similarity calculation, we use fixed 1-to-1 matching on the action-centered prototypes \(\varvec{P}_a\) and summarized prototypes \(\varvec{P}_s\), and bipartite matching on the timestamp-centered prototypes \(\varvec{P}_t\) between two videos, encouraging the similarity to be computed robustly from diverse perspectives. In the remainder of this section, we introduce the details of each component.

Fig. 2

Illustration of our proposed method on a 3-way 1-shot problem. First, the videos are processed by an embedding network to acquire global (frame-level) features \(\varvec{F}_a\) and patch features \(\varvec{F}_p\). Features \(\varvec{F}_a\) and \(\varvec{F}_p\) are equipped with 1D and 3D positional encoding (PE), respectively, and then used by an MM-Encoder (Sect. 3.2.2) to encode global-global, global-patch and patch-patch information into a multi-relation feature \(\varvec{F}_m\). Using \(\varvec{F}_m\), a Transformer-based compound prototype decoder transforms the learnable tokens \(\varvec{T}_a, \varvec{T}_t, \varvec{T}_s\) into compound prototypes that represent the input video (Sect. 3.2.3). The compound prototypes consist of several action-centered prototypes \(\varvec{P}_a\) (green squares), several timestamp-centered prototypes \(\varvec{P}_t\) (blue squares), and one summarized prototype \(\varvec{P}_s\). They are applied with different losses and different matching strategies, so that each action-centered prototype captures a certain aspect of the action summarized from the whole video, and each timestamp-centered prototype focuses on a specific temporal location of the video. The summarized prototype further summarizes the whole video clip, while its learning process enhances the temporal specificity of the timestamp-centered prototypes. The final similarity score is calculated as the average similarity of all matched prototype pairs between support and query videos (Color figure online)

3.2.1 Feature Embedding

The goal of feature embedding is to comprehensively capture the spatiotemporal relations present in the videos. An apparent approach is to utilize object features extracted from object bounding boxes (He et al., 2017) to model the object relations. Since it may not be feasible to use an additional object detector when computational resources are limited, we additionally propose a patch-based feature encoding method as a trade-off between speed and performance.

Patch-based feature embedding We first introduce the patch-based feature embedding method. For each input video \(\varvec{X} \in \mathcal {S} \cup \mathcal {Q}\), we sample T frames following the sampling strategy of TSN (Wang et al., 2016). An embedding network first encodes the videos into a feature map \(\varvec{F}_v\in \mathbb {R}^{W\times H \times T\times d}\), where W and H are the spatial sizes of the feature map and d is the feature dimension. The global (frame-level) feature representation for each video, \(\varvec{F}_a\in \mathbb {R}^{T\times d}\), is acquired by global average pooling of \(\varvec{F}_v\), while the patch-level features are obtained by multiple patch-level average-pooling operations. We use a total of l levels of average pooling, where level i has kernel size \(a_i\) and stride \(a_i\), resulting in patch-level features \(\varvec{F}_p \in \mathbb {R}^{LT\times d}\), where \(L = \sum _{i=1}^l \lceil \frac{HW}{a_i^2} \rceil \). Specifically, this is done by dividing the feature map into a grid of \(a_i\times a_i\) cells and pooling the features inside each cell of the grid.
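
The two feature streams can be sketched as follows under the settings above (kernel size equal to stride); the tensor layout follows the PyTorch (T, d, H, W) convention rather than the W×H×T×d notation of the text, and the backbone call is assumed to happen elsewhere.

```python
import torch
import torch.nn.functional as F

def multi_level_features(feat_map, kernel_sizes=(2, 3)):
    """feat_map: backbone output for one video, shape (T, d, H, W).

    Returns:
      F_a: frame-level features, shape (T, d), via global average pooling.
      F_p: patch-level features, shape (L*T, d), where L is the total number
           of patches over all pooling levels.
    """
    T, d, H, W = feat_map.shape
    F_a = feat_map.mean(dim=(2, 3))                          # (T, d)
    patches = []
    for a in kernel_sizes:
        p = F.avg_pool2d(feat_map, kernel_size=a, stride=a, ceil_mode=True)
        patches.append(p.flatten(2).permute(0, 2, 1))        # (T, patches_per_frame, d)
    F_p = torch.cat(patches, dim=1).reshape(-1, d)           # (L*T, d)
    return F_a, F_p
```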

Object-based feature embedding If external object detectors are available, it is possible to obtain finer object features for object relation modeling. We also experiment with this setting for a more comprehensive understanding. With the bounding boxes provided by pre-trained object detectors, we extract object features via ROI-Align (He et al., 2017) using the predicted bounding boxes on each frame. In this scenario, we use only the B most confident boxes on each frame, forming object features \(\varvec{F}_o\in \mathbb {R}^{BT\times d}\) as a replacement for \(\varvec{F}_p\). For simplicity, in the remainder of this section we use only \(\varvec{F}_p\) to illustrate the rest of our model.
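
When an external detector is available, the object stream can be sketched with torchvision's ROI-Align as below; the box format, the number of boxes B per frame, and the spatial scale of the feature map are assumptions made for illustration only.

```python
import torch
from torchvision.ops import roi_align

def object_features(feat_map, boxes_per_frame, B=3, spatial_scale=1.0 / 32):
    """feat_map: (T, d, H, W) backbone features; boxes_per_frame: list of T tensors
    of shape (num_boxes, 4) in (x1, y1, x2, y2) image coordinates, sorted by
    detector confidence. Keeps the B most confident boxes per frame."""
    boxes = [b[:B] for b in boxes_per_frame]
    pooled = roi_align(feat_map, boxes, output_size=1,       # (B*T, d, 1, 1)
                       spatial_scale=spatial_scale, aligned=True)
    return pooled.flatten(1)                                 # F_o: (B*T, d)
```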

3.2.2 Multi-level Multi-relation Encoder

To generate prototypes that are discriminative for actions involving multiple spatiotemporal regions, we propose a multi-relation encoder to encode the spatiotemporal information from \(\varvec{F}_a\) and \(\varvec{F}_p\). We specifically consider the following three relations: the global action-global action relation, the global-patch relation, and the patch-patch relation. To enable effective modeling of these relations across frames, we adopt the transformer as our base architecture, which has demonstrated strong performance in capturing long-range dependencies (Vaswani et al., 2017). As shown in Fig. 2, our encoder comprises three relation encoding transformers (RETs). The global action-global action RET (RET\(_{aa}\)) and the patch-patch RET (RET\(_{pp}\)) share the same structure but differ in their inputs: RET\(_{aa}\) takes the global action feature \(\varvec{F}_a\) as input, while RET\(_{pp}\) takes the patch features \(\varvec{F}_p\). Both RETs use their input to generate the query \(\varvec{Q}\), key \(\varvec{K}\) and value \(\varvec{V}\) vectors used in the transformer:

$$\begin{aligned}&\varvec{F}_{aa} = RET _{aa}(\varvec{Q}=\varvec{K}=\varvec{V}=\varvec{F}_a) \end{aligned}$$
(1)
$$\begin{aligned}&\varvec{F}_{pp} = RET _{pp}(\varvec{Q}=\varvec{K}=\varvec{V}=\varvec{F}_p), \end{aligned}$$
(2)

The global-patch RET (RET\(_{ap}\)) works slightly differently, where it maps \(\varvec{F}_a\) as query vector, while \(\varvec{F}_p\) as key and value vectors:

$$\begin{aligned} \varvec{F}_{ap} = RET _{ap}(\varvec{Q}=\varvec{F}_a, \; \varvec{K}=\varvec{V}=\varvec{F}_p). \end{aligned}$$
(3)

Each RET within the MM-Encoder outputs a feature vector with the same dimension as its input query vector. As a result, for each of the T frames in the video, we obtain \(L+2\) feature vectors with dimension d. To integrate these relations we concatenate the outputs from the three RETs, resulting in a multi-relation feature \(\varvec{F}_m\in \mathbb {R}^{(L+2)T\times d}\).
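
A minimal sketch of the MM-Encoder is given below, implementing Eqs. (1)–(3) with PyTorch multi-head attention; the single-layer depth, head count, feed-forward design, and feature dimension are assumptions, and positional encodings are omitted here as in the equations.

```python
import torch
import torch.nn as nn

class RET(nn.Module):
    """One relation encoding transformer: attention followed by a feed-forward block."""
    def __init__(self, d=2048, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, q, k, v):
        x = self.norm1(q + self.attn(q, k, v)[0])
        return self.norm2(x + self.ffn(x))

class MMEncoder(nn.Module):
    def __init__(self, d=2048):
        super().__init__()
        self.ret_aa, self.ret_ap, self.ret_pp = RET(d), RET(d), RET(d)

    def forward(self, F_a, F_p):
        """F_a: (1, T, d); F_p: (1, L*T, d). Returns F_m: (1, (L+2)*T, d)."""
        F_aa = self.ret_aa(F_a, F_a, F_a)    # Eq. (1): global-global relation
        F_pp = self.ret_pp(F_p, F_p, F_p)    # Eq. (2): patch-patch relation
        F_ap = self.ret_ap(F_a, F_p, F_p)    # Eq. (3): global-patch relation
        return torch.cat([F_aa, F_ap, F_pp], dim=1)
```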

In line with the effectiveness of positional encoding (PE) in transformer-based architectures (Vaswani et al., 2017; Lu et al., 2021; Liu et al., 2021), we incorporate PE into our model. For \(\varvec{F}_{a}\) we use 1D PE to encode the temporal location of each frame, allowing the model to capture the temporal order of and relationships between frames. For \(\varvec{F}_{p}\) we apply 3D PE that incorporates both spatial and temporal information, enabling the model to effectively distinguish between different patches and capture their dynamics over time. For simplicity, we omit the explicit PE terms from the equations unless otherwise stated.
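
The text does not fix the exact form of the PE; as one possible instantiation, a standard sinusoidal encoding could serve as the 1D temporal PE, with the 3D variant built by combining analogous encodings of the temporal index and the two spatial patch indices.

```python
import math
import torch

def sinusoidal_pe(num_positions, d):
    """Standard 1D sinusoidal positional encoding of shape (num_positions, d); d assumed even."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))
    pe = torch.zeros(num_positions, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# 1D temporal PE: add sinusoidal_pe(T, d) to F_a.
# 3D PE for F_p: e.g. combine separate encodings of the frame index and the two
# spatial patch indices of each patch token.
```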

3.2.3 Compound Prototype Decoder

The compound prototype decoder in our approach also adopts the transformer architecture (Vaswani et al., 2017; Sun et al., 2021; Cong et al., 2021; Carion et al., 2020), enabling the prototypes to be generated by considering features of all frames through the self-attention mechanism. As depicted in Fig. 2, the input to the prototype decoder contains three groups of learnable tokens \(\varvec{T}_a\in \mathbb {R}^{m_{a} \times d}\), \(\varvec{T}_t\in \mathbb {R}^{m_{t} \times d}\) and \(\varvec{T}_s\in \mathbb {R}^{d}\). A multi-head attention layer first encodes the tokens \(\varvec{T}_a,\varvec{T}_t\) into \(\hat{\varvec{T}}_a\) and \(\hat{\varvec{T}}_t\). Subsequently, another multi-head attention layer is employed to further transform these representations into two groups of prototypes \(\varvec{P}_a=\{\varvec{p}_{a,k}\}_{k=1}^{m_a}\in \mathbb {R}^{m_{a} \times d}\) and \(\varvec{P}_t=\{\varvec{p}_{t,k}\}_{k=1}^{m_t}\in \mathbb {R}^{m_{t} \times d}\). For simplicity, we omit the subscripts \(_{a,t}\) and all normalization layers, thus the equation can be written as:

$$\begin{aligned} \varvec{Q} = \hat{\varvec{T}}\varvec{W}_Q, \quad \varvec{K} = \varvec{F}_m\varvec{W}_K, \quad \varvec{V} = \varvec{F}_m\varvec{W}_V, \end{aligned}$$
(4)

where \(\varvec{W}_Q, \varvec{W}_K, \varvec{W}_V \in \mathbb {R}^{d \times d}\) are linear projection weights, then we have

$$\begin{aligned} \varvec{A} = softmax\left( \frac{\varvec{Q}\varvec{K}^T}{\sqrt{d}}\right) , \quad \varvec{P} = FFN(\varvec{A}\varvec{V}), \end{aligned}$$
(5)

where \(\varvec{A} \in \mathbb {R}^{m\times (L+2)T}\) denotes the self-attention weights, and FFN denotes the feed-forward network.
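
A sketch of the compound prototype decoder corresponding to Eqs. (4)–(5) is shown below for one group of learnable tokens (the action-centered and timestamp-centered groups are processed analogously); the normalization layers and exact depth are simplified assumptions.

```python
import torch
import torch.nn as nn

class PrototypeDecoder(nn.Module):
    """Generates one group of prototypes (e.g. P_a or P_t) from learnable tokens."""
    def __init__(self, num_tokens=8, d=2048, heads=8):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, d))      # learnable T_a or T_t
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, F_m):
        """F_m: (1, (L+2)*T, d). Returns P: (m, d) and attention A: (m, (L+2)*T)."""
        T = self.tokens.unsqueeze(0)                                 # (1, m, d)
        T_hat = T + self.self_attn(T, T, T)[0]                       # encode the tokens
        attended, A = self.cross_attn(T_hat, F_m, F_m)               # Eqs. (4)-(5): A V
        P = self.ffn(attended)                                       # P = FFN(A V)
        return P.squeeze(0), A.squeeze(0)
```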

Furthermore, an S-former, which consists of a multi-head attention layer and a feed-forward network, takes the learnable token \(\varvec{T}_s\) as the query vector and the timestamp-centered prototypes \(\varvec{P}_t\) as the key and value vectors. It generates a summarized prototype \(\varvec{P}_s\) for each video:

$$\begin{aligned} \varvec{P}_s = SF(\varvec{Q}=\varvec{T}_s, \; \varvec{K}=\varvec{V}=\varvec{P}_t) \end{aligned}$$
(6)

With the accompanying loss function \(L_{inc}\) as a constraint (described in Eq. 10), the summarized prototype further summarizes the video in a comprehensive manner and promotes the temporal specificity of the timestamp-centered prototypes \(\varvec{P}_t\).

To promote the diverse representation of different aspects of the action, we impose constraints on the three sets of prototypes individually. Specifically, for the action-centered prototypes \(\varvec{P}_a\), we expect each prototype to describe the action from a different perspective. Thus, we introduce a diversity loss term to maximize their diversity:

$$\begin{aligned} L_{div} = \sum _{i\ne j} sim(\varvec{p}_{a,i},\; \varvec{p}_{a,j}), \end{aligned}$$
(7)

where sim denotes the cosine similarity function.

Learning \(\varvec{P}_a\) to robustly represent each aspect of the action (e.g., the start of the action) is difficult even with full annotation (Xu et al., 2020; Zeng et al., 2019). To increase the overall robustness, we encourage the timestamp-centered prototypes \(\varvec{P}_t\) to describe specific temporal locations of a video. We therefore use an attention loss that adds a regularization on the self-attention weights \(\varvec{A}_t\) so that different \(\varvec{p}_t\) focus on different temporal locations of the video:

$$\begin{aligned} L_{att} = \sum _{i\ne j} sim(\varvec{\alpha }_{t,i},\; \varvec{\alpha }_{t,j}). \end{aligned}$$
(8)

Here \(\varvec{\alpha }_{t,i} \in \mathbb {R}^{(L+2)T}\) denotes the ith row in \(\varvec{A}_t\). This regularization promotes diversity and flexibility in the temporal focus of the timestamp-centered prototypes, enhancing their ability to capture various temporal aspects of the action.
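
Since the diversity loss (Eq. 7) and the attention loss (Eq. 8) share the same pairwise cosine-similarity form, applied to the action-centered prototypes and to the rows of \(\varvec{A}_t\) respectively, both can be sketched with one helper; this is an illustrative implementation, not the authors' code.

```python
import torch.nn.functional as F

def pairwise_similarity_loss(x):
    """Sum of cosine similarities over all pairs i != j.

    x: (m, d) action-centered prototypes P_a for L_div (Eq. 7), or the rows of
    the attention matrix A_t for L_att (Eq. 8)."""
    x = F.normalize(x, dim=-1)
    sim = x @ x.t()                                  # (m, m) cosine similarities
    return sim.sum() - sim.diagonal().sum()          # drop the i == j terms
```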

Since the timestamp-centered prototypes are constrained to focus on specific temporal locations of a video, shuffling their order should make a large difference. For example, reversing the order of the action “take” results in the action “put”. Based on this observation, we design an inconsistency loss \(L_{inc}\) and apply it to the summarized prototypes to enhance the temporal specificity of the timestamp-centered prototypes, as well as to further summarize the action as a whole. For each \(\varvec{P}_s\), we generate x auxiliary summarized prototypes \(\hat{\varvec{P}}_s\) by randomly shuffling the order of \(\varvec{P}_t\). Equivalently, we realize this by applying different (shuffled) positional encodings to \(\varvec{P}_t\) before inputting it to the S-former:

$$\begin{aligned} \hat{\varvec{P}}_s^x = SF\big (\varvec{Q}=\varvec{T}_s, \; \varvec{K}=\varvec{V}=\varvec{P}_t + \hat{PE}^x(\varvec{P}_t)\big ), \end{aligned}$$
(9)

where \(\hat{PE}^x(\cdot )\) denotes the xth shuffled positional encoding. The \(\hat{PE}^x(\cdot )\) are shown as red squares with dotted borders in Fig. 2. Since we hope to ensure the temporal specificity of the timestamp-centered prototypes, we add an inconsistency loss between \(\varvec{P}_s\) generated by normal positional encoding and \(\hat{\varvec{P}}_s\) generated with shuffled positional encoding:

$$\begin{aligned} L_{inc} = \sum _{i=1}^x sim\left( \varvec{P}_s, \; \hat{\varvec{P}}_s^i\right) \end{aligned}$$
(10)
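
A sketch of the shuffling mechanism and the inconsistency loss of Eqs. (9)–(10) follows; `sformer` is assumed to be a small cross-attention module with the same structure as the prototype decoder above, and the positional encoding table `pe` is shuffled by permuting its rows.

```python
import torch
import torch.nn.functional as F

def inconsistency_loss(sformer, T_s, P_t, pe, num_shuffles=2):
    """Eq. (10): similarity between the normally ordered summary and summaries
    built from shuffled positional encodings of the timestamp-centered prototypes.

    sformer(query, key, value) -> summarized prototype (a cross-attention module);
    T_s: (1, d) learnable token; P_t: (m_t, d); pe: (m_t, d) positional encodings."""
    P_s = sformer(T_s, P_t + pe, P_t + pe)                        # Eq. (6)
    loss = 0.0
    for _ in range(num_shuffles):
        perm = torch.randperm(P_t.size(0))                        # shuffle the ordering
        P_s_hat = sformer(T_s, P_t + pe[perm], P_t + pe[perm])    # Eq. (9)
        loss = loss + F.cosine_similarity(P_s, P_s_hat, dim=-1).mean()
    return loss
```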

3.2.4 Compound Prototype Matching

In conjunction with the compound prototypes, we employ distinct matching strategies to compute the overall similarity between two videos. As shown in Fig. 2, for the action-centered prototypes \(\varvec{P}_a\), we adopt a 1-to-1 matching scheme, i.e., the ith prototype of the query video is always matched with the ith prototype of the support video. To calculate the action-centered prototypes’ overall similarity score between video x and video y, we average the similarity score of all the \(m_a\) action-centered prototypes:

$$\begin{aligned} s_a^{x,y} = \frac{1}{m_a}\sum _{k=1}^{m_a}sim \left( \varvec{p}_{a,k}^x,\; \varvec{p}_{a,k}^y\right) . \end{aligned}$$
(11)

Thus, to maximize the similarity score of correct video pairs and minimize the similarity of incorrect video pairs during episodic training, each \(\varvec{p}_a\) will try to encode a certain aspect of the action, e.g., the start of the action. This phenomenon is supported by our experiments in Sect. 4.

For the timestamp-centered prototypes \(\varvec{P}_t\), we apply a bipartite matching-based similarity measure. Since different actions may happen at different temporal positions in the videos, the bipartite matching enables the temporal alignment of actions, allowing the comparison of actions of different lengths and at different speeds. Formally, for \(\varvec{P}_t^x\) of video x and \(\varvec{P}_t^y\) of video y, we find a bipartite matching between the two sets of prototypes by searching for the permutation of \(m_t\) elements with the highest cosine similarity using the Hungarian algorithm (Kuhn, 1955). Denoting the best permutation by \(\sigma \), the similarity score of \(\varvec{P}_t^x\) and \(\varvec{P}_t^y\) is calculated as:

$$\begin{aligned} s_t^{x,y} = \frac{1}{m_t}\sum _{k=1}^{m_t} sim \left( \varvec{p}_{t,k}^x,\; \varvec{p}_{t,\sigma (k)}^y\right) . \end{aligned}$$
(12)
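
Matching the timestamp-centered prototypes is a linear assignment problem, which can be solved with SciPy's Hungarian implementation; a sketch of Eq. (12), with the assignment computed on detached scores so that gradients flow only through the selected similarities:

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def timestamp_similarity(P_t_x, P_t_y):
    """Eq. (12): average cosine similarity under the best 1-to-1 permutation.

    P_t_x, P_t_y: (m_t, d) timestamp-centered prototypes of two videos."""
    sim = F.normalize(P_t_x, dim=-1) @ F.normalize(P_t_y, dim=-1).t()   # (m_t, m_t)
    row, col = linear_sum_assignment(-sim.detach().cpu().numpy())       # maximize similarity
    return sim[torch.as_tensor(row), torch.as_tensor(col)].mean()
```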

For the summarized prototypes \(\varvec{P}_s\), the matching is straightforward:

$$\begin{aligned} s_s^{x,y} = sim\left( \varvec{P}_s^x, \; \varvec{P}_s^y\right) \end{aligned}$$
(13)

Finally, the similarity score is computed as a weighted average of \(s_a\), \(s_t\), and \(s_s\): \(s^{x,y} = \lambda _1 s_a^{x,y} + \lambda _2 s_t^{x,y} + \lambda _3 s_s^{x,y}\). During training, this similarity score is directly regarded as the logits for the cross-entropy loss \(L_{ce}\). The total loss function is a weighted sum of four losses:

$$\begin{aligned} L = w_1L_{ce} + w_2L_{div} + w_3L_{att} + w_4L_{inc}. \end{aligned}$$
(14)
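
Putting the pieces together, the weighted similarity and the total loss of Eq. (14) could be assembled as in the sketch below, using the default weights reported in Sect. 4.2; the per-class similarity tensors `sims['a']`, `sims['t']`, `sims['s']` are assumed to have been computed with the matching functions above.

```python
import torch.nn.functional as F

def episode_loss(sims, query_labels,
                 lambdas=(0.4, 0.4, 0.2), weights=(1.0, 0.1, 0.1, 0.1),
                 L_div=0.0, L_att=0.0, L_inc=0.0):
    """sims: dict with keys 'a', 't', 's' -> (N_query, C) per-class similarity scores.

    The weighted similarity is used directly as logits for the cross-entropy loss,
    and combined with the auxiliary losses as in Eq. (14)."""
    logits = lambdas[0] * sims['a'] + lambdas[1] * sims['t'] + lambdas[2] * sims['s']
    L_ce = F.cross_entropy(logits, query_labels)
    w1, w2, w3, w4 = weights
    total = w1 * L_ce + w2 * L_div + w3 * L_att + w4 * L_inc
    predictions = logits.argmax(dim=1)       # inference: class of the most similar support
    return total, predictions
```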

During the inference phase, we assign the label of the query video to be the same as the label of the most similar video in the support set. By associating the query video with the label of its most similar counterpart in the support set, we make predictions for the query video in a few-shot manner, leveraging the available labeled examples in the support set.

4 Experiments

We conduct experiments on five publicly available datasets: Kinetics (Carreira & Zisserman, 2017), Something-something V2 (SSv2) (Goyal et al., 2017), HMDB (Kuehne et al., 2011), UCF (Soomro et al., 2012) and EPIC-Kitchens (Damen et al., 2018).

The Kinetics and SSv2 datasets are split into 64/12/24 classes for training, validation, and testing, respectively. In the case of SSv2, we use both the split from CMN (Zhu & Yang, 2018) (referred to as SSv2-Small) and the split from OTAM (Cao et al., 2020) (referred to as SSv2-Full). Recently, Zhang et al. (2020b) proposed new splits for the HMDB (Kuehne et al., 2011) and UCF (Soomro et al., 2012) datasets, and Wang et al. (2022) proposed a new split for the EPIC-Kitchens dataset; we conduct experiments on these datasets using the corresponding splits. We report the performance of our method in both the standard 1-shot 5-way setting and the 5-shot 5-way setting. To ensure a reliable and consistent evaluation, following previous works, we report the average result over 10,000 test episodes.

Additionally, for the EPIC-Kitchens (EK) dataset, since it contains fine-grained annotations of verbs and nouns (Damen et al., 2018), we create a new split on the EPIC-Kitchens dataset, considering the fine-grained actions. This is different from the split from Wang et al. (2022) where only coarse action labels (verbs) are considered. By creating this new split that takes into account the fine-grained actions, we aim to provide a more detailed evaluation of our method’s performance on the EPIC-Kitchens dataset.

4.1 Baselines

We compare our method with recent works reporting state-of-the-art performance, including MatchNet (Vinyals et al., 2016), CMN (Zhu & Yang, 2020), OTAM (Cao et al., 2020), TRN (Zhou et al., 2018), ARN (Zhang et al., 2020b), TRX (Perrett et al., 2021), ITA-Net (Zhang et al., 2021b), MTFAN (Wu et al., 2022), Nguyen et al. (2022), STRM (Thatipelli et al., 2021), HyRSM (Wang et al., 2022) and MoLo (Wang et al., 2023a). Following Zhu et al. (2021a), we also compare with the few-shot image classification model FEAT (Ye et al., 2020), which is also based on transformers. In Sect. 4.3 we consider the standard few-shot learning setting, which is consistent with the equations described in our method. In addition, for a fair comparison with some of the previous works, we experiment in Sect. 4.8.4 with another setting that uses semantic labels in the support set. This additional setting deviates from the standard few-shot learning setting: an extra classification layer with a cross-entropy loss is added to the model to predict the semantic class labels of the inputs in the support set.

Since this work is an extension of our previous work (Huang et al., 2022), we also compare with that method (denoted as “Ours-”). Our previous method used object features extracted from off-the-shelf object detectors. Since no previous works used object detectors in few-shot action recognition, for a fair comparison we exclude the use of object detectors in these comparisons and discuss the effect of object features later in this section.

4.2 Implementation Details

For most of our experiments, we use ResNet-50 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009) as the backbone of the embedding network, and a fixed Mask-RCNN (He et al., 2017) trained on the COCO dataset (Lin et al., 2014) as the bounding box extractor. We also discuss the effect of changing backbones in the experiments. We use \(l=2\) levels of multi-level average pooling, where the pooling kernel sizes and strides are \(a_1 = 2, a_2=3\). The pre-processing steps follow OTAM (Cao et al., 2020): we sample \(T=8\) frames with random cropping during training and center cropping during inference. We also apply additional data augmentation during training, i.e., random color jittering and random elastic transform. To optimize our model, we employ stochastic gradient descent (SGD) with an initial learning rate of 0.001, decayed by a factor of 0.1 every 20 epochs. The embedding network, except for its first BatchNorm layer, is fine-tuned with a learning rate reduced by a factor of 1/10. We stop training if the loss on the validation set exceeds the average of the previous 5 epochs. Unless specified otherwise, we report results using \(m_a=m_t=8\) for the numbers of action-centered and timestamp-centered prototypes, \(\lambda _1=\lambda _2=0.4, \lambda _3=0.2\) for the weighting factors, and \(w_1=1, w_2=w_3=w_4=0.1\) for the loss weights.
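
For reference, the reported optimization settings translate roughly into the following sketch; the momentum value and the parameter-group split are our assumptions, and the frozen first BatchNorm layer is omitted for brevity.

```python
import torch

def build_optimizer(backbone, head, base_lr=0.001):
    """SGD with the backbone fine-tuned at 1/10 of the base learning rate,
    decayed by a factor of 0.1 every 20 epochs (Sect. 4.2); momentum is assumed."""
    params = [
        {"params": head.parameters(), "lr": base_lr},
        {"params": backbone.parameters(), "lr": base_lr / 10},
    ]
    optimizer = torch.optim.SGD(params, lr=base_lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    return optimizer, scheduler
```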

Table 1 Results of 5-way 1-shot experiments on 6 dataset splits
Table 2 Result comparison of 5-way 5-shot experiments on 6 dataset splits

4.3 Results

Tables 1 and 2 show result comparisons with baseline methods under the 1-shot and 5-shot settings. In the upper block of both tables, our model demonstrates superior performance compared to previous works on 5 of the 6 dataset splits. This suggests that, compared with other prototype generation methods, our proposed approach of generating and matching compound prototypes enables better similarity measurement for few-shot action recognition. Specifically, our method consistently outperforms STRM (Thatipelli et al., 2021) and Nguyen et al. (2022), which also leverage both spatial and temporal relations, indicating that jointly considering the spatial and temporal relations generates more detailed prototypes and that our carefully designed prototype matching method has a great impact on few-shot action recognition performance. We give a further analysis of how the prototype generation step and the prototype matching step each contribute to the final result later in this section. The comparison between our method and its previous version, “Ours-” (Huang et al., 2022), demonstrates the effect of the spatiotemporal modeling. In the ablation study, we show that even better performance can be achieved by carefully adjusting the numbers of prototypes \(m_a\) and \(m_t\). Overall, our findings indicate that our proposed compound prototype generation and matching approach contributes significantly to the improved performance in few-shot action recognition.

Our method does not achieve state-of-the-art results on the UCF dataset. One possible reason is that the classes in the UCF dataset can be easily distinguished based solely on appearance, so techniques such as the contrastive learning in MoLo (Wang et al., 2023a) deliver excellent performance gains. The relative simplicity of the dataset also causes our MM-encoder to overfit. We experimented with removing the MM-encoder and directly utilizing the concatenation of local and global features as input to the compound prototype decoder. Interestingly, this modification allowed our method to achieve a state-of-the-art accuracy of 86.8% under the 1-shot setting on the UCF dataset. This observation highlights a limitation of our method, namely its susceptibility to overfitting on simple datasets.

Table 3 Result comparison of 5-way 1-shot, and 5-shot experiments on the EpicKitchens-Fine dataset

For the experiments conducted on the EPIC-Kitchens dataset, it is important to note that the previous method (Wang et al., 2022) only considered the recognition of the coarse verb labels in the dataset. The comparison of results with this previous method is provided in Table 1. To provide a more comprehensive evaluation of the models’ ability in fine-grained action recognition, we created a new dataset split using the EPIC-Kitchens dataset (Damen et al., 2018). This split, which is referred to as EpicKitchens-Fine, specifically focuses on fine-grained action recognition. We test the performance of our method on this new split to assess its effectiveness in capturing and recognizing the detailed nuances of actions within the EpicKitchens dataset. Meanwhile, since the fine-grained action is defined as the combination of a verb and a noun, we can better see the models’ ability to quickly and accurately capture novel verbs and novel nouns in the few-shot setting.

Results on EpicKitchens-Fine are shown in Table 3. Because of the fine-grained annotation, where an action is labeled as a “verb-noun” pair, we can further analyze model performance when the noun is seen during training but the verb is not (verb unseen), and vice versa (noun unseen), as well as the case where neither the verb nor the noun is seen in the training set (both unseen). There are 6, 25, and 5 categories in these cases, respectively. From the table, it is clear that our method outperforms all previous methods. Importantly, the gaps between our method and previous work in the “noun unseen” and “both unseen” cases are large. This indicates that our method has a stronger capability in modeling the motion in the video, which is critical in the field of video action recognition.

Table 4 Results comparison of different methods when using/not using multi-level information, and using/not using our proposed encoder
Table 5 Results comparison of our model using different encoding relations with different numbers of action-centered prototypes, timestamp-centered prototypes, and summarized prototypes

4.4 Ablation Study

4.4.1 Effect of Multi-level Features

The use of multi-level features is essential for our model to jointly capture spatial and temporal information in the video, enabling finer feature representations and thus better results in few-shot action recognition. One may argue that the performance gain comes simply from the use of multi-level features. To inspect the source of the performance improvement, we test multiple methods with and without multi-level information, and with and without our multi-relation encoder.

In Table 4, we present a comparative analysis of different encoder configurations. In the first block, neither multi-level features nor our multi-relation encoder are employed. The second block utilizes our multi-relation encoder but only considers the global-global relation \(RET_{aa}\). Comparing these two blocks, we observe that our multi-relation encoder improves the performance of all methods, though not significantly. In the third block of Table 4, we concatenate each frame-wise feature with its corresponding multi-level features as input. The comparison between this block and the first block demonstrates the benefits of using multi-level features. Specifically, the addition of multi-level features leads to improvements of approximately 1–3% on the SSv2 dataset and 0.3–5% on the Kinetics dataset. Finally, in the fourth block of Table 4, both multi-level features and our multi-relation encoder are used. Comparing this block with the second and third blocks, we observe that all methods improve further. This indicates that incorporating our multi-relation encoder to consider multiple relations across frames allows for better utilization of the information provided by the multi-level features. Among all methods in the fourth block of Table 4, our method enjoys the most significant performance gain. This suggests that while the multi-level features bring additional information, our method excels at leveraging this information to enhance few-shot action recognition performance.

4.4.2 Impact of Multi-relation Feature Encoding

To investigate how the multi-relation feature encoding contributes to the performance, we conduct an ablation study in which we consider only subsets of \(\{\varvec{F}_{aa}, \varvec{F}_{ap}, \varvec{F}_{pp}\}\). Additionally, we vary the number of action-centered prototypes \(m_a\), timestamp-centered prototypes \(m_t\), and summarized prototypes \(m_s\) to assess the influence of feature encoding on each group of prototypes.

Results can be seen in Table 5. From the experiments with \(m_a=m_t=8\), the SSv2-Full dataset benefits significantly from the inclusion of the patch-patch feature \(\varvec{F}_{pp}\), while the Kinetics dataset shows greater improvement when utilizing the global-global feature \(\varvec{F}_{aa}\). This observation is reasonable since the SSv2 dataset includes a larger variety of actions involving the motion of objects. From the experiments with \(m_a=16\), the action-centered prototypes seem to work similarly well with the three encoded features on both datasets. The experiments with \(m_t=16\) suggest that the timestamp-centered prototypes work better with global-global relations. When using all three features (last row of each block), the comparison between different choices of \(m_a\) and \(m_t\) reveals that the two groups of prototypes capture complementary aspects of the action. This is evident from the significant performance improvements observed when both groups of prototypes are present.

4.5 Class Improvement of Multi-relation Encoding

To analyze the impact of each relation in our MM-encoder, we measure the class-level improvement when encoding all three relations (global-global, global-patch, patch-patch) compared with encoding only one of them. In Fig. 3, yellow bars show the class accuracy difference between encoding all relations and encoding only the global-global relation. Blue and green bars indicate the difference between encoding all relations and encoding only the global-patch relation and only the patch-patch relation, respectively. We can see that, in general, the patch-patch relation greatly helps to improve the classification performance. However, for specific classes such as “tipping something over”, “pushing something from right to left”, and “pulling something from left to right”, the combination of all three relations demonstrates the most significant benefit in terms of accuracy improvement. These findings highlight the importance of considering multiple relations in our MM-encoder, and show how the different relations contribute to capturing the nuances of various actions.

Fig. 3

Class accuracy improvement when encoding 3 relations compared with encoding only one relation. The colors indicate the improvement when compared with: global-global only (yellow), global-patch only (blue), and patch-patch only (green) (Color figure online)

4.6 Analysis of Compound Prototypes

The core of our proposed method is the generation and matching of compound prototypes. In this section, we present a more comprehensive analysis of these prototypes. Through a series of extensive experiments, we aim to gain a deeper understanding of their characteristics and their impact on few-shot action recognition performance.

Fig. 4

Visualization of the self-attention weight of two action-centered prototypes and two timestamp-centered prototypes on each of the 8 timestamps of the input. Attention weights higher than average (0.125) are marked in black. We can see the action-centered prototypes capture a certain aspect of the action in the video, regardless of temporal location: \(\varvec{p}_{a,2}\)—the start of the action; \(\varvec{p}_{a,6}\) - the frames where no hand is present. Meanwhile, the timestamp-centered prototypes mainly attend to fixed timestamps of the video: \(\varvec{p}_{t,1}\)—the end of the video; \(\varvec{p}_{t,2}\)—the middle part of the video. The example to the left comes from the SSv2-Small dataset and the example to the right is from SSv2-Full. Video similarity scores s and similarity scores of matched prototypes \(\varvec{p}_*\sim \varvec{p}_*\) are shown at the bottom

Fig. 5

Visualization of the self-attention weight of two action-centered prototypes and two timestamp-centered prototypes on each of the 8 timestamps of the input. Attention weights higher than average (0.125) are marked in black. The example to the left comes from the SSv2-Full dataset and the example to the right is from Kinetics. The right example shows a failure case where our model wrongly predicts the query to be the class “unboxing”. Video similarity scores s and similarity scores of matched prototypes \(\varvec{p}_*\sim \varvec{p}_*\) are shown at the bottom

4.6.1 Comparison Between Different Matching Methods

Recent works also design different methods to better match the prototypes for few-shot action recognition. Table 4 shows the comparison between different matching methods (the “Prototype” column) with the same input. In the first block, when using only the backbone feature, our method does not outperform the recent competitive works HyRSM and MoLo. Our method also performs only comparably to these two methods when using only our encoder without the multi-level features, or when simply concatenating the multi-level features as input. Combined with our MM-encoder in the fourth block, the full potential of our method is realized, resulting in superior performance compared to all previous methods.

4.6.2 What Aspect of the Action Does Each Prototype Capture?

To better understand the generated prototypes and how they are associated with the action in the video, we investigate the self-attention operation that generates the prototypes. From Eq. 5, the self-attention weight of each prototype, \(\varvec{\alpha }\in \mathbb {R}^{(L+2)\times T}\), represents the amount of information the prototype gathers from each part of the video. A larger attention weight indicates that a prototype focuses more on that frame when generating its representation. We conduct experiments on four 1-shot 2-way examples and visualize this attention in Figs. 4 and 5. In the visualization we average the attention weights within each frame, forming \(\tilde{\varvec{\alpha }} \in \mathbb {R}^{T}\), and show this averaged value on each of the \(T=8\) frames. For clarity we only show 2 action-centered prototypes and 2 timestamp-centered prototypes in each video. Additionally, we provide the video similarity scores and the similarity of matched prototypes at the bottom of the figures for reference.

We first focus on the visualization shown in Fig. 4. From the figure, we observe distinct attention patterns for both action-centered prototypes and timestamp-centered prototypes in all videos. The action-centered prototype \(\varvec{p}_{a,2}\) consistently exhibits high attention weights on the start of the action (not the start of the video), and \(\varvec{p}_{a,6}\) tends to pay more attention to the frames that contain appearance changes compared to other frames (i.e., frames where no hand is present). This behavior is aligned with our expectations, as the diversity loss \(L_{div}\) enforces each action-centered prototype to capture a different aspect of the action. Meanwhile, the 1-to-1 matching encourages corresponding action-centered prototypes to focus on similar aspects across videos, which enables correct video similarity prediction. Regarding the timestamp-centered prototypes, \(\varvec{p}_{t,1}\) often gives high attention to the last few frames, and \(\varvec{p}_{t,2}\) focuses more on the middle frames. This behavior is also in line with our expectations due to the attention loss \(L_{att}\), which prevents the timestamp-centered prototypes from attending to similar temporal locations. The bipartite matching further allows similar actions to be matched even when they are at different temporal positions in the videos. Similar patterns have been observed in the object detection task (Carion et al., 2020), where each object query focuses on detecting objects in specific spatial locations of the image.

In the left example, both pairs of action-centered prototypes, \(<\varvec{p}_{a,2}^a,\varvec{p}_{a,2}^b>\) and \(<\varvec{p}_{a,6}^a,\varvec{p}_{a,6}^b>\), exhibit high similarity scores (0.56 and 0.52 shown at the bottom of the left example). This indicates that videos a and b have similar starts, and the intra-video appearance changes are also similar. As a result, the query action is correctly classified as “Pretending to take something from somewhere”. In the right example, we can observe the effectiveness of the timestamp-centered prototypes. By using bipartite matching, \(\varvec{p}_{t,1}^x\) is matched with \(\varvec{p}_{t,2}^y\). Since both of these prototypes encode frames where the hands just tip the objects over, they give high similarities to each other, leading to the correct recognition of the query action as “Tipping something over”. This demonstrates that the timestamp-centered prototypes with bipartite matching can successfully handle temporal variations and shifts in actions, allowing for accurate action recognition even when the actions occur at different temporal positions in the videos.

In Fig. 5, we present additional visualization examples. In the example to the left, timestamp-centered prototypes \(\varvec{p}_{t,2}^a\) and \(\varvec{p}_{t,1}^b\) are matched and a high similarity score (0.78) is assigned. From the figure, we can observe that \(\varvec{p}_{t,1}^b\) has high attention on the ending frames, while \(\varvec{p}_{t,2}^a\) focuses more on the middle frames. The focus of both prototypes captures the moment when the object is being taken out of the container, and thus a high similarity score is given. In the example to the right, our model wrongly classifies the query action of “folding paper” as “unboxing”. This misclassification is due to the high similarity score between \(\varvec{p}_{t,2}^x\) and \(\varvec{p}_{t,2}^z\). Upon manual inspection, we found that in the last half of video z, the recorder was reading a book taken out of the box. Since reading a book contains actions similar to “folding paper”, this similarity led to the incorrect prediction.

Fig. 6

Visualization of self-attention weights of \(\varvec{P}_a\) and \(\varvec{P}_t\) for all samples on the test set of SSv2-Full and Kinetics datasets

Fig. 7

Performance on the SSv2-Full and Kinetics datasets when changing the number of action-centered/timestamp-centered/summarized prototypes

In Fig. 6, we present a statistical analysis of the self-attention weights, specifically focusing on the average response of the first 4 action-centered prototypes and the first 4 timestamp-centered prototypes across all videos in the test set. As a result of the loss functions \(L_{div}, L_{att}\) and the matching strategies, we observe distinct attention patterns for \(\varvec{P}_a\) (top 4 rows) and \(\varvec{P}_t\) (bottom 4 rows): \(\varvec{P}_a\) have a more uniform attention distribution, while \(\varvec{P}_t\) have obvious temporal regions of focus. The diversity of the prototypes ensures a robust representation of the videos, so that the similarity between videos can be better computed during the few-shot learning process.

4.6.3 How Much Does Each Group of Prototype Contribute?

To find the answer, we test our method using different numbers of prototypes (\(m_a\) and \(m_t\)) and show the results in Fig. 7. Our method performs better when \(m_s=1\) than when \(m_s=0\), indicating the effectiveness of the summarized prototype. Across all four sub-graphs, we observe a consistent trend where the performance improves as the number of prototypes increases. This trend demonstrates that the prototypes play a crucial role in representing and recognizing actions in the few-shot setting. However, it is worth noting that after reaching a certain threshold, further increasing the number of prototypes leads to diminishing performance gains due to overfitting on the training data. The optimal combination of \(m_a\) and \(m_t\) varies for each dataset. The best results on both SSv2-Full (52.3) and Kinetics (74.3) are achieved when \(m_t\) is larger than or equal to \(m_a\). Although \(m_a=m_t=8\) is not the optimal setting, we adopt it in Sects. 4.3 and 4.4 since it is the most stable setting across all datasets. A method to automatically choose the number of prototypes is left for future work.

Fig. 8 Class accuracy improvement when our method uses \(m_a=m_t=8\) prototypes compared to: orange bars: \(m_a=16\), \(m_t=0\); blue bars: \(m_a=0\), \(m_t=16\). S is the abbreviation of “something” (Color figure online)

In Fig. 8, we present the class accuracy improvement when our method uses both groups of prototypes compared with using only one group. The orange bars represent the accuracy difference between the \(m_a=m_t=8\) setting (using both groups of prototypes) and the \(m_a=16, m_t=0\) setting (using only action-centered prototypes), i.e., the performance gain from introducing the timestamp-centered prototypes. Conversely, the blue bars denote the accuracy improvement brought by the action-centered prototypes. On the SSv2-Full dataset, the combination of both groups of prototypes provides significant performance improvements for challenging classes such as “pulling S out of S”, “pulling S from left to right”, and “pushing S from right to left”. On the Kinetics dataset, we observe that the timestamp-centered prototypes are more effective. This can be attributed to the dataset’s emphasis on appearance-based features, which can be better captured and compared using the timestamp-centered prototypes.

To elucidate the impact of the summarized prototypes, we conduct an experiment using timestamp-centered prototypes with and without the regularization from the summarized prototypes. To isolate the effect of the summarized prototypes, we deliberately exclude the action-centered prototypes from this configuration. With the summarized prototypes, the overall recognition accuracy rises from 44.9 to 45.2 on SSv2-Full and from 72.5 to 72.9 on Kinetics. This increment, although modest, indicates the positive role played by the summarized prototypes in enhancing model performance.

We also visualize the attention score of the timestamp-centered prototypes on each frame, with and without the regularization from the summarized prototypes, on the Kinetics dataset. As depicted in Fig. 9, the highest attention weight is more pronounced when the regularization from the summarized prototypes is applied. This observation suggests that the summarized prototypes not only improve accuracy but also yield more focused and well-defined attention within the model.

4.7 Analysis of Bipartite Matching

Our method includes three groups of prototypes, among which the timestamp-centered prototypes are matched via bipartite matching. Although we observe a large performance gain brought by \(\varvec{P}_t\) in Table 5 and Fig. 8, the bipartite matching unavoidably produces some incorrectly matched prototype pairs alongside the correct matchings.

However, we found that these incorrect matchings are necessary for training the model. One reason is the use of positional encoding, which implicitly encodes the temporal ordering of frames within each prototype. Since our decoder is based on a transformer architecture, each prototype is generated by attending to all frames in the video; with positional encoding, the generated prototypes implicitly capture the temporal ordering of the entire video but with varying emphasis. For example, a prototype may attend to all frames but place greater emphasis on specific ones. During training, the optimization encourages the correctly matched pairs to produce high similarity scores while keeping the similarity scores of the incorrectly matched pairs low, so that the final similarity score is predominantly influenced by the correct matchings.
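To make the matching step concrete, a minimal sketch of bipartite matching between the timestamp-centered prototypes of a query and a support video is given below. It is a sketch only: we use the Hungarian solver from SciPy and plain cosine similarity, whereas the exact similarity function and score aggregation follow the definitions in the method section.

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def bipartite_match_similarity(protos_q, protos_s):
    # protos_q, protos_s: (m_t, d) timestamp-centered prototypes of two videos.
    # Pairwise cosine similarity between all prototype pairs, shape (m_t, m_t).
    sim = F.normalize(protos_q, dim=-1) @ F.normalize(protos_s, dim=-1).T
    # Hungarian algorithm: maximizing total similarity = minimizing negative cost.
    row, col = linear_sum_assignment(-sim.detach().cpu().numpy())
    matched = sim[torch.as_tensor(row), torch.as_tensor(col)]
    return list(zip(row.tolist(), col.tolist())), matched.mean()

pairs, score = bipartite_match_similarity(torch.rand(8, 256), torch.rand(8, 256))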

Fig. 9
figure 9

The attention weight of each timestamp-centered prototype to each frame with (right) and without (left) the regularization of the summarized prototypes

To test whether filtering out a subset of the matchings still yields good results, we conduct the following experiment with different methods for filtering the matched timestamp-centered prototypes. Specifically, since not all matched prototypes are guaranteed to be correct, we apply the following thresholds for filtering: (1) Best match: only the matched pair with the highest similarity score is used as the overall similarity score between two videos; (2) Above average: only the matched pairs with similarity scores greater than the average similarity score are considered. To allow gradients to flow through the filtered-out prototypes, we use a leaky-ReLU-like scheme to suppress their scores:

$$\begin{aligned} \text {score} = {\left\{ \begin{array}{ll} \text {score} & \text {if matching meets threshold} \\ p \cdot \text {score} & \text {otherwise} \end{array}\right. } \end{aligned}$$
(15)

Here, if \(p=0\), matchings that do not meet the threshold are discarded; if \(0<p<1\), they are suppressed. We equip our model with this filtering method, test different values of p, and show the results in Table 6.
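As an illustration only, the sketch below implements Eq. (15) for a vector of matched similarity scores under both thresholds; the function and variable names are ours and are not part of the released method.

import torch

def filter_matched_scores(scores, p, mode="above_average"):
    # scores: 1D tensor of similarity scores of the matched prototype pairs.
    # p: suppression factor; p = 0 discards filtered scores, 0 < p < 1 suppresses them.
    if mode == "best_match":
        keep = scores == scores.max()        # keep only the highest-scoring pair
    else:
        keep = scores > scores.mean()        # keep pairs above the average score
    # Leaky-ReLU-like suppression (Eq. 15): kept scores are unchanged,
    # the others are scaled by p so gradients can still flow through them.
    return torch.where(keep, scores, p * scores)

print(filter_matched_scores(torch.tensor([0.8, 0.3, 0.5, 0.1]), p=0.5))
# tensor([0.8000, 0.1500, 0.5000, 0.0500])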

From Table 6, we can see that when we suppress (\(p=0.5\)) or discard (\(p=0\)) a subset of the matched prototypes, our method does not perform well. We find that the model even struggles to fit the training set in this case. This supports our claim that the incorrect matchings are essential to the training process of our model.

Table 6 Result comparison of 5-way 1-shot experiments when using different thresholds to filter the matched timestamp-centered prototypes
Table 7 Results comparison of different methods when using different types of features as input

4.7.1 Using Object Features

In our previous method (Huang et al., 2022), we used off-the-shelf object detectors to extract object features and used them as local features. In this work, it is also possible to replace our multi-level features with object features. Table 7 compares the results when the methods are equipped with different features as input. In the table, the single-level feature refers to the feature obtained after global average pooling, the multi-level feature is the feature input proposed in this work, and the “Object” feature indicates input comprising both the feature after global average pooling and the three most confident object features. From the table, it is clear that object features provide information that greatly helps classification. Comparing each method with and without object or multi-level features, we can see that other methods cannot fully exploit the information brought by these features. When multi-level features are added as input, our full model outperforms previous methods that use the same input. When object features are used, the potential of our proposed compound prototype matching scheme is fully realized, improving accuracy by over 10% on the SSv2 dataset and 6% on the Kinetics dataset.
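For clarity, a minimal sketch of how such an “Object” input could be assembled for one frame is given below; the detector interface, the RoIAlign pooling, and the 1/32 spatial scale are illustrative assumptions rather than the exact pipeline of Huang et al. (2022).

import torch
from torchvision.ops import roi_align

def build_object_feature(frame_map, boxes, scores, top_k=3, spatial_scale=1.0 / 32):
    # frame_map: (C, H, W) feature map of one frame.
    # boxes:     (N, 4) detected boxes in image coordinates (x1, y1, x2, y2).
    # scores:    (N,) detection confidences from an off-the-shelf detector.
    gap = frame_map.mean(dim=(1, 2))                       # global-average-pooled feature
    top = scores.topk(min(top_k, scores.numel())).indices  # three most confident objects
    obj = roi_align(frame_map.unsqueeze(0), [boxes[top]],
                    output_size=1, spatial_scale=spatial_scale)
    obj = obj.flatten(1)                                   # (top_k, C) object features
    return torch.cat([gap.unsqueeze(0), obj], dim=0)       # (top_k + 1, C)

feat = build_object_feature(torch.rand(256, 7, 7),
                            torch.tensor([[0., 0., 64., 64.], [32., 32., 96., 96.],
                                          [10., 10., 50., 50.], [5., 5., 20., 20.]]),
                            torch.tensor([0.9, 0.8, 0.7, 0.2]))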

Table 8 Comparison of performance and detailed model specifics including the number of parameters (# Params), FLOPS, and inference time in seconds/iteration between different backbones and methods

The incorporation of object detectors in few-shot action recognition models, while benefiting performance, inevitably increases the model’s parameter count and thus impacts its overall efficiency. This raises an important question: to what extent do these added parameters contribute to the observed performance enhancements? We investigate this by comparing the parameter count across various methods and examining their corresponding performance on the SSv2-Small (SSv2-S), SSv2-Full, and Kinetics datasets.

Table 8 shows detailed specifics of different models with different backbones under the 5-way 1-shot setting. The table reveals that our method, employing the same ResNet-50 backbone, has a higher parameter count (39.8M) compared to other methods, which range between 24 and 30M parameters. Notably, this increase in parameters correlates with a significant improvement in performance. To discern whether the performance gains are predominantly due to the increased number of parameters, we modified our method’s backbone to ResNet-18 and DenseNet-121, which have fewer parameters. This adjustment resulted in a total parameter count comparable to previous methods. Interestingly, even with a similar parameter count, our method with multi-level features continues to surpass the performance of previous works. This observation underscores the efficacy of our proposed prototype matching technique beyond the mere increase in model size. The object detector adds 41.8M parameters, a substantial increment that also brings considerable computation cost in terms of FLOPS. Correspondingly, its utilization yields a significant boost in performance.
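For reference, parameter counts such as those in Table 8 can be reproduced with a few lines of PyTorch; the backbone below is only an example (a torchvision ResNet-50 has roughly 25.6M parameters).

import torchvision

def count_parameters(model):
    # Total number of trainable parameters, reported in millions.
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

print(f"{count_parameters(torchvision.models.resnet50()):.1f}M")   # roughly 25.6M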

To conclude, while the use of object detectors in few-shot action recognition models indeed brings about considerable improvements in performance, it does so at the cost of increased model complexity and reduced efficiency. This trade-off highlights the need for a balanced approach in model design, where the benefits of enhanced performance are carefully weighed against the drawbacks of increased computational demands.

4.8 Impact of Changing Other Parameters

In this section, we explore the impact of changing other parameters in our model.

4.8.1 Impact of l-Level Features

Our method incorporates an average pooling operation with a default of \(l=2\) levels of pooling, using pooling window sizes of (2, 3). In this analysis, we investigate the impact of varying the value of l on the overall performance. Table 9 shows the performance comparison when varying l from 1 to 3 while also adjusting the window size of pooling. We can see from the table that the best results on the SSv2 dataset are achieved when \(l=3\), while on the Kinetics dataset \(l=2\) performs better. Taking into consideration the trade-off between performance and computational cost, we select \(l=2\) with window sizes (2, 3) as the default choice for our experiments.
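As a minimal sketch (assuming the pooled local features are later aggregated into the multi-level input described in the method section; names here are illustrative), pooling one frame feature map with window sizes (2, 3) could look like:

import torch
import torch.nn.functional as F

def multi_level_pool(frame_map, window_sizes=(2, 3)):
    # frame_map: (C, H, W) feature map of a single frame.
    # Returns (N, C) local features, where N = 2*2 + 3*3 = 13 for the default windows.
    feats = []
    for w in window_sizes:
        pooled = F.adaptive_avg_pool2d(frame_map.unsqueeze(0), w)   # (1, C, w, w)
        feats.append(pooled.flatten(2).squeeze(0).T)                # (w*w, C)
    return torch.cat(feats, dim=0)

local_feats = multi_level_pool(torch.rand(256, 7, 7))               # shape (13, 256)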

Table 9 Result comparison when changing l and the window sizes of the pooling operation
Fig. 10 Comparison of different backbones applied to three methods: TRX (Perrett et al., 2021), Ours- (Huang et al., 2022), and Ours

4.8.2 Changing Backbones

Our proposed method is compatible with feature extractors of various capacities. Conventionally, previous methods all use ResNet-50 as the backbone for a fair comparison, so the impact of different backbones on performance is still under-explored. In Fig. 10, we show the performance comparison of three methods when using ResNet-50 or Swin-Transformer (Liu et al., 2021) as the backbone. The results demonstrate that the Swin-Transformer significantly increases performance on the Kinetics dataset but has only a marginal influence on the SSv2 dataset. We believe this is because the Swin-Transformer has a stronger ability to represent image content, so the Kinetics dataset, which focuses more on appearance than motion, enjoys a larger performance gain. In addition, our method consistently outperforms the competitors, indicating its ability to work with different backbones.
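A minimal sketch of such a backbone swap is shown below, using torchvision models purely as stand-ins (the exact Swin-Transformer configuration and pretrained weights are not specified here); each backbone is reduced to a frame-level feature extractor by removing its classification head.

import torch
import torch.nn as nn
import torchvision

def build_backbone(name="resnet50"):
    # Return a frame-level feature extractor with its classifier removed.
    if name == "resnet50":
        model = torchvision.models.resnet50()
        model.fc = nn.Identity()           # outputs 2048-d frame features
    else:
        model = torchvision.models.swin_t()
        model.head = nn.Identity()         # outputs 768-d frame features
    return model

backbone = build_backbone("resnet50")
frames = torch.rand(8, 3, 224, 224)        # 8 sampled frames of one video
features = backbone(frames)                # shape (8, 2048)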

4.8.3 Impact of Positional Encoding (PE)

In our method, we employ positional encoding (PE) to inject temporal and spatial positional information into the feature embeddings: for \(\varvec{F}_{a}\) we use 1D PE to encode the temporal location of each frame, and for \(\varvec{F}_{p}\) we apply 3D PE that incorporates both spatial and temporal information. We conduct an ablation study on the SSv2-Full and Kinetics datasets to test the impact of positional encoding. We also add a comparison with a reduced positional encoding PE\(^\sharp \), where 1D PE is applied to both \(\varvec{F}_{a}\) and \(\varvec{F}_{p}\). Results can be found in Table 10. The improvement is relatively marginal, but it still shows the benefit brought by the joint 1D and 3D positional encoding.
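For illustration, a sketch of the two encodings is given below; the 1D case is the standard sinusoidal encoding, while the 3D case uses a common per-axis factorization (splitting channels among the temporal and two spatial axes), which is an assumption and not necessarily the exact formulation used in our implementation.

import math
import torch

def sinusoidal_pe(length, dim):
    # Standard 1D sinusoidal positional encoding, shape (length, dim).
    pos = torch.arange(length).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div[: dim // 2])
    return pe

def pe_3d(t, h, w, dim):
    # 3D PE by factorization: channels are split among the t, h and w axes.
    d = dim // 3
    pe_t = sinusoidal_pe(t, d)[:, None, None, :].expand(t, h, w, d)
    pe_h = sinusoidal_pe(h, d)[None, :, None, :].expand(t, h, w, d)
    pe_w = sinusoidal_pe(w, dim - 2 * d)[None, None, :, :].expand(t, h, w, dim - 2 * d)
    return torch.cat([pe_t, pe_h, pe_w], dim=-1)        # shape (t, h, w, dim)

pe_a = sinusoidal_pe(8, 256)        # temporal 1D PE for 8 frames
pe_p = pe_3d(8, 7, 7, 256)          # spatiotemporal 3D PE for an 8 x 7 x 7 grid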

Table 10 Result comparison of 5-way 1-shot experiments with different positional encoding methods
Table 11 Results of 5-way 1-shot (top) and 5-shot (bottom) experiments on 6 dataset splits

4.8.4 Results with Additional Loss on Semantic Labels

Recently, some works (Wang et al., 2022; Xing et al., 2023a; Wang et al., 2023a) have used the semantic class labels of the training set as additional supervision during training. In this setting, an extra classification layer with a cross-entropy loss is added to supervise the learning of the backbone encoder for better discriminating the classes in the support set. While this may not be the standard few-shot recognition setting, for a fair comparison we also implement it and compare our method with previous works in Table 11. From the table, we can still see the superiority of our proposed method over the state-of-the-art.
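A minimal sketch of this auxiliary supervision is given below; the feature dimension, the number of training classes, and the loss weight of 1.0 are illustrative assumptions, and the auxiliary head would be discarded at test time.

import torch
import torch.nn as nn

class SemanticHead(nn.Module):
    # Auxiliary classifier over the training-set classes (used only during training).
    def __init__(self, feat_dim=2048, num_train_classes=64):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_train_classes)

    def forward(self, video_feat):
        return self.fc(video_feat)

head = SemanticHead()
ce = nn.CrossEntropyLoss()

video_feat = torch.rand(5, 2048, requires_grad=True)  # e.g. averaged frame features of 5 videos
labels = torch.randint(0, 64, (5,))                   # their semantic class labels
episode_loss = torch.tensor(0.0)                      # placeholder for the few-shot matching loss
loss = episode_loss + 1.0 * ce(head(video_feat), labels)
loss.backward()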

5 Conclusion

In this work, we introduce a novel method for few-shot action recognition that generates action-centered, timestamp-centered, and summarized prototypes and compares video similarity based on these prototypes. When generating the prototypes, we encode spatiotemporal multi-level relations to handle actions that involve motion in different parts of the video. The three groups of prototypes are encouraged to capture different aspects of the input through different loss functions and matching strategies. Our method achieves state-of-the-art performance on multiple datasets. In future work, we will explore a more flexible prototype matching strategy that can avoid mismatches in the bipartite matching. How to better initialize the three types of prototype tokens is also a promising direction.