Meet JEANIE: a Similarity Measure for 3D Skeleton Sequences via Temporal-Viewpoint Alignment

Video sequences exhibit significant nuisance variations (undesired effects) in the speed of actions, temporal locations, and subjects' poses, leading to temporal-viewpoint misalignment when comparing two sets of frames or evaluating the similarity of two sequences. Thus, we propose Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE) for sequence pairs. In particular, we focus on 3D skeleton sequences whose camera and subjects' poses can be easily manipulated in 3D. We evaluate JEANIE on skeletal Few-shot Action Recognition (FSAR), where matching temporal blocks (temporal chunks that make up a sequence) of support-query sequence pairs well (by factoring out nuisance variations) is essential due to limited samples of novel classes. Given a query sequence, we create several of its views by simulating several camera locations. For a support sequence, we match it with the view-simulated query sequences, as in the popular Dynamic Time Warping (DTW). Specifically, each support temporal block can be matched to the query temporal block with the same or adjacent (next) temporal index, and adjacent camera views, to achieve joint local temporal-viewpoint warping. JEANIE selects the smallest distance among matching paths with different temporal-viewpoint warping patterns, an advantage over DTW, which performs only temporal alignment. We also propose an unsupervised FSAR model akin to clustering of sequences with JEANIE as a distance measure. JEANIE achieves state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II on supervised and unsupervised FSAR, and their meta-learning inspired fusion.


Introduction
Action recognition is a key topic in computer vision, with applications in video surveillance [105,109,120], human-computer interaction, sport analysis and robotics. Many pipelines [99,24,23,9,50,61,108,49,114,117,106,83,145,120,58] perform (action) classification given a large amount of labeled training data. However, manually labeling videos or 3D skeleton sequences is laborious, and such pipelines need to be retrained or finetuned for new class concepts. Popular action recognition networks such as the two-stream neural network [24,23,124] and the 3D Convolutional Neural Network (3D CNN) [99,9] aggregate frame-wise and temporal block representations, respectively. However, such networks are trained on large-scale datasets such as Kinetics [9,116,110,118] under a fixed set of training classes.
Thus, there exists growing interest in devising effective Few-shot Learning (FSL) models for action recognition, termed Few-shot Action Recognition (FSAR), that rapidly adapt to novel classes given few training samples [77,129,31,19,138,7,112]. FSAR models are scarce due to the volumetric nature of videos and large intra-class variations.
Video sequences may be captured under varying camera poses where subjects may follow different trajectories, resulting in subjects' pose variations. Variations of action speed, location, and motion dynamics are also common. Yet, FSAR has to learn and infer the similarity between support-query sequence pairs under the limited number of samples of novel classes. Thus, a good measure of similarity between support-query sequence pairs has to factor out the above variations. To this end, we propose an FSAR model that learns on skeleton-based 3D body joints via Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE). We focus on 3D skeleton sequences as the camera/subject's pose can be easily altered in 3D by the use of projective camera geometry. JEANIE achieves good matching of queries with support sequences by simultaneously modeling the optimal (i) temporal and (ii) viewpoint alignments. To this end, we build on soft-DTW [16], a differentiable variant of Dynamic Time Warping (DTW) [15] (Fig. 5 gives an overview of how DTW differs from the Euclidean distance). Given a query sequence, we create several of its views by simulating several camera locations. For a support sequence, we can match it with the view-simulated query sequences as in DTW. Specifically, with the goal of computing the optimal distance, each support temporal block can be matched to the query temporal block with the same or a neighbouring temporal block index, to perform a local time warping step. However, we simultaneously also let each support temporal block match across adjacent camera views of the query temporal block to achieve camera viewpoint warping. Multiple alignment patterns of query and support blocks result in multiple paths across the temporal and viewpoint modes. Thus, each path represents a matching plan describing between which support-query block pairs the feature distances are evaluated and aggregated. By the use of the soft-minimum, the path with the minimum aggregated distance is selected as the output of JEANIE. Thus, while DTW provides the optimal temporal alignment of support-query sequence pairs, JEANIE simultaneously provides the optimal joint temporal-viewpoint alignment.
To facilitate the viewpoint alignment in JEANIE, we use simple 3D geometric operations. Specifically, we obtain skeletons under several viewpoints by rotating skeletons (zero-centered by hip) via Euler angles [1], or by generating skeleton locations given simulated camera positions, according to the algebra of stereo projections [2].
We note that view-adaptive models for action recognition do exist. The View Adaptive Recurrent Neural Network [139,140] is a classification model equipped with a view-adaptive subnetwork that contains rotation/translation switches within its RNN backbone and the main LSTM-based network. The Temporal Segment Network [119] models long-range temporal structures with a new segment-based sampling and aggregation module. However, such pipelines require a large number of training samples with varying viewpoints and temporal shifts to learn a robust model. Their limitations become evident when a network trained under a fixed set of action classes has to be adapted to samples of novel classes. Our JEANIE does not suffer from such a limitation.

Fig. 3: Our 3D skeleton-based FSAR with JEANIE. Frames from a query sequence and a support sequence are split into short-term temporal blocks X_1, ..., X_τ and X′_1, ..., X′_τ′ of length M given stride S. Subsequently, we generate (i) multiple rotations by (∆θ_x, ∆θ_y) of each query skeleton by Euler angles (baseline approach) or (ii) simulated camera views (gray cameras) by camera shifts (∆θ_az, ∆θ_alt) w.r.t. the assumed average camera location (black camera). We pass all skeletons via the Encoding Network (with an optional transformer) to obtain feature tensors Ψ and Ψ′, which are directed to JEANIE. We note that the temporal-viewpoint alignment takes place in 4D space (we show a 3D case with three views: −30°, 0°, 30°). Temporally, JEANIE starts from t = (1, 1) and finishes at t = (τ, τ′) (as in DTW). Viewpoint-wise, JEANIE starts from every possible camera shift ∆θ ∈ {−30°, 0°, 30°} (we do not know the true correct pose) and finishes at one of the possible camera shifts. At each step, the path may move by no more than (±∆θ_az, ±∆θ_alt) to prevent erroneous alignments. Finally, SoftMin picks up the smallest distance.
Figure 1 is a simplified overview of our pipeline, which can serve as a template for baseline FSAR. It shows that our pipeline consists of an MLP which takes neighboring frames forming a temporal block. Each sequence consists of several such temporal blocks. As shown in Figure 2, we sample desired Euler rotations or simulated camera viewpoints, generate multiple skeleton views, and pass them to the MLP to get block-wise feature maps fed into a Graph Neural Network (GNN) [42,94,127,43,125,154,150,149]. We mainly use the linear S²GC [154,158,156,115], with an optional transformer [17], and an FC layer to obtain block feature vectors passed to JEANIE, whose output distance measurements flow into our similarity classifier. Figure 3 is a detailed overview of our supervised FSAR pipeline.
Note that JEANIE can be thought of as a kernel in a Reproducing Kernel Hilbert Space (RKHS) [90] based on Optimal Transport [102] with a specific temporal-viewpoint transportation plan. As kernels capture the similarity of sample pairs instead of modeling class labels, they are a natural choice for FSL and FSAR problems.
In this paper, we extend our supervised FSAR model [111] by introducing an unsupervised FSAR model, and a fusion of both supervised and unsupervised models. Our rationale for an unsupervised FSAR extension is to demonstrate that the invariance properties of JEANIE (dealing with temporal and viewpoint variations) help naturally match sequences of the same class without the use of additional knowledge (class labels). Such a setting demonstrates that JEANIE is able to limit intra-class variations (temporal and viewpoint variations), facilitating unsupervised matching of sequences.
For unsupervised FSAR, JEANIE is used as a distance measure in the feature reconstruction term of the dictionary learning and feature coding steps. Features of the temporal blocks are projected into such a dictionary space, and the projection codes representing sequences are used as a similarity measure between support-query sequences. This idea is similar to clustering training sequences into k-means clusters [14] to form a dictionary. The assignments of test query sequences to such a dictionary can then reveal their class labels based on labeled test support sequences falling into the same cluster. However, even with JEANIE used as a distance measure, one-hot assignments resulting from k-means are suboptimal. Thus, we investigate more recent soft assignment [6,27,46,64] and sparse coding approaches [54,132].
Finally, we also introduce a simple fusion of supervised and unsupervised FSAR, by alignment of supervised and unsupervised FSAR features or by MAML-inspired [26] fusion of unsupervised and supervised FSAR losses in the so-called inner and outer loop, respectively. Below are our contributions:
i. We propose JEANIE, which performs the joint alignment of temporal blocks and simulated camera viewpoints of 3D skeletons between support-query sequences to select the optimal alignment path, realizing joint temporal (time) and viewpoint warping. We evaluate JEANIE on skeletal few-shot action recognition, where correctly matching support and query sequence pairs (by factoring out nuisance variations) is essential due to limited samples representing novel classes.
ii. To simulate different camera locations for 3D skeleton sequences, we consider rotating them (1) by Euler angles within a specified range along the axes, or (2) towards simulated camera locations based on the algebra of stereo projection.
iii. We propose unsupervised FSAR where JEANIE is used as a distance measure in the feature reconstruction term of the dictionary learning and coding steps (we investigate several such coders). We use projection codes to represent sequences. Moreover, we also introduce an effective fusion of both supervised and unsupervised FSAR models, by an unsupervised-supervised feature alignment term or by MAML-inspired fusion of unsupervised and supervised FSAR losses.
iv. As minor contributions, we investigate different GNN backbones (combined with an optional transformer), as well as the optimal temporal size and stride for temporal blocks encoded by a simple 3-layer MLP unit before forwarding them to the GNN. We also propose a simple similarity-based loss encouraging the alignment of within-class sequences and preventing the alignment of between-class sequences.

Related Works
Below, we describe 3D skeleton-based AR, FSAR approaches, and Graph Neural Networks.
However, such models rely on large-scale datasets to train large numbers of parameters, and cannot be adapted with ease to novel class concepts whereas FSAR can.
In contrast, we use temporal blocks of skeleton sequences encoded by GNNs under multiple simulated camera viewpoints to jointly apply temporal and viewpoint alignment of query-support sequences to factor out nuisance variability.
Graph Neural Networks. GNNs modified to act on the specific structure of 3D skeletal data are very popular in action recognition, as detailed in "Action recognition (3D skeletons)" at the beginning of Section 2. In this paper, we leverage standard GNNs due to their good ability to represent graph-structured data. GCN [42] applies graph convolution in the spectral domain, and enjoys depth-efficiency when stacking multiple layers due to non-linearities. However, depth-efficiency extends the runtime due to backpropagation through consecutive layers. In contrast, a recent family of so-called spectral filters do not require depth-efficiency but apply filters based on heat diffusion on the graph adjacency matrix. As a result, these are fast linear models, as the learnable weights act on filtered node representations. Among such linear models, we investigate SGC [127], APPNP [43] and S²GC [154] for the backbone, followed by an optional transformer, and an FC layer.
Fig. 5: DTW vs. the Euclidean distance. m(t) and n(t) parameterize the query and support indexes; the permitted steps in the alignment graph are ↓, ↘ and →. We expect d_DTW ≤ d_Euclid.
In this work, we apply a simple optional transformer block with a few layers following the GNN to better capture block-level dependencies of 3D human body joints.
Multi-view action recognition. Multi-modal sensors enable multi-view action recognition [108,139]. A Generative Multi-View Action Recognition framework [107] integrates RGB and depth data via a View Correlation Discovery Network, while Synthetic Humans [101] generates synthetic training data to improve generalization to unseen viewpoints. Some works use multiple views of the subject [86,62,140,107] to overcome viewpoint variations in action recognition. Recently, a supervised contrastive learning framework [85] for multi-view action recognition was introduced.
In contrast, our JEANIE jointly performs the temporal and simulated-viewpoint alignment in an end-to-end FSAR setting. This is a novel paradigm based on improving the notion of similarity between sequences of a support-query pair rather than learning class concepts.

Approach
To learn similarity and dissimilarity between pairs of sequences of 3D body joints representing query and support samples from episodes, our goal is to find a joint viewpoint-temporal alignment of query and support, and to minimize or maximize the matching distance d_JEANIE (end-to-end setting) for same or different support-query labels, respectively. Fig. 4 (top) shows that sometimes matching of query and support may be as easy as rotating one trajectory onto another, in order to achieve viewpoint invariance. A viewpoint-invariant distance [33] can be defined as:

d_inv(Ψ, Ψ′) = inf_{γ ∈ T} d(γ(Ψ), Ψ′),   (1)

where T is a set of transformations required to achieve viewpoint invariance, d(·, ·) is some base distance, e.g., the Euclidean distance, and Ψ and Ψ′ are features describing the query and support pair of sequences. Typically, T may include 3D rotations to rotate one trajectory onto the other. However, a global viewpoint alignment of two sequences is suboptimal. Trajectories are unlikely to be straight 2D lines in the 3D space, so one may not be able to rotate the query trajectory to align with the support trajectory. Fig. 4 (bottom) shows that the subjects' poses locally follow complicated non-linear paths. Thus, we propose JEANIE, which aligns and warps query/support sequences based on the feature similarity. One can think of JEANIE as performing Eq. (1) with T containing all possible combinations of local time-warping augmentations of sequences and camera pose augmentations for each frame (or temporal block). The JEANIE unit in Fig. 3 realizes such a strategy. Figure 6 (discussed later in the text) shows one step of the temporal-viewpoint computations of JEANIE in search of the optimal temporal-viewpoint alignment path between query and support sequences. The soft-minimum across all such possible alignment paths can be equivalently written as an infimum over a set of specific transformations in Eq. (1). Below, we detail our pipeline, and explain the proposed JEANIE, the Encoding Network (EN), feature coding and dictionary learning, and our loss function. Firstly, we present our notations.

Notations. I_K stands for the index set {1, 2, ..., K}. Concatenation of α_i is denoted by [α_i]_{i∈I_I}, whereas X_{:,i} means we extract/access column i of matrix X. Calligraphic fonts denote tensors (e.g., D), capitalized bold symbols are matrices (e.g., D), lowercase bold symbols are vectors (e.g., ψ), and regular fonts denote scalars.
Prerequisites. Below we refer to prerequisites used in the subsequent sections. Appendix A explains how Euler angles and stereo projections are used in simulating different skeleton viewpoints. Appendix B explains several GNN approaches that we use in our Encoding Network. Appendix C explains several feature coding and dictionary learning strategies which we use for unsupervised FSAR.

Encoding Network (EN)
We start by generating K×K′ Euler rotations or K×K′ simulated camera views (moved gradually from the estimated camera location) of the query skeletons. Our EN contains a simple 3-layer MLP unit (FC, ReLU, FC, ReLU, Dropout, FC), a GNN, an optional transformer [17], and an FC layer. The MLP unit takes M neighboring frames, each with J 3D skeleton body joints, forming one temporal block X ∈ R^{3×J×M}, where 3 indicates the 3D Cartesian coordinates. In total, depending on the stride S, we obtain some τ temporal blocks which capture short temporal dependencies, whereas long temporal dependencies are modeled with our JEANIE. Each temporal block is encoded by the MLP into a d×J dimensional feature map,

X̂ = MLP(X; F_MLP),

and we obtain K×K′×τ query and τ′ support feature maps, each of size J×d. Each map is forwarded to a GNN. For S²GC [154] (the default GNN in our work) with L layers, we have

Φ = (1/L) Σ_{l=1}^{L} ((1−α) S^l X̂ + α X̂),

where S is the normalized adjacency matrix capturing the connectivity of body joints (it acts on the joint mode of X̂), whereas 0 ≤ α ≤ 1 controls the self-importance of each body joint. Appendix B describes several GNN variants we experimented with: GCN [42], SGC [127], APPNP [43] and S²GC [154].
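To make the block-level processing concrete, below is a minimal NumPy sketch of splitting a skeleton sequence into temporal blocks and building the K×K′×τ grid of view-simulated query features. The helper names (rotate, encode) and default sizes are hypothetical placeholders standing in for the Euler/stereo view simulation and the MLP+GNN block encoder; this is not the authors' implementation.

```python
import numpy as np

def split_into_blocks(seq, M=8, stride=5):
    """Split a (T, J, 3) skeleton sequence into temporal blocks of M frames (possibly overlapping)."""
    T = seq.shape[0]
    starts = range(0, max(T - M + 1, 1), stride)
    return np.stack([seq[s:s + M] for s in starts])              # (tau, M, J, 3)

def encode_query_views(seq, angles_x, angles_y, rotate, encode):
    """Build the K x K' x tau grid of view-simulated query block features.
    `rotate(seq, ax, ay)` and `encode(block)` are placeholders for the Euler/stereo
    view simulation and the MLP+GNN block encoder, respectively."""
    feats = []
    for ax in angles_x:                                          # K viewpoint steps
        row = []
        for ay in angles_y:                                      # K' viewpoint steps
            blocks = split_into_blocks(rotate(seq, ax, ay))
            row.append(np.stack([encode(b) for b in blocks]))    # (tau, d')
        feats.append(np.stack(row))
    return np.stack(feats)                                       # (K, K', tau, d')
```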
Optionally, a transformer (described below in "Transformer Encoder") may be used. Finally, an FC layer returns Ψ ∈ R^{d′×K×K′×τ} query feature maps and Ψ′ ∈ R^{d′×τ′} support feature maps. The feature maps are passed to JEANIE, whose output is passed into the similarity classifier. The whole Encoding Network is summarized as follows. For M query and M support frames per block, X ∈ R^{3×J×M} and X′ ∈ R^{3×J×M}, and the query and support feature maps are obtained as Ψ = EN(X; F) and Ψ′ = EN(X′; F), where F is the set of parameters of the EN (including the optional transformer).
Transformer Encoder. The vision transformer [17] consists of alternating layers of Multi-Head Self-Attention (MHSA) and a feed-forward MLP (2 FC layers with a GELU non-linearity intertwined). LayerNorm (LN) is applied before every block, and residual connections after every block. If the transformer is used, each feature matrix X ∈ R^{J×d} per temporal block is encoded by a GNN into X̂ ∈ R^{J×d} and then passed to the transformer. Similarly to the standard transformer, we prepend a learnable vector y_token ∈ R^{1×d} to the sequence of block features X̂ obtained from the GNN, and we also add the positional embeddings E_pos ∈ R^{(1+J)×d} based on the standard sine and cosine functions, so that the token y_token and each body joint enjoy their own unique positional encoding. One can think of our GNN block as replacing the tokenizer (linear projection layer) of a standard transformer. Compared to the use of an FC layer as the linear projection, our GNN tokenizer in Eq. (5) enjoys (i) better embeddings of human body joints based on the graph structure and (ii) no learnable parameters. From the tokenizer, we obtain Z_0 ∈ R^{(1+J)×d},

Z_0 = [y_token; X̂] + E_pos,

and feed it into the following transformer backbone:

Z′_l = MHSA(LN(Z_{l−1})) + Z_{l−1},  l = 1, ..., L_tr,
Z_l = MLP(LN(Z′_l)) + Z′_l,  l = 1, ..., L_tr,
z = LN(Z_{L_tr})_{1,:},

where z is the first d-dimensional row vector extracted from the output matrix Z_{L_tr} ∈ R^{(J+1)×d}, and L_tr controls the depth of the transformer (the number of layers), whereas Eq. (9) becomes the equivalent of Eq. (4) with the transformer.
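As an illustration of the GNN-tokenizer idea, here is a hedged PyTorch sketch: block features from the GNN are prepended with a learnable token, sinusoidal positional embeddings are added, and a small transformer encoder summarizes the block. It uses PyTorch's stock encoder layer (post-LN) rather than the exact pre-LN arrangement described above, and the joint count J = 25 and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BlockTransformer(nn.Module):
    """Transformer over J body-joint tokens produced by the GNN (a sketch, not the authors' code)."""
    def __init__(self, d=64, depth=6, heads=8, J=25):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, d))                 # learnable y_token
        self.register_buffer('pos', self._sincos(1 + J, d))             # sinusoidal E_pos
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=2 * d,
                                           activation='gelu', batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    @staticmethod
    def _sincos(n, d):
        pos = torch.arange(n).unsqueeze(1).float()
        i = torch.arange(0, d, 2).float()
        pe = torch.zeros(n, d)
        pe[:, 0::2] = torch.sin(pos / 10000 ** (i / d))
        pe[:, 1::2] = torch.cos(pos / 10000 ** (i / d))
        return pe

    def forward(self, x):                    # x: (B, J, d) block features from the GNN
        tok = self.token.expand(x.size(0), -1, -1)
        z = torch.cat([tok, x], dim=1) + self.pos
        z = self.encoder(z)
        return z[:, 0]                       # first row: token summarizing the block
```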

JEANIE
Prior to explaining the details of the JEANIE measure, we briefly explain soft-DTW.

Soft-DTW [15,16]. Dynamic Time Warping can be seen as a specialized "metric" with a matching transportation plan acting on the temporal mode of sequences. Soft-DTW is defined as

d_DTW(Ψ, Ψ′) = SoftMin_γ_{A ∈ A_{τ,τ′}} ⟨A, D(Ψ, Ψ′)⟩,  where  SoftMin_γ(α) = −γ log Σ_i exp(−α_i/γ).

The binary A ∈ A_{τ,τ′} encodes a path within the transportation plan A_{τ,τ′}, which depends on the lengths τ and τ′ of the sequences, and D ∈ R^{τ×τ′} is the matrix of distances, evaluated for τ×τ′ frames (or temporal blocks) according to some base distance d_base(·, ·), e.g., the Euclidean distance.
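For reference, a minimal NumPy sketch of the soft-DTW recursion with the soft-minimum operator is given below (γ is the smoothing parameter); it is a generic illustration under stated assumptions, not the authors' implementation.

```python
import numpy as np

def softmin(values, gamma=0.01):
    """Soft minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = np.asarray(values) / -gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))

def soft_dtw(D, gamma=0.01):
    """D[t, tp] holds base distances between query block t and support block tp."""
    tau, tau_p = D.shape
    R = np.full((tau + 1, tau_p + 1), np.inf)
    R[0, 0] = 0.0
    for t in range(1, tau + 1):
        for tp in range(1, tau_p + 1):
            R[t, tp] = D[t - 1, tp - 1] + softmin(
                [R[t - 1, tp - 1], R[t - 1, tp], R[t, tp - 1]], gamma)
    return R[tau, tau_p]
```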
In what follows, we make use of the principles of soft-DTW, i.e., the property of time warping. However, we design a joint alignment between temporal skeleton sequences and simulated skeleton viewpoints, which means we achieve joint time-viewpoint warping (a novel idea, to our knowledge not explored before).

JEANIE.
Matching query-support pairs requires temporal alignment due to potential offsets in the locations of discriminative parts of actions, and due to potentially different dynamics/speed of the actions taking place. The same concerns the direction of the actor's pose, i.e., consider the pose trajectory w.r.t. the camera. Thus, the JEANIE measure is equipped with an extended transportation plan A′ ≡ A_{τ,τ′,K,K′}, where, apart from the temporal block counts τ and τ′, for query sequences we have η_az possible left and η_az right steps from the initial camera azimuth, and η_alt up and η_alt down steps from the initial camera altitude. Thus, K = 2η_az+1 and K′ = 2η_alt+1. For the variant with Euler angles, K and K′ are defined analogously over the rotation steps along the x and y axes. The JEANIE formulation is given as

d_JEANIE(Ψ, Ψ′) = SoftMin_γ_{A ∈ A′} ⟨A, D(Ψ, Ψ′)⟩,

where the tensor D ∈ R^{K×K′×τ×τ′} contains distances evaluated between all possible temporal blocks (over all query viewpoints). Figure 6 illustrates one step of JEANIE. Suppose we are given a set of viewing angle shifts. For the current node (t, t′, n) we evaluate, we have to aggregate its base distance with the smallest aggregated distance of its predecessor nodes. The "1-max shift" means that the predecessor node must be a direct neighbor of the current node (imagine that dots on a 3D grid are nodes connected by links). Thus, for the 1-max shift, at location (t, t′, n), we extract the node's base distance and add it to the minimum of the aggregated distances at the 9 predecessor nodes shown. We store that aggregated distance at (t, t′, n), and we move to the next node. Note that for viewpoint index n, we look up the (n−1, n, n+1) viewpoint neighbors; Algorithm 1 (Joint tEmporal and cAmera viewpoiNt alIgnmEnt, JEANIE) summarizes the procedure.
The extension to the ι-max shift is straightforward. The importance of a low value of the ι-max shift, e.g., ι = 1, is that a low ι promotes the so-called smoothness of the alignment. That is, while time or viewpoint may be warped, they are not warped abruptly (e.g., the subject's pose is not allowed to suddenly rotate by 90° in one step and then rotate back by −90°). This smoothness is the key to preventing greedy matching that would result in an overoptimistic distance between two sequences. Algorithm 1 illustrates JEANIE. For brevity, let us tackle the camera viewpoint alignment along the azimuth only, e.g., for shifting steps −η, ..., η, each of size ∆θ_az. The maximum viewpoint change from block to block is the ι-max shift (smoothness). As we have no way of knowing the initial optimal camera shift, we initialize all possible origins of shifts in the accumulator, r_{n,1,1} = d_base(ψ_{n,1}, ψ′_1) for all n ∈ {−η, ..., η}. Subsequently, the steps related to soft-DTW (temporal-viewpoint matching) take place. Finally, we choose the path with the smallest distance over all possible viewpoint ends by selecting the soft-minimum over [r_{n,τ,τ′}]_{n∈{−η,...,η}}. Notice that elements of the accumulator tensor R ∈ R^{(2η+1)×τ×τ′} are accessed by writing r_{n,t,t′}. Moreover, whenever an index n−i, t−j or t′−k falls outside of the valid range, the corresponding accumulator entry is treated as infinite.

Free Viewpoint Matching (FVM). To ascertain whether JEANIE is better than performing the temporal and simulated viewpoint alignments separately, we introduce an important and plausible baseline called Free Viewpoint Matching. FVM, for every step of DTW, seeks the best local viewpoint alignment, thus realizing a non-smooth temporal-viewpoint path, in contrast to JEANIE. To this end, we apply soft-DTW with the base distance replaced by

d_FVM(ψ_t, ψ′_t′) = min_{n ∈ {−η,...,η}} d_base(ψ_{n,t}, ψ′_t′),

where ψ_{n,t} and ψ′_t′ are the query (under viewpoint shift n) and support feature maps. We slightly abuse the notation by writing d_FVM(ψ_t, ψ′_t′), as we minimize over the viewpoint indexes inside of the above distance. Thus, we calculate the distance matrix D_FVM ∈ R^{τ×τ′} and run soft-DTW on it. Figure 7 shows the comparison between soft-DTW (view-wise), FVM and our JEANIE. FVM is a greedy matching method which leads to a complex zigzag path in 3D space (we illustrate the camera viewpoint in a single mode, e.g., the azimuth for ψ_{n,t}, and no viewpoint mode for ψ′_t′). Although FVM is able to produce a path with a smaller aggregated distance than soft-DTW and JEANIE, it suffers from obvious limitations: (i) it is unreasonable for poses in a given sequence to match under extreme sudden changes of viewpoints; (ii) even if two sequences are from two different classes, FVM still yields the smallest distance (decreased inter-class variance).
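Continuing the sketch above, the JEANIE recursion adds a viewpoint index to the soft-DTW accumulator and restricts viewpoint changes to the ι-max shift; a single viewpoint mode (e.g., azimuth) is shown, as in Algorithm 1. This is an illustrative re-implementation under stated assumptions, reusing softmin from the previous sketch.

```python
import numpy as np

def jeanie(D, gamma=0.01, iota=1):
    """D[n, t, tp] holds d_base between query block t under viewpoint shift n and support block tp.
    Single viewpoint mode (e.g., azimuth); K = 2*eta + 1 viewpoint shifts."""
    K, tau, tau_p = D.shape
    R = np.full((K, tau, tau_p), np.inf)
    R[:, 0, 0] = D[:, 0, 0]                              # every viewpoint shift is a valid origin
    for t in range(tau):
        for tp in range(tau_p):
            if t == 0 and tp == 0:
                continue
            for n in range(K):
                preds = []
                for dn in range(-iota, iota + 1):        # viewpoint may change by at most iota
                    for dt, dtp in [(1, 1), (1, 0), (0, 1)]:   # temporal predecessors as in DTW
                        i, j, k = n - dn, t - dt, tp - dtp
                        if 0 <= i < K and j >= 0 and k >= 0:
                            preds.append(R[i, j, k])
                if preds:
                    R[n, t, tp] = D[n, t, tp] + softmin(preds, gamma)
    return softmin(R[:, -1, -1], gamma)                  # soft-minimum over viewpoint ends
```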

Loss Function for Supervised FSAR
For the N-way Z-shot problem, we have one query feature map and N×Z support feature maps per episode. We form a mini-batch containing B episodes. Thus, we have query feature maps {Ψ_b}_{b∈I_B} and support feature maps {Ψ′_{b,n,z}}_{b∈I_B, n∈I_N, z∈I_Z}. Moreover, Ψ_b and Ψ′_{b,1,:} share the same class, one of the N classes drawn per episode, forming the subset C‡ of training classes; the selection of C‡ per episode is random. For the N-way Z-shot protocol, we minimize a similarity-based loss over the JEANIE distances, which encourages small distances for the same-class support-query pairs and large distances for the different-class pairs of each episode.
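The exact episodic loss (Eqs. (14) and (15)) is not reproduced here; the following is only a generic, hypothetical hinge-style sketch of a similarity-based objective that pulls same-class JEANIE distances down and pushes different-class distances above a margin.

```python
import torch

def episodic_loss(d_pos, d_neg, margin=1.0):
    """d_pos: JEANIE distances of the query to the Z same-class support sequences, shape (Z,).
    d_neg: distances to the (N-1)*Z other-class supports. A hinge-style sketch only;
    the exact formulation of Eqs. (14)-(15) in the paper may differ."""
    return d_pos.mean() + torch.relu(margin - d_neg).mean()
```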

Feature Coding and Dictionary Learning for Unsupervised FSAR
Recall from Section 1 that unsupervised FSAR forms a dictionary from the training data without the use of labels. Assigning labeled test support samples and test queries into cells of a dictionary lets us infer the query label by associating the query with a support sample (to paraphrase, if they share the same dictionary cell, they share the class label).
In this setting, we also use a mini-batch with B episodes. Thus, B query samples and BNZ support samples give a total of N′ = B(NZ+1) samples per batch for feature coding and dictionary learning. Let the dictionary be M ∈ R^{d′·τ*×k} and the dictionary-coded matrix A ≡ [α_1, ..., α_N′] ∈ R^{k×N′}. Let τ* be set as the average number of temporal blocks over the training sequences. For dictionary M and some codes A, the reconstructed feature maps are given as MA. In what follows, we reshape the reconstructed feature maps so that MA ∈ R^{d′×τ*×N′}; the reconstructed feature map per sequence is then Mα_i ∈ R^{d′×τ*}. All query and support sequences per batch form a set with N′ feature maps, which we select by writing Ψ_i ∈ Υ where i = 1, ..., N′. They are obtained from the Encoding Network the same way as for supervised FSAR, except that both query and support sequences are now equipped with K×K′ viewpoints. Algorithm 2 and Figure 8 illustrate unsupervised FSAR learning with JEANIE. In short, we minimize the following loss w.r.t. F, M and A by alternating over these variables:

min_{F, M, A} Σ_{i=1}^{N′} d²_JEANIE(Ψ_i, Mα_i) + Ω(α_i, M, Ψ_i).   (16)

Fig. 8: Unsupervised FSAR uses the JEANIE measure as a distance between the feature map of a sequence and its dictionary-based reconstruction Mα. LcSA performs feature coding to obtain the dictionary-coded α. DL learns the dictionary M.
Eq. (16) pursues the reconstruction of the feature map Ψ_i by a linear combination of dictionary codewords, given as Mα_i. The reconstruction error d²_JEANIE(Ψ_i, Mα_i) is encouraged to be small. However, unlike the Euclidean distance, JEANIE ensures the temporal and viewpoint alignment of the sequences Ψ_i with the dictionary-based reconstructions Mα_i. The constraint Ω(α_i, M, Ψ_i) is a regularization term depending on the selection of the feature coding method. Such a regularization encourages discriminative description, i.e., similar and different feature vectors obtain similar and different dictionary-coded representations, respectively. Appendix C provides details of several feature coding and dictionary learning strategies which determine Ω. In our work, the default choice is Soft Assignment and Dictionary Learning from Appendices C.1 and C.2 due to their simplicity and good performance. As the Soft Assignment code [78] was adapted to use JEANIE, we kept their numbers of iterations, alpha_iter = 50 and dic_iter = 5. The dictionary size k = 4096 was optimal, whereas τ* ranged between 30 and 60 for smaller and larger datasets, respectively.
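Below is a heavily simplified sketch of the alternating optimization behind Eq. (16): code the features with a Soft Assignment-style coder, then take a gradient step on the dictionary. The squared Euclidean distance stands in for d²_JEANIE (which would require the full temporal-viewpoint alignment), and all names and the update rule are illustrative assumptions, not the coder of [78].

```python
import numpy as np

def soft_assign(psi, M, sigma=1.0):
    """Soft Assignment coding: l1-normalized responses to the dictionary atoms (columns of M)."""
    d2 = ((psi[:, None] - M) ** 2).sum(0)                 # stand-in for d^2_JEANIE(psi, m_k)
    a = np.exp(-d2 / (2 * sigma ** 2))
    return a / (a.sum() + 1e-12)

def fit_dictionary(Psi, k=64, dic_iter=5, lr=0.1):
    """Alternate between coding (A) and a gradient step on the dictionary (M); Psi is (d, N')."""
    d, N = Psi.shape
    rng = np.random.default_rng(0)
    M = Psi[:, rng.choice(N, k, replace=True)].copy()     # initialize atoms from the data
    for _ in range(dic_iter):
        A = np.stack([soft_assign(Psi[:, i], M) for i in range(N)], axis=1)   # (k, N') codes
        grad = (M @ A - Psi) @ A.T / N                     # gradient of the reconstruction term
        M -= lr * grad
    return M, A
```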
During testing, we use the trained model F and the learnt dictionary M, pass test support and query sequences via Eq. (16), but solve only w.r.t. A until A converges. Subsequently, we compare the dictionary-coded vectors of query sequences with the corresponding dictionary-coded vectors of support sequences by using some distance measure, e.g., the ℓ1 or ℓ2 norm. We also explore the use of kernel-based distances, e.g., the Histogram Intersection Kernel (HIK) distance and the Chi-Square Kernel (CSK) distance, as they are designed for comparing vectors constrained to the ℓ1 simplex (Soft Assignment produces ℓ1-normalized codes α). The construction of the kernel distance involves a transformation from similarities to distances.
Let α and α′ be dictionary-coded vectors obtained by the use of JEANIE in Eq. (16). Then, for a kernel function k(α, α′), the induced distance between α and α′ is given by d(α, α′) = (k(α, α) + k(α′, α′) − 2k(α, α′))^{1/2}. The closest nearest-neighbor match of the test query to elements of the test support set determines the test label of the query sequence.
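For concreteness, a small NumPy sketch of the HIK and CSK kernels on ℓ1-normalized codes and the induced distance described above:

```python
import numpy as np

def hik(a, b):
    """Histogram Intersection Kernel."""
    return np.minimum(a, b).sum()

def csk(a, b, eps=1e-12):
    """(Additive) Chi-Square Kernel."""
    return 2.0 * (a * b / (a + b + eps)).sum()

def kernel_distance(a, b, k):
    """Distance induced by a kernel: d(a, b) = sqrt(k(a,a) + k(b,b) - 2 k(a,b))."""
    return np.sqrt(max(k(a, a) + k(b, b) - 2.0 * k(a, b), 0.0))
```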

Fusion of Supervised and Unsupervised FSAR
Our final contribution is to introduce four simple strategies for fusing our supervised and unsupervised FSAR approaches to boost the performance. As supervised learning is label-driven and unsupervised learning is reconstruction-driven, we expect both strategies to produce complementary feature spaces amenable to fusion.
In what follows, we make use of both support and query feature maps defined over multiple viewpoints (Ψ ∈ R^{d′×K×K′×τ}, Ψ′ ∈ R^{d′×K×K′×τ′}).

A weighted fusion of supervised and unsupervised FSAR scores. The simplest strategy is to train the supervised and unsupervised FSAR models separately, and combine their predictions during testing. We call this baseline "weighted fusion". During the testing stage, we combine the distances of the supervised and unsupervised models as follows:

d_fused = ρ · d_JEANIE(Ψ, Ψ′) + (1−ρ) · d_α(α, α′),   (17)

where d_α(·, ·) is the distance measure for dictionary-coded vectors, e.g., the ℓ1 norm, HIK distance or CSK distance, and 0 ≤ ρ ≤ 1 balances the impact of the supervised and unsupervised models, respectively.
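A one-line sketch of the weighted fusion in Eq. (17) as reconstructed above (ρ weights the supervised JEANIE distance; names are illustrative):

```python
def fused_distance(d_jeanie, d_alpha, rho=0.5):
    """Weighted fusion of the supervised (JEANIE) and unsupervised (dictionary-code) distances."""
    return rho * d_jeanie + (1.0 - rho) * d_alpha
```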
Finetuning the unsupervised model by supervised FSAR. For this baseline strategy, we first train the model using unsupervised FSAR, and then we finetune the learnt unsupervised model using supervised FSAR. During the testing stage, we evaluate supervised learning, unsupervised learning, and a fusion of both based on Eq. (17). In this case, one EN is trained, which results in two sets of parameters: the first set is based on unsupervised training and the second set is based on supervised finetuning. We call it "finetuning unsup.".

MAML-inspired fusion of supervised and unsupervised FSAR. Inspired by the success of MAML [26] and the categorical learner [59], we introduce a fusion strategy where the inner loop uses the unsupervised FSAR objective (Eq. (16)) and the outer loop uses the supervised learning loss (Eqs. (14) and (15)) for the model update. Algorithm 3 details this strategy, called "MAML-inspired fusion" (one training iteration); its inputs are the query/support blocks in a batch, the EN parameters F, the dictionary M and codes A, the numbers of iterations alpha_iter and dic_iter for updating A and M, the learning rates ω, ω_DL and ω_EN for A, M and F, respectively, and the mini-batch size B. Specifically, we start by generating representations with several viewpoints. For each mini-batch of size B, we form a set of N′ feature maps which are passed to Algorithm 2, which updates the EN parameters F towards task-specific parameters that accommodate unsupervised, reconstruction-driven learning (the so-called task-specific gradient, where the task is unsupervised learning). We then recompute the N′ feature maps based on these task-specific parameters. Finally, we apply the supervised loss on such feature maps but we now update the parameters F, which means that F is tuned for the global label-driven task with the help of the unsupervised task.
Intuitively, it is a second-order gradient model. Specifically, one takes a gradient step in the direction pointed by the unsupervised loss to obtain task-specific EN parameters. Subsequently, given these task-specific parameters, task-specific feature maps are extracted and passed into the supervised loss to perform the gradient descent step, in the direction pointed by the supervised loss, to obtain the update of the global EN parameters.
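A hedged PyTorch-style sketch of one such training iteration is shown below. The inner step takes a gradient of a placeholder unsupervised (reconstruction) loss to obtain task-specific parameters; the outer step evaluates a placeholder supervised loss with those parameters and backpropagates to the original parameters. It assumes the encoder exposes a functional forward pass accepting explicit parameters, which is an implementation detail not specified in the paper.

```python
import torch

def maml_fusion_step(encoder, batch, unsup_loss_fn, sup_loss_fn, inner_lr=1e-3):
    """One iteration of MAML-inspired fusion (a sketch with placeholder loss functions)."""
    feats = encoder(batch)                                    # feature maps with original parameters
    inner_loss = unsup_loss_fn(feats)                         # Eq. (16)-style reconstruction loss
    grads = torch.autograd.grad(inner_loss, list(encoder.parameters()), create_graph=True)
    fast_weights = [p - inner_lr * g for p, g in zip(encoder.parameters(), grads)]
    feats_task = encoder(batch, params=fast_weights)          # assumes a functional forward pass
    outer_loss = sup_loss_fn(feats_task)                      # Eqs. (14)-(15)-style supervised loss
    outer_loss.backward()                                     # second-order gradient w.r.t. originals
    return outer_loss.item()
```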
Fusion by alignment of supervised and unsupervised feature maps. Inspired by domain adaptation [47,48,97], Algorithm 4 in Appendix D is an easy-to-interpret simplification (called "adaptation-based") of the above MAML-inspired fusion. Instead of a complex gradient interplay between the unsupervised and supervised loss functions, we explicitly align the "supervised" feature maps towards the "unsupervised" feature maps.

Datasets and Protocols
Below, we describe the datasets and evaluation protocols on which we validate our FSAR with JEANIE.
i. UWA3D Multiview Activity II [84].

As the Kinetics-400 dataset provides only raw videos, we follow the approach of [130] and use the estimated joint locations in the pixel coordinate system as the input to our pipeline. To obtain the joint locations, we first resize all videos to a resolution of 340×256 and convert the frame rate to 30 FPS. Then we use the publicly available OpenPose [8] toolbox to estimate the locations of 18 joints on every frame of the clips. As OpenPose produces 2D body joint coordinates and Kinetics-400 does not offer multi-view or depth data, we use a network of Martinez et al. [72], pre-trained on Human3.6M [10] and combined with the 2D OpenPose output, to estimate 3D coordinates from the 2D coordinates. The 2D OpenPose output and the latter network give us the (x, y) and z coordinates, respectively.
Evaluation protocols. For UWA3D Multiview Activity II, we use the standard multi-view classification protocol [84,108], but we apply it to one-shot learning, as the view combinations for the training and testing sets are disjoint. For NTU-120, we follow the standard one-shot protocol [62]. Based on this protocol, we create a similar one-shot protocol for NTU-60, with 50/10 action classes used for training/testing, respectively. To evaluate the effectiveness of the proposed method on viewpoint alignment, we also create two new protocols on NTU-120, for which we group the whole dataset based on (i) horizontal camera views into left, center and right views, and (ii) vertical camera views into top, center and bottom views. We conduct two sets of experiments on such disjoint view-wise splits: (i) training and testing on the same 100 classes, and (ii) training on 100 classes but testing on the remaining unseen 20 classes. Appendix H provides more details of the training/evaluation protocols (subject splits, etc.) for the small-scale datasets as well as the large-scale Kinetics-400 dataset.
Stereo projections. For simulating different camera viewpoints, we estimate the fundamental matrix F (Eq. (19)), which relies on camera parameters. Thus, we use the Camera Calibrator from MATLAB to estimate the intrinsic, extrinsic and lens distortion parameters. For a given skeleton dataset, we compute the range of spatial coordinates x and y, respectively. We then split them into 3 equally-sized groups to form roughly left, center and right views, and another 3 groups for bottom, center and top views. We choose ~15 frames from each corresponding group, upload them to the Camera Calibrator, and export the camera parameters. We then compute the average distance/depth and height per group to estimate the camera position. On NTU-60 and NTU-120, we simply group the whole dataset into 3 camera views (left, center and right), as provided in [62], and then we compute the average distance per camera view based on the height and distance settings given in the table in [62].

Ablation Studies
We start our experiments by investigating various architectural choices and key hyperparameters of our model.

Camera viewpoint simulations. We choose 15 degrees as the step size for the viewpoint simulation. The ranges of the camera azimuth and altitude are within [−90°, 90°]. Where stated, we perform a grid search on the camera azimuth and altitude with Hyperopt [5]. Below, we explore the choice of angle ranges for both horizontal and vertical views; Fig. 9a and 9b show evaluations on the NTU-60 dataset.

Block size M and stride size S. Recall from Figure 1 that each skeleton sequence is divided into short-term temporal blocks which may also partially overlap.
Table 2 shows evaluations w.r.t. the block size M and stride S, and indicates that the best performance (in both the 50-class and 20-class settings) is achieved for a smaller block size (frame count in the block) and a smaller stride. Longer temporal blocks decrease the performance, as the temporal information does not reach the temporal alignment step of JEANIE. Our block encoder encodes each temporal block to learn local temporal motions, and finally aggregates these block features to form the global temporal motion cues. A smaller stride helps capture more local motion patterns. Considering the accuracy-runtime trade-off, we choose M = 8 and S = 0.6M for the remaining experiments.
GNN as a block of Encoding Network. Recall from Section 3.1 and Appendix B that our Encoding Network uses a GNN block. For that reason, we investigate several models with the goal of justifying our default choice.
We conduct experiments on the 4 GNNs listed in Table 3. S²GC performs the best on the large-scale NTU-60 and NTU-120, APPNP outperforms SGC, and SGC outperforms GCN. We also notice that using a GNN as the projection layer performs better than the single FC layer used in the standard transformer by ~5%. We note that using the RBF-induced distance for d_base(·, ·) of JEANIE outperforms the Euclidean distance. We choose S²GC as the block of our Encoding Network.

ι-max shift. Recall from Section 3.2 that the ι-max shift controls the smoothness of the alignment.
Table 4 shows the evaluations of ι for the maximum shift. We notice that ι = 2 yields the best results for all the experimental settings on both NTU-60 and NTU-120. Increasing ι further does not help improve the performance. We think ι depends on (i) the speed of action execution and (ii) the temporal block size M and the stride S.

Implementation Details
Before we discuss our main experimental results, below we provide network configurations and training details.
Network configurations. Given the temporal block size M (the number of frames in a block) and the desired output size d, the configuration of the 3-layer MLP unit is: FC (3M → 6M), LayerNorm (LN) as in [17], ReLU, FC (6M → 9M), LN, ReLU, Dropout (for smaller datasets, the dropout rate is 0.5; for large-scale datasets, the dropout rate is 0.1), FC (9M → d), LN. Note that M is the temporal block size and d is the output feature dimension per body joint.
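The stated configuration transcribed as a small PyTorch module (a sketch assuming a per-joint input of size 3M):

```python
import torch.nn as nn

def make_block_mlp(M, d, dropout=0.5):
    """The 3-layer MLP unit described above; input is a flattened (3*M)-dim per-joint block."""
    return nn.Sequential(
        nn.Linear(3 * M, 6 * M), nn.LayerNorm(6 * M), nn.ReLU(),
        nn.Linear(6 * M, 9 * M), nn.LayerNorm(9 * M), nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(9 * M, d), nn.LayerNorm(d))
```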
Transformer block. The hidden size of our transformer (the output size of the first FC layer of the MLP in Eq. (7)) depends on the dataset. For smaller-scale datasets, the depth of the transformer is L_tr = 6 with 64 as the hidden size, and the MLP output size is d = 32 (note that the MLP which provides X̂ and the MLP in the transformer must both have the same output size). For NTU-60, the depth of the transformer is L_tr = 6, the hidden size is 128 and the MLP output size is d = 64. For NTU-120, the depth of the transformer is L_tr = 6, the hidden size is 256 and the MLP output size is d = 128. For Kinetics-skeleton, the depth of the transformer is L_tr = 12, the hidden size is 512 and the MLP output size is d = 256. The number of transformer heads is set per dataset for UWA3D Multiview Activity II, NTU-60, NTU-120 and Kinetics-skeleton.

Training details. The parameters (weights) of the pipeline are initialized with the normal distribution (zero mean and unit standard deviation). We use 1e-3 as the learning rate, and the weight decay is set to 1e-6. We use the SGD optimizer. We set the number of training episodes to 100K for NTU-60, 200K for NTU-120, 500K for 3D Kinetics-skeleton, and 10K for UWA3D Multiview Activity II. We use Hyperopt [5] for the hyperparameter search on the validation sets of all datasets.

Discussion on Supervised Few-shot Action Recognition
NTU-60. Table 5 (Sup.) shows that using the viewpoint alignment simultaneously in two dimensions, x and y for Euler angles, or azimuth and altitude for the stereo projection geometry (CamVPC), improves the performance by 5-8% compared to (Euler) with a simple concatenation of viewpoints, a variant where the best viewpoint alignment path was chosen from the best alignment path along x and the best alignment path along y. Euler (simple concat.) is better than Euler with y rotations only ((V) includes rotations along y, while (2V) includes rotations along two axes). We indicate where temporal alignment (T) is also used. When we use HyperOpt [5] to search for the best angle range in which we perform the viewpoint alignment (CamVPC), the results improve further. Enabling the viewpoint alignment for support sequences (CamVPC) yields an extra improvement, and our best variant of JEANIE boosts the performance by ~2%.
We also show that aligning query and support trajectories by the angle of the torso 3D joint, denoted (Traj. aligned), is not very powerful. We note that aligning piece-wise parts (blocks) is better than trying to align entire trajectories. In fact, aligning individual frames by the torso to the frontal view (Each frame to frontal view) and aligning the block average of the torso direction to the frontal view (Each block to frontal view) were only marginally better. We note these baselines use soft-DTW.

NTU-120. Table 6 (Sup.) shows that our proposed method achieves the best results on NTU-120, and outperforms the recent SL-DML and Skeleton-DML by 6.1% and 2.8%, respectively (100 training classes). Note that Skeleton-DML requires a pre-trained model for the weight initialization, whereas our proposed model with JEANIE is fully differentiable. For comparisons, we extended the view-adaptive neural networks [140] by combining them with ProtoNet [91].

Kinetics-skeleton.
We evaluate our proposed model on both 2D and 3D Kinetics-skeleton. We follow the training and evaluation protocol in Appendix H. Table 7 shows that using 3D skeletons outperforms the use of 2D skeletons by 3-4%. The temporal alignment alone (with soft-DTW) outperforms the baseline (without alignment) by ~2% and 3% on 2D and 3D skeletons, respectively, and JEANIE outperforms the temporal alignment alone by around 5%. Our best variant of JEANIE further boosts results by 2%. We notice that the improvements from the camera viewpoint simulation (CamVPC) over the use of Euler angles are limited, around 0.3% and 0.6% for JEANIE and FVM, respectively. The main reason is that Kinetics-skeleton is a large-scale dataset collected from YouTube videos, and the camera viewpoint simulation becomes unreliable, especially when videos are captured by many different devices, e.g., cameras and mobile phones.

Discussion on Unsupervised Few-shot Action Recognition
Recall from Section 3.4 that JEANIE can help train unsupervised FSAR by forming a dictionary that relies on the temporal-viewpoint alignment of JEANIE, which factors out nuisance temporal and pose variations in sequences.
However, the choice of feature coding and dictionary learning method can affect the performance of unsupervised learning. Thus, we investigate several variants from Appendix C.
Table 5 (Unsup.) and Table 11 in Appendix E (an extension of Table 6 (Unsup.)) show on NTU-60 and NTU-120 that the LcSA coder performs better than SA by ~0.6% and 1.5%, whereas SA outperforms LLC by ~1.5% and 2%. As LcSA and SA are based on non-linear sigmoid-like reconstruction functions, we suspect they are more robust than the linear reconstruction function of LLC. Since LcSA is the best performer in our experiments, followed by SA and LLC or SC, we choose LcSA for further analysis.
Table 5 (Unsup.) and Tables 11 and 12 in Appendix E (extensions of Tables 6 (Unsup.) and 7 (Unsup.)) also show that the choice of distance measure for comparing the dictionary-coded vectors of sequences during the test stage does not affect the performance by much. The kernel-induced distances, e.g., the HIK distance and CSK distance, and the ℓ2 norm outperform the ℓ1 norm by ~0.5% on average. We choose the CSK distance as the default distance for comparing dictionary-coded vectors in unsupervised JEANIE with LcSA, as it was a marginally better performer in the majority of experiments.
Interestingly, FVM in unsupervised learning performs worse than our JEANIE, e.g., JEANIE surpasses FVM by ~3%, 4% and 3% on NTU-60, NTU-120 and Kinetics-skeleton, respectively, in Tables 5 (Unsup.), 6 (Unsup.) and 7 (Unsup.). On UWA3D Multiview Activity II in Table 8 (Unsup.), JEANIE outperforms FVM by more than 5%. This is because FVM always seeks the best local viewpoint alignment for every step of soft-DTW, which realizes a non-smooth temporal-viewpoint path, in contrast to JEANIE. Without the guidance of label information, FVM fails to capture the corresponding relationships between the temporal and viewpoint alignments. Thus, FVM produces a worse dictionary than JEANIE, which validates the need for jointly factoring out temporal and viewpoint nuisance variations from sequences.

Discussion on JEANIE and FVM
For supervised learning, JEANIE outperforms FVM by 2-4% on NTU-120, and by around 6% on Kinetics-skeleton. For unsupervised learning, JEANIE improves the performance by around 3% on average on NTU-60, NTU-120 and Kinetics-skeleton. On UWA3D Multiview Activity II, JEANIE surpasses FVM by 4% and 5% for the supervised and unsupervised experiments, respectively. This shows that jointly seeking the best temporal-viewpoint alignment is more valuable than considering viewpoint alignment as a separate local alignment task (free-range alignment per each step of soft-DTW). By and large, FVM often performs better than soft-DTW (temporal alignment only) by 3-5% on average.
To explain what makes JEANIE perform well on the task of comparing pairs of sequences, we perform some visualisations. To this end, we choose skeleton sequences from UWA3D Multiview Activity II for the experiments and visualizations of FVM and JEANIE. UWA3D Multiview Activity II contains rich viewpoint configurations and so is perfect for our investigations. We verify that our JEANIE is able to find better matching distances than FVM in the two following scenarios.
Matching similar actions. We choose a walking skeleton sequence ('a12 s01 e01 v01') as the query sample (with additional viewing angles from the camera viewpoint simulation), and we select another walking skeleton sequence of a different view ('a12 s01 e01 v03') and a running skeleton sequence ('a20 s01 e01 v02') as support samples, respectively.
Matching actions with similar motion trajectories. We choose a two-hand punching skeleton sequence ('a04 s01 e01 v01') as the query sample (with additional viewing angles from the camera viewpoint simulation), and we select another two-hand punching skeleton sequence of a different view ('a04 s05 e01 v02') and a holding head skeleton sequence ('a10 s05 e01 v02') as support samples, respectively.

Fig. 10: Visualization of FVM and JEANIE for walking vs. walking (two different sequences) and walking vs. running. We notice that for two different action sequences in (b), the greedy FVM finds a path with a very small distance d_FVM = 2.68, but for sequences of the same action class, FVM gives d_FVM = 4.60. This is clearly suboptimal, as the within-class distance is higher than the between-class distance (to counteract this issue, we propose JEANIE). In contrast, our JEANIE is able to produce a smaller distance for within-class sequences and a larger distance for between-class sequences, which is a very important property when comparing pairs of sequences.
Figures 10 and 11 show the visualizations. Comparing Figures 10a and 10b of FVM, we notice that for skeleton sequences from different action classes (walking vs. running), FVM finds a path with a very small distance, d_FVM = 2.68. In contrast, for sequences from the same action class (walking vs. walking), FVM gives d_FVM = 4.60, which is higher than the between-class distance. This is an undesired effect which may result in wrong comparison decisions. In contrast, in Figures 10c and 10d, our JEANIE gives d_JEANIE = 8.57 for sequences of the same action class and d_JEANIE = 11.21 for sequences from different action classes, which means that the within-class distance is smaller than the between-class distance. This is a very important property when comparing pairs of sequences: the within-class distance should be smaller than the between-class distance, but greedy approaches such as FVM cannot handle this requirement well, whereas JEANIE gives smaller distances when comparing within-class sequences than between-class sequences.
Figure 11 provides similar observations that JEANIE produces more reasonable matching distances than FVM.

Discussion on Multi-view Action Recognition
As mentioned in Section 4.5, JEANIE yields good results especially in unsupervised learning, with a performance gain of over 5% on UWA3D Multiview Activity II and 4% on the NTU-120 multi-view classification protocols. Below we discuss multi-view supervised FSAR.
Table 8 (Sup.) shows that adding temporal alignment (with soft-DTW) to SGC, APPNP and S²GC improves results on UWA3D Multiview Activity II, and a big performance gain is obtained by further adding the viewpoint alignment with JEANIE. Although the dataset is challenging due to novel viewpoints, JEANIE performs consistently well on all combinations of training/testing viewpoint settings. This is expected, as our method aligns both the temporal and camera viewpoint modes, which allows robust classification. JEANIE outperforms FVM by 4.2% and the baseline (temporal alignment only with soft-DTW) by 7% on average.
The influence of camera views has been explored in [105,108] on UWA3D Multiview Activity II. They show that when the left view V2 and right view V3 were used for training and the front view V1 for testing, the recognition accuracy is high, since the viewing angle of the front view V1 is between V2 and V3; when the left view V2 and top view V4 are used for training and the right view V3 is used for testing (or the front view V1 and right view V3 are used for training and the top view V4 is used for testing), the recognition accuracies are slightly lower. However, as shown in Table 8 (Sup.), our JEANIE is able to handle the influence of viewpoints and performs almost equally well on all 12 different view combinations, which highlights the importance of jointly aligning both the temporal and viewpoint modes of sequences.
Table 9 (Sup.) shows the experimental results on NTU-120. We notice that adding more camera viewpoints to the training process helps the multi-view classification, e.g., using bottom and center views for training and the top view for testing, or using left and center views for training and the right view for testing; the performance gain is more than 4% on (100/same 100). Notice that even though we test on 20 novel classes (100/novel 20) which are never used in the training set, we still achieve 62.7% and 70.8% for multi-view classification in the horizontal and vertical camera viewpoints, respectively.

Fusion of Supervised and Unsupervised FSAR
Recall that Section 3.5 defines two baseline and two advanced fusion strategies for supervised and unsupervised learning due to their complementary nature.
The adaptation-based fusion (Adaptation-based) performs almost as well as the MAML-inspired fusion, within 1% difference across datasets.This is expected as MAML algorithms are designed to learn across multiple tasks, i.e., in our case the unsupervised reconstruction-driven loss and the supervised loss interact together via gradient updates in such a way that the unsupervised information (a form of clustering) is transferred to guide the supervised loss.The domain adaptation inspired feature alignment achieves a similar effect but the transfer between unsupervised and supervised losses occurs at the feature representation level due to feature alignment.
Training one EN with the fusion of both supervised and unsupervised FSAR outperforms a naive fusion of scores (Weighted fusion) from two Encoding Networks trained separately. Finetuning an unsupervised model with the supervised loss (Finetuning unsup.) outperforms the weighted fusion.
Table 10 compares different testing strategies for the fusion models. The MAML-inspired fusion achieves the best results, with 1.5%, 22.0% and 3.7% improvements when tested on supervised learning, unsupervised learning and a fusion of both, respectively. For both the adaptation-based and MAML-inspired fusions, testing on unsupervised FSAR only (nearest neighbor on dictionary-encoded vectors) performs close to the results obtained from supervised FSAR only (nearest neighbor on feature maps), i.e., within a 5% difference. The reduced performance gap between supervised and unsupervised FSAR suggests that the feature space of the EN is adapted to both unsupervised and supervised FSAR.

Conclusions
We have proposed Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE) for sequence pairs and evaluated it on 3D skeleton sequences, whose poses/camera views are easy to manipulate in 3D. We have shown that the smoothness of the alignment, imposed jointly in the temporal and viewpoint modes, is advantageous compared to the temporal alignment alone (soft-DTW) or to models that freely align the viewpoint per temporal block without imposing smoothness on the variations of the matching path.
JEANIE can correctly match support and query sequence pairs as it factors out nuisance variations, which is essential under the limited samples of novel classes. Unsupervised FSAR especially benefits in such a scenario, i.e., when nuisance variations are factored out, sequences of the same class are more likely to occupy a similar or the same set of atoms in the dictionary. As supervised FSAR forms the feature space driven by the similarity learning loss and unsupervised FSAR by the dictionary reconstruction-driven loss, fusing both learning strategies has helped achieve further gains.
Our experiments have shown that using the stereo camera geometry is more effective than simply generating multiple views by Euler angles. Finally, we have contributed unsupervised, supervised and fused FSAR approaches to the small family of FSAR models for articulated 3D body joints.

A Euler Rotations and Simulated Camera Views
Euler angles [1] are defined as successive planar rotation angles around the x, y, and z axes. For 3D coordinates, we have the following rotation matrices R_x, R_y and R_z:

R_x(θ) = [[1, 0, 0], [0, cos θ, −sin θ], [0, sin θ, cos θ]],
R_y(θ) = [[cos θ, 0, sin θ], [0, 1, 0], [−sin θ, 0, cos θ]],
R_z(θ) = [[cos θ, −sin θ, 0], [sin θ, cos θ, 0], [0, 0, 1]].

As the resulting composite rotation matrix depends on the order of the rotation axes, i.e., R_x R_y R_z ≠ R_z R_y R_x, we also investigate the algebra of stereo projection.
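A minimal NumPy sketch of the three rotation matrices and one (order-dependent) composition applied to zero-centered joints; the composition order shown is only one possible choice:

```python
import numpy as np

def Rx(a): return np.array([[1, 0, 0], [0, np.cos(a), -np.sin(a)], [0, np.sin(a), np.cos(a)]])
def Ry(b): return np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
def Rz(c): return np.array([[np.cos(c), -np.sin(c), 0], [np.sin(c), np.cos(c), 0], [0, 0, 1]])

def rotate_skeleton(joints, a, b, c):
    """Rotate zero-centered (J, 3) joints by Euler angles a, b, c (composition order Rz @ Ry @ Rx)."""
    return joints @ (Rz(c) @ Ry(b) @ Rx(a)).T
```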
Stereo projections [2]. Suppose we have a rotation matrix R and a translation vector t = [t_x, t_y, t_z]^T between the left/right cameras (imagine some non-existent stereo camera). Let M_l and M_r be the intrinsic matrices of the left/right cameras, and let p_l and p_r be the coordinates of a 3D point in the left/right camera coordinate systems. As the origin of the right camera in the left camera coordinates is t, we have p_r = R(p_l − t) and (p_l − t)^T = (R^T p_r)^T. The plane (polar surface) formed by all points passing through t can be expressed by (p_l − t)^T (p_l × t) = 0. Then, p_l × t = S p_l, where

S = [[0, −t_z, t_y], [t_z, 0, −t_x], [−t_y, t_x, 0]].
Based on the above equations, we obtain p_r^T R S p_l = 0, and note that RS = E is the Essential Matrix; p_r^T E p_l = 0 describes the relationship for the same physical point under the left and right camera coordinate systems. As E contains no internal information about the cameras, and E is based on the camera coordinates, we use a fundamental matrix F that describes the relationship for the same physical point under the camera pixel coordinate system. The relationship between the pixel and camera coordinates is given by the intrinsic matrices, i.e., the projections p′_l and p′_r of p_l and p_r have pixel coordinates p*_l = M_l p′_l and p*_r = M_r p′_r. Then we can write p*_r^T F p*_l = 0, where F = M_r^{-T} E M_l^{-1} is the fundamental matrix. Thus, the relationship for the same point in the pixel coordinate systems of the left/right cameras is p*_r^T F p*_l = 0. We treat 3D body joint coordinates as p*_l. Given F, we obtain their coordinates p*_r in the new view.
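The relations above can be assembled as follows (a NumPy sketch; R, t and the intrinsic matrices M_l, M_r are assumed to come from the calibration step described in the main text):

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix S such that S @ p == np.cross(p, t) for the convention used above."""
    tx, ty, tz = t
    return np.array([[0, -tz, ty], [tz, 0, -tx], [-ty, tx, 0]])

def fundamental_matrix(R, t, Ml, Mr):
    """E = R S (as above); F relates pixel coordinates between the left and right cameras."""
    E = R @ skew(t)
    return np.linalg.inv(Mr).T @ E @ np.linalg.inv(Ml)

# epipolar constraint for a corresponding pair of homogeneous pixel coordinates p_l_star, p_r_star:
# p_r_star.T @ F @ p_l_star == 0 (up to noise)
```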

B Graph Neural Network as a Block of Encoding Network
GNN notations. Firstly, let G = (V, E) be a graph with the vertex set V = {v_1, …, v_n} and edge set E. Let A and D be the adjacency and diagonal degree matrix, respectively. Let Ã = A + I be the adjacency matrix with self-loops (I is the identity matrix), with the corresponding diagonal degree matrix D̃ such that D̃_ii = Σ_j Ã_ij, and let S = D̃^{−1/2} Ã D̃^{−1/2} be the normalized adjacency matrix with added self-loops. For the l-th layer, we use Θ^{(l)} to denote the learnt weight matrix, and Φ to denote the output of the graph network. Below, we list the backbones used by us.
GCN [42]. GCNs learn the feature representations for the features x_i of each node over multiple layers. For the l-th layer, we denote the input by H^{(l−1)} and the output by H^{(l)}. Let the input (initial) node representations be H^{(0)} = X. By X we mean some node features for generality of explanation; for our particular case, following the notation in Eq. (2), we set H^{(0)} = X for each temporal block. For an L-layer GCN, the output representations are given by
\[
H^{(l)} = \mathrm{ReLU}\big(S\,H^{(l-1)}\,\Theta^{(l)}\big), \qquad \Phi = H^{(L)}.
\]
APPNP [43]. Personalized Propagation of Neural Predictions (PPNP) and its fast approximation, APPNP, are based on the personalized PageRank. Let H^{(0)} = f_Θ(X) be the input to APPNP, where f_Θ(·) can be an MLP with parameters Θ. The output of the l-th layer is H^{(l)} = (1−α) S H^{(l−1)} + α H^{(0)}, where α is the teleport (or restart) probability in the range (0, 1]. For an L-layer APPNP, we have
\[
\Phi = \Big((1-\alpha)^{L} S^{L} + \alpha \sum_{l=0}^{L-1} (1-\alpha)^{l} S^{l}\Big) H^{(0)}.
\]
SGC [127] & S²GC [154]. SGC captures the L-hop neighborhood in the graph by the L-th power of the transition matrix used as a spectral filter. For an L-layer SGC, we obtain
\[
\Phi = S^{L} X \Theta.
\]
Based on a modified Markov Diffusion Kernel, Simple Spectral Graph Convolution (S²GC) is the summation over l-hops, l = 1, …, L. The output of S²GC is
\[
\Phi = \frac{1}{L}\sum_{l=1}^{L} S^{l} X \Theta.
\]
In the case of APPNP, SGC and S²GC, |F_GNN| = 0 because we do not use their learnable parameters Θ (i.e., Θ is set to the identity matrix). The GNN outputs are further passed into a Transformer and an FC layer, which returns the query feature maps Ψ ∈ R^{d′×K×K′×τ} and the support feature maps Ψ′ ∈ R^{d′×τ′}.
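As a concrete illustration of the parameter-free propagation used by SGC and S²GC, here is a small NumPy sketch (our own; the actual encoder additionally feeds the output through a Transformer and an FC layer):

```python
import numpy as np

def normalized_adjacency(A):
    """S = D^{-1/2} (A + I) D^{-1/2}, the normalized adjacency with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def sgc(X, A, L=3):
    """SGC: the L-hop filter S^L applied to node features X (Theta omitted, i.e., identity)."""
    S = normalized_adjacency(A)
    return np.linalg.matrix_power(S, L) @ X

def s2gc(X, A, L=3):
    """S2GC: the average of 1..L hop propagations, (1/L) * sum_l S^l X."""
    S = normalized_adjacency(A)
    H, out = X.copy(), np.zeros_like(X)
    for _ in range(L):
        H = S @ H
        out += H
    return out / L

# Toy skeleton graph: 4 joints on a chain, 3-dimensional features per joint.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
print(sgc(X, A).shape, s2gc(X, A).shape)   # (4, 3) (4, 3)
```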

C Feature Coding and Dictionary Learning
The core idea of feature coding is to reconstruct a feature vector with codewords by solving a least-squares-based optimization problem with constraints imposed on the codewords. The full set of codewords (a.k.a. elements or atoms) composes a dictionary. Atoms in the dictionary are not required to be orthogonal, and the dictionary may be over-complete (the number of atoms is larger than their dimension). For most feature coding algorithms, only a subset of codewords is chosen by the solver to represent a feature vector, and thus the coding vector α may be sparse, i.e., the responses are zero on those codewords which are not chosen. In what follows, however, we replace the Euclidean distance with the JEANIE measure.
The main difference among various feature coding methods lies in the constraint term Ω(α_i, M, Ψ_i), whose choice realizes some desired constraints via the regularization weight κ > 0, e.g., Ω(α_i, M, Ψ_i) = ∥α_i∥_1 encourages sparsity of α. Alternatively, we obtain α by defining some specific function α(Ψ_i; M) that implicitly realizes the regularization term.
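For concreteness, the generic coding template referred to here can be sketched as below; this is the standard least-squares-plus-regularizer form, and the paper's numbered equation (e.g., Eq. (16)) may differ in details such as replacing the ℓ2 reconstruction error with the JEANIE measure:
\[
\alpha_i \;=\; \operatorname*{arg\,min}_{\alpha}\; \big\|\Psi_i - M\alpha\big\|_2^2 \;+\; \kappa\,\Omega(\alpha, M, \Psi_i),
\qquad \text{e.g.,}\quad \Omega(\alpha, M, \Psi_i)=\|\alpha\|_1 .
\]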

C.1 Feature Coding
Below we detail the different feature coders we explore in our work, i.e., Hard Assignment (HA) [14], Sparse Coding (SC) [54,132], Non-negative Sparse Coding (SC+) [36], Locality-constrained Linear Coding (LLC) [104], Soft Assignment (SA) [6,27], and Locality-constrained Soft Assignment (LcSA) [46,64]. LcSA is our default feature coder due to its simplicity and strong performance. In the formulations below, dist(·,·) denotes the base distance, i.e., the Euclidean distance in the classic formulations and the JEANIE measure in our case.
Hard Assignment (HA). This encoder assigns each Ψ to its nearest atom m by solving the following optimisation problem:
\[
\alpha=\operatorname*{arg\,min}_{\alpha'}\ \mathrm{dist}^2(\Psi, M\alpha')\quad\text{s.t.}\quad \alpha'\in\{0,1\}^{k},\ \mathbf{1}^{T}\alpha'=1.
\]
Sparse Coding (SC) & Non-negative Sparse Coding (SC+). SC encodes each Ψ as a sparse linear combination of the atoms of M by optimising the following objective:
\[
\alpha=\operatorname*{arg\,min}_{\alpha'}\ \mathrm{dist}^2(\Psi, M\alpha') + \kappa\,\|\alpha'\|_1,
\]
whereas SC+ additionally imposes the constraint α' ≥ 0. Both SC and SC+ encode each Ψ on a subspace of M whose size is controlled by the sparsity term.
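As a rough illustration of how the SC/SC+ objective can be solved (with the plain ℓ2 distance rather than JEANIE), here is a minimal ISTA sketch; the choice of solver, the parameter values and the function names are our assumptions, not the paper's implementation.

```python
import numpy as np

def soft_threshold(v, lam):
    """Element-wise soft-thresholding, the proximal operator of lam * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_code_ista(psi, M, kappa=0.1, iters=200, nonneg=False):
    """Solve min_a ||psi - M a||_2^2 + kappa * ||a||_1 with ISTA.
    Setting nonneg=True additionally clips codes at zero (the SC+ variant)."""
    L = np.linalg.norm(M, 2) ** 2          # spectral norm squared; step size below is 1/(2L)
    a = np.zeros(M.shape[1])
    for _ in range(iters):
        grad = M.T @ (M @ a - psi)          # half the gradient of the smooth term
        a = soft_threshold(a - grad / L, kappa / (2 * L))
        if nonneg:
            a = np.maximum(a, 0.0)
    return a

# Toy example: a 16-dimensional feature coded over 32 atoms.
rng = np.random.default_rng(0)
M = rng.standard_normal((16, 32))
psi = rng.standard_normal(16)
alpha = sparse_code_ista(psi, M)
print(np.count_nonzero(alpha), "non-zero codes out of", alpha.size)
```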
Locality-constrained Linear Coding (LLC). The LLC encoder uses the following criterion for each Ψ:
\[
\alpha=\operatorname*{arg\,min}_{\alpha'}\ \mathrm{dist}^2(\Psi, M\alpha') + \kappa\,\|\boldsymbol{d}\odot\alpha'\|_2^2\quad\text{s.t.}\quad \mathbf{1}^{T}\alpha'=1,
\]
where ⊙ denotes the element-wise multiplication and d ∈ R^k is the non-locality penalty that penalises the selection of dictionary atoms that are far from Ψ. Specifically,
\[
d_i=\exp\!\big(\mathrm{dist}(\Psi, \boldsymbol{m}_i)/\sigma\big),
\]
where σ > 0 adjusts the weight decay speed of the non-locality penalty. We further normalize d to lie between 0 and 1. The constraint 1^T α' = 1 follows the shift-invariance requirement of the LLC encoder.
Soft Assignment (SA) & Locality-constrained Soft Assignment (LcSA). SA expresses each Ψ by the membership probability of Ψ belonging to each atom m of M, a concept known from the MLE of Gaussian Mixture Models (GMMs). SA is derived under equal mixing probabilities and a shared variance σ of the GMM components. SA has the closed form:
\[
\alpha'_i=\frac{\exp\!\big(-\tfrac{1}{2\sigma^2}\,\mathrm{dist}^2(\Psi,\boldsymbol{m}_i)\big)}{\sum_{j=1}^{k}\exp\!\big(-\tfrac{1}{2\sigma^2}\,\mathrm{dist}^2(\Psi,\boldsymbol{m}_j)\big)}.
\]
The above model usually yields the largest value of α'_i for the anchor m_i in M that is a close JEANIE neighbor of Ψ. However, even for m_i that is far from Ψ, α'_i > 0. For this reason, SA is only approximately locality-constrained.
LcSA admits the locality-constrained membership probability of the form:
\[
\alpha'_i=\frac{\exp\!\big(-\tfrac{1}{2\sigma^2}\,\mathrm{dist}^2(\Psi,\boldsymbol{m}_i)\big)}{\sum_{j\in NN(\Psi;k')}\exp\!\big(-\tfrac{1}{2\sigma^2}\,\mathrm{dist}^2(\Psi,\boldsymbol{m}_j)\big)}\quad\text{for}\ i\in NN(\Psi;k'),\qquad \alpha=\pi(\alpha'),
\]
where NN(Ψ; k') returns the k' nearest neighbors of Ψ in M (based on the JEANIE measure) and π(·) projects the coefficients of α' back into α at the positions of those neighbors, with the remaining locations of α zeroed; thus LcSA forms subspaces of size k'.
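A minimal sketch of the LcSA coder, assuming the per-atom distances have already been computed (with JEANIE in the paper, but any distance works for the illustration):

```python
import numpy as np

def lcsa_code(dists, k_prime=4, sigma=1.0):
    """Locality-constrained Soft Assignment.
    dists: (k,) distances from a feature Psi to every dictionary atom.
    Returns a (k,) code alpha supported on the k_prime nearest atoms."""
    k = dists.shape[0]
    nn = np.argsort(dists)[:k_prime]              # NN(Psi; k'): indices of nearest atoms
    logits = -dists[nn] ** 2 / (2.0 * sigma ** 2) # Gaussian-like memberships
    w = np.exp(logits - logits.max())             # numerically stabilized softmax
    w /= w.sum()
    alpha = np.zeros(k)                           # pi(.): scatter back to original indexes
    alpha[nn] = w
    return alpha

# Toy example with 8 atoms.
dists = np.array([0.9, 0.2, 1.5, 0.4, 2.0, 0.3, 1.1, 0.8])
print(lcsa_code(dists, k_prime=3).round(3))
```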

C.2 Dictionary Learning
For all of the feature coding methods listed above, we employ a simple dictionary learning objective which follows Eq. (16). We assume some evaluated/fixed dictionary-coded vectors given as a coding matrix A ≡ [α_1, …, α_{N′}] (N′ is the number of samples per mini-batch), and we minimize the reconstruction term of Eq. (16) w.r.t. the dictionary M. Notice that for fixed A and fixed feature matrices Ψ, the regularization term becomes a constant. For the dictionary learning step, we detach Ψ and α, and run 10 iterations of gradient descent per mini-batch w.r.t. M.

Table 13: Time cost (seconds) per 10 episodes vs. performance (%) on MSRAction3D. We set the stride step S = 5 and M = 10. The dictionary size is k = 4096 unless indicated otherwise, and τ* = 30. See the text for remarks about the relatively larger number of epochs required for the convergence of supervised FSAR compared to unsupervised FSAR.
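Returning to the dictionary-learning step above: a toy PyTorch sketch of the alternating update, assuming a plain ℓ2 reconstruction loss (the paper's Eq. (16) and its distance may differ); the tensor shapes and the optimizer choice are ours.

```python
import torch

def dictionary_update(M, Psi, A, lr=0.01, iters=10):
    """One dictionary-learning step: with codes A and features Psi fixed (detached),
    run a few gradient-descent iterations on the reconstruction loss w.r.t. M."""
    Psi, A = Psi.detach(), A.detach()
    M = M.clone().requires_grad_(True)
    opt = torch.optim.SGD([M], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = ((Psi - M @ A) ** 2).mean()   # least-squares reconstruction of the mini-batch
        loss.backward()
        opt.step()
    return M.detach()

# Toy example: d'-dimensional features, k atoms, N' samples in the mini-batch.
d, k, n = 16, 32, 8
M = torch.randn(d, k)
Psi = torch.randn(d, n)
A = torch.randn(k, n)
M = dictionary_update(M, Psi, A)
```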

D Fusion by Alignment
Fusion by alignment of supervised and unsupervised feature maps.
Inspired by domain adaptation, Algorithm 4 performs a fusion of supervised and unsupervised FSAR by aligning the feature maps obtained with supervised and unsupervised FSAR. Specifically, we start by generating representations with several viewpoints. For each mini-batch of size B, we form a set of N′ feature maps which are passed to Algorithm 2. Subsequently, from the EN parameters F we obtain a second set of parameters that accommodates the unsupervised, reconstruction-driven learning. We compute "unsupervised" feature maps with these parameters and encourage the "supervised" feature maps to align with them based on the JEANIE measure. The parameter λ ≥ 0 controls the strength of the alignment. For the supervised step, we use the supervised loss from Eqs. (14) and (15). Finally, we update the EN parameters F.
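Below is a deliberately simplified, hypothetical sketch of one such fusion iteration: a supervised loss plus a λ-weighted alignment term that pulls the "supervised" feature maps towards the "unsupervised" ones. The stand-in branches, the cross-entropy loss and the squared-error alignment are our placeholders, not the paper's Algorithm 4 (which uses the EN, the JEANIE measure, and Eqs. (14)–(15)).

```python
import torch
import torch.nn as nn

# Toy stand-ins; the real pipeline uses the EN (GNN + Transformer + FC) and JEANIE.
encoder = nn.Linear(12, 8)            # "supervised" branch
unsup_branch = nn.Linear(12, 8)       # branch adapted for reconstruction-driven learning
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.01)
lam = 0.1                             # strength of the alignment term (lambda)

def fusion_step(batch, labels):
    feats_sup = encoder(batch)                    # "supervised" feature maps
    with torch.no_grad():
        feats_unsup = unsup_branch(batch)         # "unsupervised" feature maps (fixed here)
    align = ((feats_sup - feats_unsup) ** 2).mean()            # stand-in for JEANIE alignment
    sup_loss = nn.functional.cross_entropy(feats_sup, labels)  # stand-in for Eqs. (14)-(15)
    loss = sup_loss + lam * align
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

batch, labels = torch.randn(4, 12), torch.randint(0, 8, (4,))
print(fusion_step(batch, labels))
```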

E Additional Results on Unsupervised FSAR
Tables 11 and 12 below show additional results on NTU-120 and on the 2D and 3D Kinetics-skeleton datasets.

F Training Speeds
Table 13 investigates the supervised, unsupervised and fusion strategies in terms of speed. While supervised training appears to be faster, it also takes more episodes to converge, e.g., 400 vs. 80. Worth noting is that the unsupervised strategy runs non-optimized code whose dictionary learning and code assignment can be parallelized to bring computation times down significantly. Table 14 below compares training and inference times per query on a Titan RTX 2090. For soft-DTW, each query is augmented by K × K′ = 9 viewpoints. At test time, we average the matching distance over the K × K′ = 9 viewpoints of each test query w.r.t. the support samples (a popular test-time augmentation strategy); this strategy is denoted soft-DTWaug. We also apply the above strategy to TAP (denoted TAPaug). JEANIE also uses K × K′ = 9 viewpoints per query. We exclude the time of viewpoint generation, as skeletons can be pre-processed once (1.6 h with non-optimized CPU code) and stored for future use. Among methods which use multiple viewpoints, JEANIE outperforms soft-DTWaug and TAPaug by 8.2% and 7.4%, respectively. JEANIE outperforms ordinary soft-DTW and TAP by 11.3% and 10.8%. For soft-DTWaug and TAPaug, the total training and testing were 5× and 9× slower than for their soft-DTW and TAP counterparts, which is expected as they had to deal with K × K′ = 9 times more samples. We also tried a parallel implementation of JEANIE: training JEANIEpar with 4 Titan RTX 2090 GPUs took 44 h, and the total inference took 48 s.

Fig. 2: One may use (top) stereo projections to simulate different camera views or simply use (bottom) Euler angles to rotate the 3D scene.

Fig. 9: The impact of viewing angles in (a) horizontal and (b) vertical camera views on NTU-60.

Fig. 11: Visualization of FVM and JEANIE for two-hand punching vs. two-hand punching (two different sequences) and two-hand punching vs. holding head. Notice that for the two sequences of different action classes in (b), the greedy FVM finds a path that yields d_FVM = 1.63, yet FVM gives d_FVM = 1.95 for two sequences of the same class. The within-class distance should be smaller than the between-class distance, but greedy approaches such as FVM cannot handle this requirement well. JEANIE gives a smaller distance when comparing within-class sequences than when comparing between-class sequences, which is very important for comparing sequences.
Algorithm inputs: Υ ≡ {Ψ_b}_{b∈I_B} ∪ {Ψ′_{b,n,z}}_{b∈I_B}; M and A; alpha_iter and dic_iter: numbers of iterations for updating A and M; ω, ω_DL and ω_EN: the learning rates for A, M and F, respectively; B: size of the mini-batch.
d+ is the set of within-class distances for the mini-batch of size B given the N-way Z-shot learning protocol; by analogy, d− is the set of between-class distances. The function µ(·) is simply the mean over the coefficients of the input vector, {·} detaches the graph during the backpropagation step, whereas TopMin_β(·) and TopMax_{NZβ}(·) return the β smallest and the NZβ largest coefficients of the input vectors, respectively. Thus, Eq. (14) promotes within-class similarity while Eq. (15) reduces between-class similarity. The integer β ≥ 0 controls the focus on difficult examples, e.g., β = 1 encourages all within-class distances in Eq. (14) to be close to the positive target µ(TopMin_β(·)), the smallest observed within-class distance in the mini-batch. If β > 1, we relax the positive target. By analogy, if β = 1, we encourage all between-class distances in Eq. (15) to approach the negative target µ(TopMax_{NZβ}(·)). A toy sketch of these targets is given after the dataset list below.

Algorithm 2: Unsupervised FSAR (one training iteration by alternating over variables). Input: F: EN parameters; step 1: for i = 1, …, alpha_iter (fix M and update A); …

i. UWA3D Multiview Activity II [86] contains 30 actions performed by 9 people in a cluttered environment. The Kinect camera was used in 4 distinct views: front view (V1), left view (V2), right view (V3), and top view (V4). ii. NTU RGB+D (NTU-60) [86] contains 56,880 video sequences and over 4 million frames. This dataset has variable sequence lengths and high intra-class variations. iii. NTU RGB+D 120 (NTU-120) [62] contains 120 action classes (daily/health-related) and 114,480 RGB+D video samples captured from 106 distinct human subjects and 155 different camera viewpoints. iv. Kinetics [40] is a large-scale collection of 650,000 video clips covering 400/600/700 human action classes. It includes human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging.
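As referenced above, here is a toy sketch of how the detached positive/negative targets could be formed from the within/between-class distance sets. The exact per-pair penalties of Eqs. (14)–(15) are not reproduced; the squared error to the targets is our illustrative stand-in.

```python
import torch

def fsar_targets_and_losses(d_pos, d_neg, beta=1, N=5, Z=1):
    """Sketch of the positive/negative targets built from within-class (d_pos)
    and between-class (d_neg) distances of one mini-batch."""
    pos_target = torch.topk(d_pos, k=beta, largest=False).values.mean().detach()      # mu(TopMin_beta)
    neg_target = torch.topk(d_neg, k=N * Z * beta, largest=True).values.mean().detach()  # mu(TopMax_{NZ beta})
    loss_within = ((d_pos - pos_target) ** 2).mean()    # pull within-class distances down
    loss_between = ((d_neg - neg_target) ** 2).mean()   # push between-class distances up
    return loss_within, loss_between

d_pos = torch.rand(10)        # e.g., JEANIE distances of same-class support-query pairs
d_neg = torch.rand(20) + 0.5  # distances of different-class pairs
print(fsar_targets_and_losses(d_pos, d_neg, beta=1, N=4, Z=1))
```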

Table 2: The impact of the number of frames M per temporal block under the stride step S on results (NTU-60). S = pM, where 1 − p describes the temporal block overlap percentage; higher p means fewer overlapping frames between temporal blocks.

The range [−45°, 45°] performs the best, and widening the range in both views does not increase the performance any further. Table 1 shows results for the chosen range [−45°, 45°] of camera viewpoint simulations. (Euler simple (K + K′)) denotes a simple concatenation of features from both horizontal and vertical views, whereas (Euler (K×K′)) and (CamVPC (K×K′)) represent the grid search over all possible views. The table shows that Euler angles for the viewpoint augmentation outperform (Euler simple), and (CamVPC) (viewpoints of query sequences generated by the stereo projection geometry) outperforms Euler angles in almost all the experiments on NTU-60 and NTU-120. This demonstrates the effectiveness of using the stereo projection geometry for the viewpoint augmentation.

Table 3: Evaluations of the GNN (block of the Encoding Network).

Table 5: Results on NTU-60 (all use S²GC). All methods enjoy temporal alignment by soft-DTW or JEANIE (joint temporal and viewpoint alignment), except where indicated otherwise. We use the ℓ2 norm for comparing the codes in the unsupervised setting with soft-DTW. For unsupervised JEANIE, the distance used for comparing the codes is indicated.

Table 6: Experimental results on NTU-120 (S²GC backbone). All methods enjoy temporal alignment by soft-DTW or JEANIE (joint temporal and viewpoint alignment), except VA [139,140] and other cited works. For VA*, we used soft-DTW on temporal blocks, whereas VA generated the temporal blocks. For unsupervised soft-DTW and JEANIE, the best distance for comparing the codes is indicated. For brevity, we list unsupervised variants with LcSA only, but Table 11 in Appendix E contains all variants.

Table 7: Experiments on 2D and 3D Kinetics-skeleton. Note that we have no results for JEANIE or FVM on 2D coordinates, as these require very different viewpoint modeling than 3D coordinates. For brevity, we list unsupervised variants with LcSA only, but Table 12 in Appendix E contains more variants.

Table 8: Experiments on the UWA3D Multiview Activity II. All methods use the S²GC layer unless specified otherwise.

Table 10: Evaluation of different testing strategies (supervised learning, unsupervised learning, and a combination of both) on Kinetics-skeleton when the model is trained with the fusion of supervised and unsupervised FSAR.
NN(Ψ; k′) returns the k′ nearest neighbors of Ψ in M based on the JEANIE measure, whereas π(·) projects the coefficients of α′ back into α at positions following the original indexes of the nearest neighbors in the dictionary M. The remaining locations in α are zeroed. LcSA forms subspaces of size k′.
Algorithm 4: Fusion of Supervised and Unsupervised FSAR by Feature Map Alignment (one training iteration). Input: Γ ≡ {X_b}_{b∈I_B} ∪ {X′_{b,n,z}}_{b∈I_B}; F: EN parameters; M and A; alpha_iter and dic_iter: numbers of iterations for updating A and M; ω, ω_DL and ω_EN: the learning rates for A, M and F, respectively; B: size of the mini-batch; λ: regularization parameter.

Table 11: Experimental results on NTU-120 (S²GC backbone). All methods enjoy temporal alignment by soft-DTW or JEANIE (joint temporal and viewpoint alignment). We use the ℓ2 norm for comparing the codes in the unsupervised setting with soft-DTW. For unsupervised JEANIE, the distance used for comparing the codes is indicated.

Table 12: Experiments on 2D and 3D Kinetics-skeleton. We use the ℓ2 norm for comparing the codes in the unsupervised setting with soft-DTW. For unsupervised JEANIE, the distance used for comparing the codes is indicated.