In the Eye of Transformer: Global–Local Correlation for Egocentric Gaze Estimation and Beyond

Predicting human gaze from egocentric videos plays a critical role in understanding human intention in daily activities. In this paper, we present the first transformer-based model to address the challenging problem of egocentric gaze estimation. We observe that the connection between the global scene context and local visual information is vital for localizing the gaze fixation from egocentric video frames. To this end, we design the transformer encoder to embed the global context as one additional visual token and further propose a novel global–local correlation module to explicitly model the correlation of the global token and each local token. We validate our model on two egocentric video datasets, EGTEA Gaze+ and Ego4D. Our detailed ablation studies demonstrate the benefits of our method. In addition, our approach exceeds the previous state-of-the-art model by a large margin (at least 3.9% in F1 score on EGTEA Gaze+ and 5.6% on Ego4D). We also apply our model to a novel gaze saccade/fixation prediction task and the traditional action recognition problem. The consistent gains suggest the strong generalization capability of our model. We also provide additional visualizations to support our claim that global–local correlation serves as a key representation for predicting gaze fixation from egocentric videos. More details can be found on our website (https://bolinlai.github.io/GLC-EgoGazeEst).


Introduction
Findings in cognitive neuroscience suggest that eye movements reflect cognitive processes [63], which are essential for understanding human intention during daily activities [21]. Such an understanding of visual attention and intention can be valuable for many applications, including Augmented Reality (AR), Virtual Reality (VR), and Human-Robot Interaction (HRI). While wearable eye trackers are a standard way to obtain measurements of gaze behavior, they require calibration, consume significant power, and add substantial cost and complexity to wearable platforms. Alternatively, prior works [1,22,23,24,34,36,44,49,52,53] seek to estimate the visual attention of the camera-wearer from videos captured from a first-person perspective. This task is known as egocentric gaze estimation.
The key challenge in egocentric gaze estimation is to effectively integrate multiple gaze cues into a holistic analysis of visual attention. Cues include the likelihood that different scene objects are gaze targets (i.e., salience), the relative location of gaze targets within the video frame (i.e., center prior), and the patterns of camera movement that are reflective of visual attention (i.e., head motions accompanying a gaze shift). Recently, the transformer has achieved great success in various vision tasks by modeling the spatio-temporal correlation among local visual tokens [12,16,32,39,41,48,50,62]. However, the pairwise comparisons performed by standard transformer Self-Attention (SA) are not optimized for interpreting local video features in the context of the global scene. The example in Fig. 1 illustrates the key role of comparisons between local patches and global context: the gaze target is a salient object pointed at by both the camera wearer and another person. Such a salient object cannot be easily localized by only modeling the correlation of local patches.

This paper introduces a novel transformer-based deep model that explicitly embeds global context and calculates spatio-temporal global-local correlation for egocentric gaze estimation. Specifically, we design a transformer encoder that adopts a global visual token embedding strategy to incorporate the global scene context. We then introduce a novel Global-Local Correlation (GLC) module that highlights the connection between global and local visual tokens. Finally, we adopt a transformer-based decoder to produce the gaze prediction output. We evaluate our approach on two egocentric video datasets, EGTEA Gaze+ [35] and Ego4D [18]. Our proposed model is easy to incorporate into existing transformer-based video analysis architectures, and we show that it yields an improvement of more than 3.9% in F1 score over SOTA methods for egocentric gaze estimation. The code and pretrained models will be made publicly available to the research community. In summary, this work makes the following contributions:
• We introduce the first transformer-based approach to address the challenging task of egocentric gaze estimation.
• We utilize a global visual token embedding strategy to incorporate global visual context into self-attention, and further introduce a novel Global-Local Correlation module to explicitly model the correlation between global context and each local visual token.
• Our novel design obtains consistent improvement on the EGTEA Gaze+ [35] and Ego4D [18] datasets and outperforms previous state-of-the-art methods by at least 3.9% on EGTEA and 5.6% on Ego4D in F1 score. Importantly, this is the first work that uses the Ego4D dataset for egocentric gaze estimation, which serves as an important benchmark for future research in this direction.

Related Work
The task of egocentric gaze estimation is distinct from prior work on eye tracking [30,43,64] and gaze target prediction from third-person video [9,10,28,46]. In the interest of space, we limit our discussion to prior work on egocentric gaze estimation and related works on transformer-based video representation learning and video saliency prediction.

Egocentric Gaze Estimation. Previous works have shown that gaze behavior plays an important role in recognizing human daily actions from egocentric videos [1,22,23,24,34,36,37,38,44,49,52,53,65]. Here, we discuss the most relevant works that develop deep models for egocentric gaze estimation. Zhang et al. [65] used deep models and an adversarial network to predict egocentric gaze location in future video frames. Huang et al. [22] incorporated temporal attention transition into saliency-based models for gaze estimation. Tavakoli et al. [52] studied both top-down and bottom-up cues that contribute to gaze guidance. Park et al. [49] introduced the novel problem of predicting joint attention during social interaction using egocentric videos. Huang et al. [24] collected a new egocentric video dataset and developed a graphical model to detect joint attention. Thakur et al. [53] proposed a multimodal network that uses both video and inertial measurement unit data for more accurate egocentric gaze estimation. Naas et al. [44] developed a tiling scheme for gaze prediction which enables more efficient VR content delivery. Importantly, we are the first to develop a transformer-based architecture to address the problem of egocentric gaze estimation.

Vision Transformer. Recently, vision transformers [14] have demonstrated superior performance on image classification [13,32,40,48,58,62], detection [5,11,12,16], segmentation [8,50,55,66,67], saliency prediction [39,41,42] and video analysis [2,4,15,33,45,56]. In this section, we focus on reviewing previous works that use vision transformers for pixel-wise visual prediction and video understanding. Strudel et al. [50] developed the first transformer-based architecture for semantic segmentation. Cheng et al. [8] further unified semantic, instance, and panoptic segmentation in one transformer architecture. Bertasius et al. [4] proposed TimeSformer for video action recognition. A similar idea is also explored in [2]. Fan et al. [15] designed a multiscale video transformer balancing computational cost and action recognition performance. Ma et al. [42] expanded transformers to visual saliency forecasting by using self-attention to capture the correlation between past and future frames.

scene. Moreover, the scene context captured from egocentric video is complex and rapidly changing, which requires a gaze estimation model that can explicitly reason about the correlation between local visual features and global scene context. In our experiment section, we demonstrate that our proposed GLC module can significantly benefit gaze estimation performance under this challenging setting.

Method
Given an input egocentric video clip with fixed length $T$ and spatial dimension $H \times W$, our goal is to predict the gaze location in each video frame. Following [35], we consider the gaze prediction as a probability distribution defined on the 2D image plane. Fig. 2 presents an overview of our proposed method. We use the recent multi-scale video transformer (MViT) architecture [15] as the backbone network for video representation learning. We extend MViT by designing the Visual Token Embedding Module to generate the spatio-temporal tokens of both local visual patches and global visual context and feed them into the standard Multi-Head Self-Attention Module. We then utilize a novel Global-Local Correlation (GLC) Module to explicitly model the correlation between global and local visual tokens for gaze estimation. Finally, we make use of the Decoder Network to predict the gaze distribution based on the learned video representation from the GLC module.

Transformer Encoder with Global Visual Token Embedding
Visual Token Embedding. We split the input video sequence into non-overlapping patches of size $s_T \times s_H \times s_W$ and adopt a linear mapping function to project each flattened patch into a $D$-dimensional vector space. Following MViT [15], this is equivalent to a convolutional layer with a stride of $s_T \times s_H \times s_W$ and $D$ output channels. This operation results in $N$ tokens, where $N = \frac{T}{s_T} \times \frac{H}{s_H} \times \frac{W}{s_W}$. In addition, a learnable positional embedding $E \in \mathbb{R}^{N \times D}$ is added to the local tokens. Our key insight is to further embed global information into a global visual token using convolutional operations, as illustrated in Fig. 2(a). Since there is a single global token, it does not require positional embedding. In our experiments, we examine multiple strategies to embed the global visual token.

Multi-Head Self-Attention Module. The $N$ local tokens and one global token are fed into a transformer encoder consisting of multiple self-attention blocks. The number of local tokens is downsampled after each self-attention block, while the number of global tokens remains 1. Suppose the input of the $j$-th layer of the encoder is $X_e^{j} \in \mathbb{R}^{(N_j+1) \times D_j}$, where $N_j$ is the number of local tokens, $D_j$ is the vector length of each token, and $x_i^{j}$ denotes the $i$-th token of size $1 \times D_j$. For simplicity, we omit the subscripts and superscripts $j$ and the multi-head operation in the following equations. In each self-attention layer, correlations are calculated for each token pair as shown in Fig. 2(b). After softmax, they are used to reweight the values of each token. Formally, we denote the query, key and value matrices of each self-attention layer in an encoder block as $Q_e$, $K_e$, $V_e \in \mathbb{R}^{(N+1) \times D}$. The self-attention in the transformer encoder is formulated as

$$\mathrm{Attention}(Q_e, K_e, V_e) = \mathrm{Softmax}\left(\frac{Q_e K_e^{\top}}{\sqrt{D}}\right) V_e. \quad (1)$$

Finally, we attach a standard linear layer after the self-attention operation.
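To make the token count concrete, the following minimal sketch evaluates $N$ for one clip size. The strides $s_T = 2$, $s_H = 4$, $s_W = 4$ are the values reported in the training section; the clip size of 8 frames at 224 × 224 pixels and the helper name `num_local_tokens` are assumptions made only for illustration.

```python
def num_local_tokens(T, H, W, s_T=2, s_H=4, s_W=4):
    """Return N = (T / s_T) * (H / s_H) * (W / s_W) for non-overlapping patches."""
    for size, stride in ((T, s_T), (H, s_H), (W, s_W)):
        if size % stride != 0:
            raise ValueError("clip dimensions must be divisible by the patch strides")
    return (T // s_T) * (H // s_H) * (W // s_W)


# Hypothetical clip size (an assumption, not stated in the text):
# 8 frames of 224 x 224 pixels.
n_local = num_local_tokens(T=8, H=224, W=224)
n_total = n_local + 1  # the single global token joins the local tokens
print(n_local, n_total)
```

Under these assumed dimensions the encoder would see 4 × 56 × 56 local tokens plus the one global token at its first layer; later blocks reduce the local count while the global count stays at 1.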

Global-Local Correlation
Even though global information has been explicitly embedded into the global visual token in our model, the transformer encoder treats the global and local tokens equivalently, as shown in Eq. 1 and Fig. 2(b). In this case, global-local correlation is diluted by correlations among the local tokens, limiting its impact on gaze estimation. In order to address this problem, we propose to increase the available capacity to model global-local token interactions. Our solution is a novel Global-Local Correlation module described in Fig. 2(c). Formally, we denote the global token as the first row vector of $X_e$, i.e., $x_1$. Thus $q_1$, $k_1$ and $v_1$ are the query, key and value projected from the global token, respectively. To explicitly model the connection between global and local visual features, we only calculate the correlation between each local token and the global token, i.e., $\mathrm{Correlation}(x_i, x_1)$, as well as its self-correlation, i.e., $\mathrm{Correlation}(x_i, x_i)$. Then correlation scores are normalized by softmax to further re-weight the values. We exploit a suppression matrix [40] $S \in \mathbb{R}^{(N+1) \times (N+1)}$ to suppress the correlation of other tokens, where

$$S_{ij} = \begin{cases} 0, & i = j \ \text{or} \ j = 1, \\ \lambda, & \text{otherwise.} \end{cases}$$

That is, we assign zeros to the diagonal and the first column in $S$ and set a large value $\lambda$ for the other elements. We follow the empirical choice from the implementation of [40] and set $\lambda = 10^{8}$ in our experiments. Formally, the proposed GLC can be formulated as follows:

$$\mathrm{GLC}(Q_e, K_e, V_e) = \mathrm{Softmax}\left(\frac{Q_e K_e^{\top} - S}{\sqrt{D}}\right) V_e.$$

In this way, we keep the values in the first column and on the diagonal, and map them into probability distributions, while values in other positions are nearly "masked out" after the softmax. Residual connections and linear layers are also used in the GLC module as in the regular self-attention block. Finally, the output tokens from the GLC are concatenated with those from the transformer encoder in the channel dimension. We denote the outputs of the GLC and the last encoder block as $X_e^{GLC} \in \mathbb{R}^{(N+1) \times D}$ and $X_e^{SA} \in \mathbb{R}^{(N+1) \times D}$. The concatenation can then be formulated as $X_e = X_e^{SA} \oplus X_e^{GLC} \in \mathbb{R}^{(N+1) \times 2D}$. The fused tokens $X_e$ are subsequently fed into the transformer decoder for gaze estimation.
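The masking behaviour described above can be illustrated with a small pure-Python sketch. It assumes the masked attention takes the form Softmax((Q Kᵀ − S) / √D) V, uses a single head, omits the learned projections, residual connections and linear layers, and places the global token in row 0 (the text indexes it as row 1). The function names are illustrative and are not taken from the authors' code.

```python
import math

LAMBDA = 1e8  # large suppression value, as stated in the text


def suppression_matrix(n_tokens, lam=LAMBDA):
    """S[i][j] is 0 on the diagonal and in the global column (j == 0), lam elsewhere."""
    return [[0.0 if (i == j or j == 0) else lam for j in range(n_tokens)]
            for i in range(n_tokens)]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def glc_attention(Q, K, V):
    """Assumed form: Softmax((Q K^T - S) / sqrt(D)) V, with matrices as lists of rows."""
    n, d = len(Q), len(Q[0])
    S = suppression_matrix(n)
    outputs, weights = [], []
    for i in range(n):
        scores = [(sum(q * k for q, k in zip(Q[i], K[j])) - S[i][j]) / math.sqrt(d)
                  for j in range(n)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(n))
                        for c in range(len(V[0]))])
    return outputs, weights


# Toy tokens: row 0 plays the role of the global token, rows 1-2 are local tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = glc_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy run each local token distributes its attention only between the global token and itself, since the suppressed entries underflow to zero after the softmax; the global token's row has a single unsuppressed entry (its self-correlation).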

Transformer Decoder
To produce the gaze distribution with the desired spatio-temporal resolution, we adopt a decoder to upsample the encoded features. We utilize a transformer decoder based on the multiscale self-attention block of MViT [15]. Suppose each decoder layer takes visual features $X_d \in \mathbb{R}^{THW \times D}$ as inputs and the corresponding query, key and value matrices are $\hat{Q}_d$, $\hat{K}_d$ and $\hat{V}_d$. As shown in Fig. 2(d), we replace the original pooling operation for the query matrix with an upsampling operation implemented with trilinear interpolation and keep the pooling for the key and value matrices. Following [15], $Q_d$ is obtained by applying a deconvolutional operation on $\hat{Q}_d$, while $K_d$ and $V_d$ are obtained by applying convolutional operations on $\hat{K}_d$ and $\hat{V}_d$. Then, the output of self-attention is calculated in the same way as Eq. 1. In addition, we keep the skip connection in the self-attention layers and replace the pooling operation in skip connections with trilinear interpolation, which produces the upsampled output with dimension $T'H'W' \times D'$. Our decoder is composed of 4 decoding blocks. Skip connections are used to combine intermediate features of the encoder with corresponding decoder features. Finally, another linear mapping function is used to output the final gaze prediction.

Network Architecture and Model Training
We adopt MViT [15] as the backbone, with weights initialized from Kinetics-400 pretraining [27]. The token embedding stride is set as $s_T = 2$, $s_H = 4$ and $s_W = 4$, and the embedding dimension is $D = 96$. The encoder is composed of 16 self-attention layers that are divided into 4 blocks. The number of tokens is downsampled at the transition between two blocks. We build the decoder with 4 decoder blocks corresponding to the 4 blocks in the encoder. After getting the raw output from the decoder, softmax is applied on each frame with a temperature $\tau$. This can be formally written as

$$p_{ij} = \frac{\exp(\hat{y}_{ij} / \tau)}{\sum_{i,j} \exp(\hat{y}_{ij} / \tau)},$$

where $\hat{y}_{ij}$ is the logit at location $(i, j)$ from the model and $p_{ij}$ is the probability after softmax. In experiments, $\tau$ is empirically set as 2. We use the KL-divergence loss to capture the difference between labels and predictions. More details of the training parameters can be found in the supplementary.
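As an illustration of the output stage, the sketch below applies the per-frame softmax with temperature τ = 2 to a toy 2 × 2 logit map and evaluates a KL-divergence loss against a label map. The direction KL(label ‖ prediction), the small `eps` stabiliser and the function names are assumptions; the text only states that a KL-divergence loss is used.

```python
import math


def spatial_softmax(logits, tau=2.0):
    """Softmax over all (i, j) locations of one frame with temperature tau."""
    flat = [v / tau for row in logits for v in row]
    m = max(flat)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in flat]
    total = sum(exps)
    probs = [e / total for e in exps]
    width = len(logits[0])
    return [probs[r * width:(r + 1) * width] for r in range(len(logits))]


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) summed over all locations of one frame (assumed direction)."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t_row, p_row in zip(target, pred)
               for t, p in zip(t_row, p_row) if t > 0)


# Toy frame: the model is most confident about the bottom-left location.
logits = [[0.0, 2.0], [4.0, 0.0]]
pred = spatial_softmax(logits, tau=2.0)
label = [[0.0, 0.0], [1.0, 0.0]]  # hypothetical one-hot gaze label
loss = kl_divergence(label, pred)
print(pred, loss)
```

A temperature above 1 flattens the predicted distribution relative to a plain softmax, which is one plausible reason for choosing τ empirically.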

Datasets and Metrics
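We conducted our experiments on two egocentric video datasets with gaze tracking data to serve as ground truth: EGTEA Gaze+ [35] and Ego4D [18]. The EGTEA dataset is captured under the meal preparation setting, which involves a great deal of hand-object interactions. We used the first train/test split from EGTEA in our experiments (8299 clips for training and 2022 clips for testing). The Ego4D dataset includes 27 videos of 80 participants totaling 31 hours with gaze tracking data captured under the social setting. We split the long videos into 5-second video clips and pick clips containing gaze fixation. More details of the selection of clips with gaze fixations are discussed in the supplementary. We used 20 videos (15310 clips) for training and the other 7 videos (5202 clips) for testing. This is the first work that uses the Ego4D dataset for egocentric gaze estimation, and we will make our split publicly available to drive future research on this topic. Following [35,36], we adopt F1 score, recall, and precision as the evaluation metrics.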
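For reference, the three metrics are related by the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN) and F1 = 2 · precision · recall / (precision + recall). The sketch below computes them on flattened binary maps; how the predicted heatmap and the ground-truth gaze are binarised is not specified in this section, so the masks here are hypothetical inputs.

```python
def precision_recall_f1(pred_mask, gt_mask):
    """Precision, recall and F1 for two flattened binary masks of equal length."""
    if len(pred_mask) != len(gt_mask):
        raise ValueError("masks must have the same number of elements")
    tp = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    fp = sum(1 for p, g in zip(pred_mask, gt_mask) if p and not g)
    fn = sum(1 for p, g in zip(pred_mask, gt_mask) if g and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical binarised prediction and ground truth for a 2 x 2 frame.
precision, recall, f1 = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
print(precision, recall, f1)
```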

Experimental Results
The Design Choice of Global Visual Embedding. Our key insight is embedding the global visual information into the transformer architecture for egocentric gaze estimation. Here, we first explore 4 global visual embedding strategies: (a) direct max pooling on the input, (b) max pooling on the unflattened local tokens, (c) convolutional layers on the input and (d) convolutional layers on the unflattened local tokens.
As shown in Table 1, all four global embedding strategies improve the performance of the vanilla MViT model on both the EGTEA dataset and the Ego4D dataset. This result supports our claim that global context is essential for gaze estimation. Among the four embedding strategies, (d) achieves the largest performance improvement on both datasets (+0.9% on EGTEA and +0.8% on Ego4D). This indicates that convolutional layers and the embedded local tokens can facilitate the learning of global context. Thus, we use this strategy in the following experiments. Note that all baseline methods use the same transformer decoder.

Evaluation of Global-Local Correlation. We also evaluate the Global-Local Correlation (GLC) module of our model. As presented in Table 1, our full model, MViT+(d)+GLC, outperforms the baseline MViT by +1.8% on the EGTEA dataset and +2.2% on the Ego4D dataset. Specifically, the GLC module contributes a performance gain of +0.9% on EGTEA Gaze+ and +1.4% on Ego4D (compared to MViT+(d)). This result suggests that the GLC can break down the mathematical equivalence of global and local tokens in regular self-attention, thereby "highlighting" the global-local connection in the learned representation.

Does the Performance Improvement Come from Additional Parameters? It is possible that the performance of our model benefits from additional parameters in the GLC module.
In Table 1, we report the results of another baseline model, where we replace the GLC module with a regular self-attention (SA) layer. Interestingly, the additional SA layer has a minor influence on the overall performance (+0.2% on EGTEA and +0.4% on Ego4D). In contrast, our model outperforms this baseline by +0.7% on EGTEA and +1.0% on Ego4D. This result indicates that the performance boost of our method does not simply come from the additional parameters of GLC. Instead, the explicit modeling of the connection between global and local visual features is the key factor in the performance gain.
Comparison with Previous State-of-the-Art. In addition to these studies to evaluate the components of our model, we compare our approach with prior work. Results are presented in Table 2 and Table 3. Interestingly, the baseline MViT model easily outperforms all previous works that use CNN-based architectures on both the EGTEA dataset and the Ego4D dataset. In addition, on EGTEA Gaze+, our proposed method outperforms the best CNN model by +3.9% on F1, +4.0% on recall and +3.5% on precision. On Ego4D, our method surpasses the best CNN model by +5.6% on F1, +4.5% on recall and +5.5% on precision. These results demonstrate the superiority of using a transformer-based architecture for egocentric gaze estimation as well as the effectiveness and robustness of our proposed method. We note that the improvement of our model is more prominent on Ego4D than on EGTEA Gaze+. We speculate that this is because the Ego4D videos with gaze tracking data are captured under social interaction scenarios that contain interactions with both people and objects, and thus require the model to more heavily consider the global-local connections (e.g., the visual information about a social partner's gesture to an object) to predict the gaze. Another possible reason is that the Ego4D dataset has more samples to train the transformer-based model.

Remarks
Visualization of Predictions. We visualize predictions of our model and other previous methods in Fig. 3. Attention transition [22] tends to overestimate the gaze area, which introduces more uncertainty and ambiguity. I3D-R50 [17] and vanilla MViT [15] architectures run into failure modes when there are multiple objects and people in the scene. In contrast, our model, by explicitly modeling the connection between the global and local visual tokens, more robustly predicts the egocentric gaze distribution from the input video clip.

What has been learned by the Global-Local Correlation module?
We additionally analyze our proposed GLC module empirically. We first calculate the correlation of the global token and each local token, and then normalize the calculated weights into a probability distribution. A higher score suggests that the GLC captures a stronger connection between the particular local token and the global context. We reshape and upsample these weight distributions to form a heatmap, which we overlay on the original input. Since the GLC module applies a multi-head operation, we visualize the results from different heads in Fig. 4. Interestingly, the correlations captured by the GLC heads are quite diverse. We observe that the GLC module does assign different weights to local tokens, thereby capturing the different global-local connections for each token. Another important finding is that some heads learn to attend to background pixels to prevent the model from omitting important scene context. We provide further commentary in the supplementary.
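The visualization procedure can be sketched for a single head and a single frame as follows. The weights are normalized to sum to one, reshaped to the token grid, and upsampled; nearest-neighbour upsampling, the grid size and the function name are assumptions, since the text does not specify the interpolation used.

```python
def correlation_heatmap(global_to_local, grid_h, grid_w, scale):
    """Normalize per-token weights, reshape to (grid_h, grid_w), upsample by `scale`."""
    if len(global_to_local) != grid_h * grid_w:
        raise ValueError("expected one weight per local token")
    total = sum(global_to_local)
    probs = [w / total for w in global_to_local]
    grid = [probs[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
    # Nearest-neighbour upsampling (an assumption made for this sketch).
    return [[grid[y // scale][x // scale] for x in range(grid_w * scale)]
            for y in range(grid_h * scale)]


# Hypothetical global-local correlation weights for a 2 x 2 token grid.
heatmap = correlation_heatmap([1.0, 3.0, 2.0, 2.0], grid_h=2, grid_w=2, scale=2)
for row in heatmap:
    print(row)
```

The resulting map has the resolution of the upsampled grid and can be alpha-blended over the input frame; repeating the procedure per head gives one map per head.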

Conclusion
In this paper, we develop a transformer-based architecture to address the task of estimating the camera wearer's gaze fixation based only on egocentric video frames. Our key insight is that our global visual token embedding strategy, which encodes global visual information into the self-attention mechanism, and our global-local correlation (GLC) module, which explicitly reasons about the connection between global and local visual tokens, facilitate strong representation learning for egocentric gaze estimation. Our experiments on the EGTEA Gaze+ and Ego4D datasets demonstrate the effectiveness of our approach. We believe our work serves as an essential step in analyzing gaze behavior from egocentric videos and provides valuable insight into learning video representations with transformer architectures.

Table 4: Results of action recognition on EGTEA Gaze+. We implemented two methods for classification: adding an additional class token or using global average pooling. "-" means the result is unavailable. The complete models are highlighted.

After generating the output from decoder block4, a convolutional layer is applied only on the local tokens to compress the 8 channels to 1. We then convert this to a probability distribution by applying softmax to each frame.

C Experiments on Action Recognition
In addition to egocentric gaze estimation in the main paper, we also examine the application of our GLC module to the egocentric video action recognition task, and find that our method performs competitively with methods designed specifically for this task on EGTEA Gaze+.
To this end, we remove the decoder in the gaze estimation model and keep only the visual token embedding, transformer encoder, and GLC modules. Generally, there are two ways to obtain the action class prediction: adding a class embedding token at the first layer of the transformer, or pooling across all local tokens to obtain a final embedding. Then a fully-connected layer followed by softmax is used to predict probabilities for each category. We implement both strategies and compare our approaches with previous works in Table 4.
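The two classification heads can be sketched as below: one reads the class token, the other averages the local tokens, and both feed a fully-connected layer followed by softmax. The position of the class token (first row), the toy weights and the function names are assumptions for illustration only.

```python
import math


def class_token_feature(tokens):
    """Use the class token (assumed here to be the first row) as the clip feature."""
    return list(tokens[0])


def mean_pool_feature(local_tokens):
    """Global average pooling: channel-wise mean over all local tokens."""
    n = len(local_tokens)
    return [sum(column) / n for column in zip(*local_tokens)]


def predict_probabilities(feature, weights, biases):
    """Fully-connected layer followed by softmax over action classes."""
    logits = [sum(w * f for w, f in zip(row, feature)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: two local tokens with two channels, three hypothetical classes.
local_tokens = [[1.0, 0.0], [3.0, 2.0]]
fc_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fc_biases = [0.0, 0.0, 0.0]
pooled = mean_pool_feature(local_tokens)
probs = predict_probabilities(pooled, fc_weights, fc_biases)
print(pooled, probs)
```

Only the pooled variant passes every re-weighted local token to the classifier, whereas the class-token variant reads a single row of the token matrix.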
We conduct these experiments only on EGTEA Gaze+ [35] using the same split as gaze estimation. Note that the Ego4D [18] social benchmark does not contain action labels. For vanilla MViT [15], class token embedding performs better than the pooling operation. For both methods, simply adding global embedding has a minor influence on the overall performance (−0.2% on top1 accuracy, −0.5% on top5 accuracy and +1.32% on mean class accuracy when using the class token, and −0.39% on top1 accuracy, −0.19% on top5 accuracy and −1.19% on mean class accuracy when using the pooling layer). This result suggests that simply adding global context as an additional token has a minor influence on the action recognition performance.
In addition, adding our GLC module improves the model performance only by a small margin when using class token embedding to predict action classes. We hypothesize that this is because only the class token is input into the linear layer for final prediction and the re-weighted tokens from GLC are left unused. In contrast, when applying global average pooling on all local tokens, GLC improves top1, top5 and mean class accuracy over the counterpart that doesn't use GLC (MViT+Global Token) by +2.27%, +0.59% and +3.11%, respectively. Gains over the corresponding MViT baseline are +1.88%, +0.4% and +1.92% on the three metrics. These results indicate that our proposed GLC module is a robust and general design that also improves the action recognition performance. However, the impact on action recognition is smaller than on egocentric gaze estimation.
We note that our model achieves a competitive performance for action recognition on  [19] recent state-of-the-art method for this benchmark of 66.5%. We also want to emphasize that we conduct these action recognition experiments to demonstrate the generalization ability of our proposed GLC module rather than aim to produce SOTA results on action recognition. Additionally, we visualize the global-local correlation weights of the GLC in Fig. 5. Importantly, the learned global-local correlation is vastly different from the gaze distribution when the model is trained for action recognition; in contrast, a stronger connection between the learned global-local correlation and gaze distribution can be observed when the model is trained for gaze estimation (see Fig. 8). How to design a weakly-supervised model for egocentric gaze estimation remains an open question.

D Details of Different Global Visual Embedding Strategies
We present further details of the four global visual embedding strategies we studied in Section 4.2 of the main paper. As demonstrated in Fig. 6, (a) implements max pooling on input frames directly, and (b) implements max pooling on local visual tokens. For (c) and (d), we replace the max pooling operations in (a) and (b) with a sequence of convolutional layers. The specific parameters of (d) are detailed in Table 5. For global embedding in (c), input video frames are fed into a convolutional layer that is identical to the layer used for local token embedding (i.e., the kernel is 3 × 7 × 7 and the stride is 2 × 4 × 4). Then, the output is passed to a sequence of convolutional layers identical to (d).
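As a rough illustration of strategy (b), the sketch below forms a global token by taking the channel-wise maximum over all local tokens and places it in the first row, matching the convention that the global token is the first row of the token matrix. Pooling over the entire token grid in one step is an assumption; the pooling window is not specified in this section.

```python
def max_pool_global_token(local_tokens):
    """Channel-wise maximum over all local tokens, giving a single global token."""
    return [max(channel) for channel in zip(*local_tokens)]


# Two hypothetical local tokens with three channels each.
local_tokens = [[0.1, 0.9, -0.3], [0.4, 0.2, 0.5]]
global_token = max_pool_global_token(local_tokens)
tokens = [global_token] + local_tokens  # global token first, then the local tokens
print(tokens)
```

Strategies (c) and (d) would replace the `max` reduction with learned convolutional layers, which this sketch does not attempt to reproduce.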

E More Visualization Examples of Gaze Estimation
More visualizations of gaze prediction of both our model and previous state-of-the-art approaches are presented in Fig. 7. Our proposed model can accurately predict the gaze distribution even when the scene context is very complicated, while the other three approaches may be misled by background objects or produce predictions with too much uncertainty. We provide more examples of GLC visualizations in Fig. 8. The 8 heads capture features of different areas, which is consistent with the examples in the main paper. On the EGTEA Gaze+ dataset, the maps produced by heads 1, 4, 5, and 8 highlight pixels around the gaze point with different uncertainty (which is illustrated by the size of the highlighted area). The other four heads focus on surrounding objects and leave gaze areas unattended. As for the Ego4D data, only head 3 captures the wearers' attention, while the other heads fully focus on the backgrounds in different aspects. This supports our key conclusion in the main paper that our GLC module learns to model human attention by setting different weights from local to global tokens, capturing many facets of scene information (both around the gaze target and in the background) in the multi-headed attention mechanism.

F Future Work
In this paper, we studied the explicit integration of global scene context for egocentric gaze estimation and proposed a novel modeling approach for this problem. We also showed the results of our proposed architecture on egocentric action recognition in this supplementary material to demonstrate our model's generalization ability. Our findings also point to several exciting future research directions:
• Our proposed GLC module has the potential to address other video understanding tasks including visual saliency prediction in third-person video, active object detection, and future forecasting. We plan to study the effect of our method on those tasks in our future work.
• Our modeling work can be expanded to understanding human gaze behavior associated with multiple sensing modalities, especially in the social conversation setting.An exciting future direction is incorporating audio signals into egocentric gaze estimation.
• Our proposed GLC fails to learn the gaze distribution when the model is trained to predict the action labels.How to design a weakly supervised model for egocentric gaze estimation using action labels is an interesting problem.

Figure 1: Example of local correlation and global-local correlation for the task of egocentric gaze estimation (predicting where the camera-wearer is looking using egocentric video alone). The red dot represents the gaze ground truth (from a wearable eye tracker) and the image patch that contains the gaze target has red edges. Global-local correlation models the connections between the global context and each local patch, making it possible to capture, e.g., that the camera wearer and a social partner are pointing at the salient object. In contrast, local-local correlations may not yield an effective representation of the scene context.

Figure 2: Architecture of the proposed model. The model consists of four modules: (a) the Visual Token Embedding Module encodes the input into local tokens and one global token, (b) the Transformer Encoder is composed of multiple regular self-attention and linear layers, (c) the Global-Local Correlation Module models the correlation of global and local tokens, and (d) the Transformer Decoder maps encoded video features from the Transformer Encoder and GLC to the gaze prediction. ⊕ denotes concatenation along the channel dimension.

Figure 3: Visualization of gaze estimation. The first sample is from EGTEA Gaze+ and the second is from Ego4D. Estimated gaze is represented as a heatmap overlaid on input frames. Green dots denote the ground truth gaze location.

Figure 5: Visualization of the eight heads in global-local correlation module for action recognition.

Figure 6: Four different approaches of global visual token embedding.

Figure 7: Visualization of gaze estimation. Both successful cases (in green boxes) and failure cases (in red boxes) of our model are shown. Green dots represent the ground truth.

Table 1: Evaluation of different global embedding approaches and the global-local correlation module. (a)(b)(c)(d) are different global embedding strategies elaborated in Section 4.2. SA and GLC denote regular self-attention and the global-local correlation module, respectively.

Table 2: Comparison with previous methods on EGTEA Gaze+. Our complete model is highlighted. The proposed model outperforms previous approaches by a significant margin.

Table 3: Comparison with previous methods on Ego4D. Our complete model is highlighted. The model shows consistent superiority over other state-of-the-art methods on all metrics.

Table 5: Architecture of the proposed model. Convolutional layers are denoted as Conv(kernel size, output channels). The numbers of input channels of multi-head self-attention are shown in the parentheses of MSA. Dimensions of the hidden layer in multi-layer perceptrons are listed in the parentheses of MLP. In tokenization, local and global tokens are reshaped and concatenated. In global-local correlation, the output is concatenated with its input in the channel dimension. The head only takes local tokens as input.