Vision Transformers with Hierarchical Attention

This paper tackles the high computational/space complexity associated with Multi-Head Self-Attention (MHSA) in vanilla vision transformers. To this end, we propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches as commonly done, and each patch is viewed as a token. Then, the proposed H-MHSA learns token relationships within local patches, serving as local relationship modeling. The small patches are subsequently merged into larger ones, and H-MHSA models the global dependencies for the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. With comparable model complexity, HAT-Net performs favorably against previous backbone networks, thus providing a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.


Introduction
In the last decade, convolutional neural networks (CNNs) have been the go-to architecture in computer vision, owing to their powerful capability in learning representations from images/videos [1][2][3][4][5][6][7][8][9][10][11][12]. Meanwhile, in the field of natural language processing (NLP), the transformer architecture [13] has become the de-facto standard for handling long-range dependencies [14,15]. Transformers rely heavily on self-attention to model global relationships of sequence data. Although global modeling is also essential for vision tasks, the 2D/3D structures of vision data make it less straightforward to apply transformers therein. This predicament was recently broken by Dosovitskiy et al. [16], who applied a pure transformer to sequences of image patches.
Motivated by [16], a large amount of literature on vision transformers has emerged to resolve the problems caused by the domain gap between computer vision and NLP [17][18][19][20][21]. From our point of view, one major problem of vision transformers is that the sequence length of image patches is much longer than that of tokens (words) in NLP applications, leading to high computational/space complexity when computing the Multi-Head Self-Attention (MHSA). Some efforts have been dedicated to resolving this problem. ToMe [22] improves the throughput of existing ViT models [16] by systematically merging similar tokens through a general and lightweight matching algorithm. PVT [19] and MViT [21] downsample the feature maps so that attention is computed over a reduced number of tokens, but at the cost of losing fine-grained details. Swin Transformer [18] computes attention within small windows to model local relationships, and it gradually enlarges the receptive field by shifting windows and stacking more layers. From this point of view, Swin Transformer [18] may still be suboptimal because it works in a similar manner to CNNs and needs many layers to model long-range dependencies [16].
Building upon the discussed strengths of downsampling-based transformers [19,21] and window-based transformers [18], each with its distinctive merits, we aim to harness their complementary advantages. Downsampling-based transformers excel at directly modeling global dependencies but may sacrifice fine-grained details, while window-based transformers effectively capture local dependencies but may fall short in global dependency modeling. As widely accepted, both global and local information is essential for visual scene understanding. Motivated by this insight, our approach seeks to amalgamate the strengths of both paradigms, enabling the direct modeling of both global and local dependencies.
To achieve this, we introduce the Hierarchical Multi-Head Self-Attention (H-MHSA), a novel mechanism that enhances the flexibility and efficiency of self-attention computation in transformers. Our methodology begins by segmenting an image into patches, treating each patch akin to a token [16]. Rather than computing attention across all patches, we further organize these patches into small grids, performing attention computation within each grid. This step is instrumental in capturing local relationships and generating more discriminative local representations. Subsequently, we amalgamate these smaller patches into larger ones and treat the merged patches as new tokens, resulting in a substantial reduction in their number. This enables the direct modeling of global dependencies by calculating self-attention for the new tokens. Ultimately, the attentive features from both local and global hierarchies are aggregated to yield potent features with rich granularities. Notably, as the attention calculation at each step is confined to a small number of tokens, our hierarchical strategy mitigates the computational and space complexity of vanilla transformers. Empirical observations underscore the efficacy of this hierarchical self-attention mechanism, revealing improved generalization results in our experiments.
By simply incorporating H-MHSA, we build a family of Hierarchical-Attention-based Transformer Networks (HAT-Net). To evaluate the efficacy of HAT-Net in scene understanding, we apply HAT-Net to fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. Experimental results demonstrate that HAT-Net performs favorably against previous backbone networks. Note that H-MHSA is based on a very simple and intuitive idea, so H-MHSA is expected to provide a new perspective for the future design of vision transformers.

Related Work
Convolutional neural networks. More than two decades ago, LeCun et al. [23] built the first deep CNN, i.e., LeNet, for document recognition. About ten years ago, AlexNet [1] demonstrated the power of deep CNNs at scale and pushed forward the state of the art of ImageNet classification [24] significantly. Since then, CNNs have become the de-facto standard of computer vision owing to their powerful ability in representation learning. Brilliant achievements have been seen in this direction. VGGNet [2] investigates networks of increasing depth using small (3 × 3) convolution filters. ResNet [3] manages to build very deep networks by resolving the gradient vanishing/exploding problem with residual connections [25]. GoogLeNet [26] presents the inception architecture [27,28] using multiple branches with different convolution kernels. ResNeXt [29] improves ResNet [3] by replacing the 3 × 3 convolution in the bottleneck with a grouped convolution. DenseNets [30] present dense connections, i.e., using the feature maps of all preceding layers as inputs for each layer. MobileNets [31,32] decompose the traditional convolution into a depthwise convolution and a pointwise convolution for acceleration, and an inverted bottleneck is proposed for ensuring accuracy. ShuffleNets [33,34] further decompose the pointwise convolution into a pointwise group convolution and a channel shuffle to reduce computational cost. MnasNet [35] proposes an automated mobile neural architecture search approach to find a model with a good trade-off between accuracy and latency. EfficientNet [36] introduces a compound scaling method to uniformly scale the depth/width/resolution dimensions of an architecture searched in a manner similar to MnasNet [35]. The above advanced techniques are the engines driving the development of computer vision in the last decade. This paper aims at improving feature representation learning by designing new transformers.
Self-attention mechanism. Inspired by the human visual system, the self-attention mechanism is usually adopted to enhance essential information and suppress noisy information. STN [37] presents the spatial attention mechanism through learning an appropriate spatial transformation for each input. Chen et al. [38] proposed the channel attention model and achieved promising results on the image captioning task. Wang et al. [39] explored self-attention in well-known residual networks [3]. SENet [40] applies channel attention to backbone network design and boosts the accuracy of ImageNet classification [24]. CBAM [41] sequentially applies channel and spatial attention for adaptive feature refinement in deep networks. BAM [42] produces a 3D attention map by combining channel and spatial attention. SK-Net [43] uses channel attention to selectively fuse multiple branches with different kernel sizes. Non-local network [44] presents non-local attention for capturing long-range dependencies. ResNeSt [45] is a milestone in this direction. It applies channel attention to different network branches to capture cross-feature interactions and learn diverse representations. Our work shares some similarities with these works by applying self-attention for learning feature representations. The difference is that we propose H-MHSA to learn global relationships, rather than performing a simple feature recalibration with spatial or channel attention as in these works.
Vision transformer. Transformer [13] entirely relies on self-attention to handle long-range dependencies of sequence data. It was first proposed for NLP tasks [14,15]. In order to apply transformers to image data, Dosovitskiy et al. [16] split an image into patches and treated them as tokens. Then, a pure transformer [13] can be adopted. Such a vision transformer (ViT) attains competitive accuracy for ImageNet classification [24]. More recently, lots of efforts have been dedicated to improving ViT. T2T-ViT [46] proposes to split an image into tokens of overlapping patches so as to represent local structures by surrounding tokens. CaiT [47] builds a deeper transformer network by introducing a per-channel weighting and specific class attention. DeepViT [48] proposes Re-attention to re-generate attention maps to increase their diversity at different layers. DeiT [49] presents a knowledge distillation strategy for improving the training of ViT [16]. Srinivas et al. [50] tried to add the bottleneck structure to vision transformers. Some works build pyramid transformer networks to generate multi-scale features [17][18][19][20][21]. PVT [19] adopts a convolution operation to downsample the feature map in order to reduce the sequence length in MHSA, thus reducing the computational load. Similar to PVT [19], MViT [21] utilizes pooling to compute attention on a reduced sequence length. Swin Transformer [18] computes attention within small windows and shifts windows to gradually enlarge the receptive field. CoaT [20] computes attention in the channel dimension rather than in the traditional spatial dimension. ToMe [22] enhances the throughput of existing ViT models [16] without requiring retraining, which is achieved by gradually combining similar tokens in a transformer using a matching algorithm. In this paper, we introduce a novel design to reduce the computational complexity of MHSA and to model both global and local relationships in vision transformers.

Vision MLP networks.
While CNNs and vision transformers have been widely adopted for computer vision tasks, Tolstikhin et al. [51] challenged the necessity of convolutions and attention mechanisms. They introduced the MLP-Mixer architecture, which relies solely on multi-layer perceptrons (MLPs). MLP-Mixer incorporates two types of layers: one applies MLPs independently to image patches, facilitating the mixing of per-location features, and the other applies MLPs across patches, enabling the mixing of spatial information. Despite lacking convolutions and attention, MLP-Mixer demonstrated competitive performance in image classification compared to state-of-the-art models. Liu et al. [52] introduced gMLP, an MLP-based model with gating, showcasing its comparable performance to transformers in crucial language and vision applications. In contrast to other MLP-like models that encode spatial information along flattened spatial dimensions, Vision Permutator [53] uniquely encodes feature representations along the height and width dimensions using linear projections. Wang et al. [54] proposed a novel positional spatial gating unit, leveraging classical relative positional encoding to efficiently capture cross-token relations for token mixing. Despite these advancements, the performance of vision MLP networks still lags behind that of vision transformers. In this paper, we focus on the design of a new vision transformer network.

Methodology
In this section, we first provide a brief review of vision transformers [16] in Sec. 3.1. Then, we present the proposed H-MHSA and analyze its computational complexity in Sec. 3.2. Finally, we describe the configuration details of the proposed HAT-Net in Sec. 3.3.

Review of Vision Transformers
The transformer architecture [13,16] relies heavily on MHSA to model long-range relationships. Suppose X ∈ R^{H×W×C} denotes the input, where H, W, and C are the height, width, and feature dimension, respectively. We reshape X into a sequence of H·W tokens and define the query, key, and value matrices as

Q = XW_q,  K = XW_k,  V = XW_v,  (1)

where W_q ∈ R^{C×C}, W_k ∈ R^{C×C}, and W_v ∈ R^{C×C} are the trainable weight matrices of linear transformations. With a mild assumption that the input and output have the same dimension, the traditional MHSA can be formulated as

A = Softmax(QK^T / √d) V,  (2)

in which √d serves as an approximate normalization, and the Softmax function is applied to the rows of the similarity matrix. Note that we omit the concept of multiple heads here for simplicity. In Eq. (2), the matrix product QK^T first computes the similarity between each pair of tokens. Each new token is then derived as a combination of all tokens according to this similarity. After the computation of MHSA, a residual connection is further added to ease the optimization, like

X' = AW_p + X,  (3)

in which W_p ∈ R^{C×C} is a trainable weight matrix for feature projection. At last, a multilayer perceptron (MLP) is adopted to enhance the representation, like

Y = MLP(X') + X',  (4)

where Y denotes the output of a transformer block.
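For concreteness, the following is a minimal PyTorch-style sketch of the single-head formulation in Eqs. (1)-(4). The class and variable names are ours; multi-head splitting, normalization layers, and dropout are omitted, and the MLP expansion ratio of 4 is a common default rather than a value taken from Table 1.

```python
import torch
import torch.nn as nn

class SimpleSelfAttention(nn.Module):
    """Single-head self-attention following Eqs. (1)-(3); multi-head logic omitted."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)  # W_k
        self.w_v = nn.Linear(dim, dim, bias=False)  # W_v
        self.w_p = nn.Linear(dim, dim, bias=False)  # W_p, projection in Eq. (3)
        self.scale = dim ** -0.5                    # approximate 1 / sqrt(d)

    def forward(self, x):            # x: (B, N, C) with N = H * W tokens
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, N, N) pairwise similarities
        attn = attn.softmax(dim=-1)                     # row-wise Softmax, Eq. (2)
        a = attn @ v                                    # attentive feature A
        return self.w_p(a) + x                          # projection + residual, Eq. (3)

class TransformerBlock(nn.Module):
    """Attention followed by an MLP with a residual connection, Eq. (4)."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.attn = SimpleSelfAttention(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        x = self.attn(x)
        return self.mlp(x) + x
```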
It is easy to infer that the computational complexity of MHSA (Eq. (2)) is

Ω(MHSA) = O(H^2W^2C + HWC^2).  (5)

Similarly, the space complexity (memory consumption) also includes the term of O(H^2W^2). As commonly known, O(H^2W^2) can become very large for high-resolution inputs; for example, a 1/4-scale feature map of a 224 × 224 image already contains 56 × 56 = 3136 tokens, so the attention matrix has roughly 9.8 million entries per head. This limits the applicability of transformers to vision tasks. Motivated by this, we aim at improving MHSA to reduce such complexity while maintaining the capacity of global relationship modeling without the risk of sacrificing performance.

Hierarchical Multi-Head Self-Attention
In this section, we present an approach to alleviate the computational and space demands associated with Eq. (2) through the utilization of our proposed H-MHSA mechanism. Rather than computing attention over the entire input, we adopt a hierarchical strategy, allowing each step to process only a limited number of tokens. The initial step concentrates on local attention computation.

Fig. 1 Illustration of the proposed HAT-Net. GAP: global average pooling; FC: fully-connected layer. ×L_i means that the transformer block is repeated L_i times. H and W denote the height and width of the input image, respectively.
Table 1 Network configurations of HAT-Net. The settings of building blocks are shown in brackets, with the number of blocks stacked. For the first stage, each convolution has C channels and a stride of S. For the other four stages, each MLP uses a K × K DW-Conv and an expansion ratio of E. Note that we omit the downsampling operation after the t-th stage (t = {2, 3, 4}) for simplicity. "#Param" refers to the number of parameters.

Assuming the input feature map is denoted as X ∈ R^{H×W×C}, we partition the feature map into small grids of size G_1 × G_1 and reshape it as follows:

X = Reshape(X, (H/G_1 × W/G_1) × (G_1 × G_1) × C).  (6)

The query, key, and value are then calculated by

Q_1 = XW_q,  K_1 = XW_k,  V_1 = XW_v,  (7)

where W_q, W_k, W_v ∈ R^{C×C} are trainable weight matrices. Subsequently, Eq. (2) is applied to generate the local attentive feature A_1. To ease network optimization, we reshape A_1 back to the shape of H × W × C and incorporate a residual connection:

A_1 = Reshape(A_1, H × W × C),  (8)
A_1 = A_1 + X.  (9)

As the local attentive feature A_1 is computed within each small G_1 × G_1 grid, a substantial reduction in computational and space complexity is achieved.

The second step focuses on global attention calculation. Here, we downsample A_1 by a factor of G_2 during the computation of the key and value matrices. This downsampling enables efficient global attention calculation, treating each G_2 × G_2 grid as a token. This process can be expressed as

Â_1 = AvePool_{G_2}(A_1),  (10)

where AvePool_{G_2}(·) denotes downsampling a feature map by G_2 times using average pooling with both the kernel size and the stride set to G_2. Consequently, we have Â_1 ∈ R^{(H/G_2)×(W/G_2)×C}. We then reshape A_1 and Â_1 as follows:

A_1 = Reshape(A_1, (H×W) × C),  Â_1 = Reshape(Â_1, (HW/G_2^2) × C).  (11)

Following this, we compute the query, key, and value as

Q_2 = A_1W_q ∈ R^{(H×W)×C},  K_2 = Â_1W_k ∈ R^{(HW/G_2^2)×C},  V_2 = Â_1W_v ∈ R^{(HW/G_2^2)×C}.  (12)

Subsequently, Eq. (2) is called to obtain the global attentive feature A_2 ∈ R^{(H×W)×C}, followed by a reshaping operation:

A_2 = Reshape(A_2, H × W × C).  (13)

The final output of H-MHSA is given by

Y = (A_1 + A_2)W_p,  (14)

where W_p has the same meaning as in Eq. (3). In this way, H-MHSA effectively models both local and global relationships, akin to vanilla MHSA. The computational complexity of H-MHSA can be expressed as

Ω(H-MHSA) = O(HWG_1^2C + H^2W^2C/G_2^2 + HWC^2).  (15)

Compared to Eq. (5), this represents a reduction of the quadratic attention term from O(H^2W^2) to O(HWG_1^2 + H^2W^2/G_2^2). The same conclusion can be easily derived for the space complexity.
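To make the two-step computation concrete, below is a minimal, single-head PyTorch sketch of H-MHSA following Eqs. (6)-(14). The class and variable names are ours; multi-head splitting, normalization, and padding for feature sizes not divisible by G_1 or G_2 are omitted, and sharing the projection matrices between the local and global steps is a simplification of this sketch rather than a statement about the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v, scale):
    """Eq. (2): row-wise Softmax of scaled similarities, then value aggregation."""
    attn = (q @ k.transpose(-2, -1)) * scale
    return attn.softmax(dim=-1) @ v

class HMHSA(nn.Module):
    """Hierarchical self-attention sketch: local grid attention + pooled global attention."""
    def __init__(self, dim, g1=8, g2=8):
        super().__init__()
        self.g1, self.g2 = g1, g2
        self.scale = dim ** -0.5
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_p = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                      # x: (B, H, W, C); H, W divisible by g1 and g2
        b, h, w, c = x.shape
        g1, g2 = self.g1, self.g2

        # Step 1: local attention within each g1 x g1 grid (Eqs. (6)-(9)).
        x1 = x.view(b, h // g1, g1, w // g1, g1, c).permute(0, 1, 3, 2, 4, 5)
        x1 = x1.reshape(b * (h // g1) * (w // g1), g1 * g1, c)
        a1 = attention(self.w_q(x1), self.w_k(x1), self.w_v(x1), self.scale)
        a1 = a1.view(b, h // g1, w // g1, g1, g1, c).permute(0, 1, 3, 2, 4, 5)
        a1 = a1.reshape(b, h, w, c) + x        # reshape back and add the residual

        # Step 2: global attention with keys/values average-pooled by g2 (Eqs. (10)-(13)).
        pooled = F.avg_pool2d(a1.permute(0, 3, 1, 2), kernel_size=g2, stride=g2)
        pooled = pooled.permute(0, 2, 3, 1).reshape(b, (h // g2) * (w // g2), c)
        q2 = self.w_q(a1.reshape(b, h * w, c))
        a2 = attention(q2, self.w_k(pooled), self.w_v(pooled), self.scale)

        # Aggregate local and global attentive features and project (Eq. (14)).
        out = self.w_p(a1.reshape(b, h * w, c) + a2)
        return out.reshape(b, h, w, c)
```

In this sketch, each local attention operates over only G_1^2 tokens per grid, and the global attention attends to HW/G_2^2 pooled tokens, which is exactly where the complexity reduction in Eq. (15) comes from.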
We continue by comparing H-MHSA with existing vision transformers, highlighting distinctive features. Swin Transformer [18] focuses on modeling local relationships, progressively expanding the receptive field through shifted windows and additional layers. Conversely, PVT [19] prioritizes global relationships through downsampling the key and value matrices but overlooks local information. In contrast, our proposed H-MHSA excels by concurrently capturing both local and global relationships. While Swin Transformer employs a fixed window size (i.e., a fixed-size bias matrix), and PVT uses a constant downsampling ratio (i.e., a convolution with the kernel size equal to the stride), these approaches necessitate retraining on the ImageNet dataset [24] for any reparameterization. In contrast, the parameter-free nature of G_1 and G_2 in H-MHSA allows flexible configuration adjustments for downstream vision tasks without the need for retraining on ImageNet.
In computer vision, achieving a comprehensive understanding of scenes relies on the simultaneous consideration of both global and local information. Within the framework of our proposed H-MHSA, the global self-attention calculation (Eqs. (10)-(13)) is instrumental in establishing the foundation for scene interpretation, enabling the recognition of overarching patterns and aiding in high-level decision-making processes. Concurrently, the local self-attention calculation (Eqs. (6)-(9)) is crucial for refining the understanding of individual components within the larger context, facilitating more detailed and nuanced scene analysis. H-MHSA excels in striking the delicate balance between global and local information, thereby facilitating a nuanced and accurate comprehension of diverse scenes. In essence, the seamless integration of global and local self-attention within the H-MHSA framework empowers transformers to navigate the intricacies of scene understanding, facilitating context-aware decision-making.
The overall architecture of HAT-Net is illustrated in Fig. 1. At the beginning of HAT-Net, instead of flattening image patches [16], we apply two sequential vanilla 3 × 3 convolutions, each with a stride of 2, to downsample the input image to the 1/4 scale. Then, we stack H-MHSA and MLP blocks alternately, which can be divided into four stages with pyramid feature scales of 1/4, 1/8, 1/16, and 1/32, respectively. For feature downsampling at the end of each stage, a vanilla 3 × 3 convolution with a stride of 2 is used. The configuration details of HAT-Net are summarized in Table 1. We provide four versions of HAT-Net: HAT-Net-Tiny, HAT-Net-Small, HAT-Net-Medium, and HAT-Net-Large, whose numbers of parameters are similar to those of ResNet18, ResNet50, ResNet101, and ResNet152 [3], respectively. We only adopt simple parameter settings without careful tuning to demonstrate the effectiveness and generality of HAT-Net. The dimension of each head in the multi-head setting is set to 48 for HAT-Net-Tiny and 64 for the other versions.
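As a rough illustration of this layout, a hedged sketch of the convolutional stem and the inter-stage downsampling is given below; the normalization/activation choices and channel widths are placeholders of ours, not the actual Table 1 configurations.

```python
import torch.nn as nn

def conv_stem(in_channels=3, embed_dim=64):
    """Two sequential 3x3 stride-2 convolutions that bring the input image to the 1/4 scale.
    The BatchNorm/GELU layers and channel widths here are illustrative assumptions."""
    return nn.Sequential(
        nn.Conv2d(in_channels, embed_dim // 2, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(embed_dim // 2),
        nn.GELU(),
        nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(embed_dim),
        nn.GELU(),
    )

def stage_downsample(in_channels, out_channels):
    """A vanilla 3x3 stride-2 convolution used between stages (1/8, 1/16, and 1/32 scales)."""
    return nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1)
```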
To enhance the applicability of HAT-Net across diverse vision tasks, we present guidelines for configuring the parameter-free G_1 and G_2. While established models like Swin Transformer [18] adhere to a fixed window size of 7, and PVT [19] employs a set of constant downsampling ratios {8, 4, 2} for the t-th stage (t = {2, 3, 4}), we advocate for certain adjustments. Practically, we find that a window size of 8 is more pragmatic than 7, given that input resolutions often align with multiples of 8. Moreover, augmenting the downsampling ratios serves to mitigate computational complexity. Consequently, for image classification on the ImageNet dataset [24], where the standard input resolution is 224 × 224 pixels, we designate G_1 = {8, 7, 7} and G_2 = {8, 4, 2} for the t-th stage (t = {2, 3, 4}). Here, a window size of 7 is necessitated by the chosen resolution, and the small downsampling rates are in line with the approach taken by PVT [19]. In scenarios involving downstream tasks like semantic segmentation, object detection, and instance segmentation, where input resolutions tend to be larger, we opt for G_1 = {8, 8, 8} for convenience and G_2 = {16, 8, 4} to curtail computational expenses. For a comprehensive analysis of the impact of different G_1 and G_2 settings, we conduct an ablation study in Sec. 4.4.
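These recommended settings can be summarized in a simple lookup table that downstream code might consult; the dictionary layout and names below are ours.

```python
# Grid size G1 (local attention) and downsampling ratio G2 (global attention)
# for the t-th stage, t = 2, 3, 4; the last stage uses vanilla MHSA directly.
HMHSA_SETTINGS = {
    # 224 x 224 ImageNet classification: stage feature maps of 56/28/14 pixels,
    # where 28 and 14 are not divisible by 8, hence the grid size of 7.
    "classification": {"G1": (8, 7, 7), "G2": (8, 4, 2)},
    # Higher-resolution dense prediction (segmentation/detection): larger G2 cuts cost.
    "dense_prediction": {"G1": (8, 8, 8), "G2": (16, 8, 4)},
}
```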

Experiments
To show the superiority of HAT-Net in feature representation learning, this section evaluates HAT-Net on image classification, semantic segmentation, object detection, and instance segmentation.

Image Classification
Experimental setup. The ImageNet dataset [24] consists of 1.28M training images and 50K validation images from 1000 categories. We adopt the training set to train our networks and the validation set to test the performance. We implement HAT-Net using the popular PyTorch framework [60]. For a fair comparison, we follow the same training protocol as DeiT [49], which is the standard protocol for training transformer networks nowadays. Specifically, the input images are randomly cropped to 224 × 224 pixels, followed by random horizontal flipping and mixup [61] for data augmentation. Label smoothing [27] is used to avoid overfitting. The AdamW optimizer [62] is adopted with a momentum of 0.9, a weight decay of 0.05, and a mini-batch size of 128 per GPU by default. The initial learning rate is set to 1e-3, which decreases following the cosine learning rate schedule [63]. The training process lasts for 300 epochs on eight NVIDIA Tesla V100 GPUs. Note that for ablation studies, we utilize a mini-batch size of 64 and 100 training epochs to save training time. Moreover, we set G_1 = {8, 7, 7} and G_2 = {8, 4, 2} for the t-th stage (t = {2, 3, 4}), respectively. The fifth stage can be processed directly using the vanilla MHSA mechanism. For model evaluation, we apply a center crop of 224 × 224 pixels on validation images to evaluate the recognition accuracy. We report the top-1 classification accuracy on the ImageNet validation set [24] as well as the number of parameters and the number of FLOPs for each model.
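A condensed sketch of this recipe in plain PyTorch is given below. It covers the optimizer, cosine schedule, label smoothing, and a simple mixup step; the function names are ours, the mixup alpha of 0.2 and the label-smoothing factor of 0.1 are assumed common defaults, and the full DeiT augmentation pipeline as well as the model and data-loader construction are omitted.

```python
import numpy as np
import torch
import torch.nn as nn

def mixup(images, targets, alpha=0.2):
    """Mixup [61]: convexly combine random pairs of samples; alpha = 0.2 is an assumed value."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(images.size(0))
    return lam * images + (1.0 - lam) * images[perm], targets, targets[perm], lam

def train_classification(model, train_loader, epochs=300):
    """Sketch of the DeiT-style recipe described above: AdamW, cosine schedule, mixup."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                                  weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing [27]; 0.1 assumed
    for _ in range(epochs):
        for images, targets in train_loader:  # 224 x 224 random crops with flipping, assumed
            mixed, t_a, t_b, lam = mixup(images, targets)
            logits = model(mixed)
            loss = lam * criterion(logits, t_a) + (1.0 - lam) * criterion(logits, t_b)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```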
Table 2 Comparison to state-of-the-art methods on the ImageNet validation set [24]. "*" indicates the performance of a method using the default training setting in the original paper. "#Param" and "#FLOPs" refer to the number of parameters and the number of FLOPs, respectively. "†" marks models that use the input size of 384 × 384; otherwise, models use the input size of 224 × 224.

Table 3 Experimental results on the ADE20K validation dataset [68] for semantic segmentation. We replace the backbone of Semantic FPN [69] with various network architectures. The number of FLOPs is calculated with the input size of 512 × 512.

Experimental results. We compare HAT-Net with state-of-the-art network architectures in Table 2, including CNN-based ones like ResNet [3] as well as recent vision transformer networks. We can observe that HAT-Net achieves state-of-the-art performance. Specifically, with similar numbers of parameters and FLOPs, HAT-Net-Tiny, HAT-Net-Small, HAT-Net-Medium, and HAT-Net-Large outperform the second-best results by 1.1%, 0.6%, 0.8%, and 0.6% in terms of top-1 accuracy, respectively. Since the performance for image classification implies the ability of a network to learn feature representations, the above comparison suggests that the proposed HAT-Net has great potential for generic scene understanding.

Semantic Segmentation
Experimental setup. We continue by applying HAT-Net to a fundamental downstream vision task, semantic segmentation, which aims at predicting a class label for each pixel in an image. Experiments are conducted on the ADE20K dataset [68], where we replace the backbone of Semantic FPN [69] with HAT-Net. The networks are optimized using AdamW [62] with a weight decay of 1e-4. We apply the poly learning rate schedule with γ = 0.9 and an initial learning rate of 1e-4. During training, the batch size is 16, and each image has a resolution of 512 × 512 through resizing and cropping. During testing, each image is resized to the shorter side of 512 pixels, without multi-scale testing or flipping. We adopt the well-known MMSegmentation toolbox [73] for the above experiments. We set G_1 = {8, 8, 8} and G_2 = {16, 8, 4} for the t-th stage (t = {2, 3, 4}), respectively.
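The poly schedule mentioned above can be written as a small helper; a minimal sketch using PyTorch's LambdaLR, where the total iteration count is an assumption that depends on the dataset size and batch size.

```python
import torch

def poly_lr_scheduler(optimizer, max_iters, gamma=0.9):
    """Poly policy: lr = base_lr * (1 - iter / max_iters) ** gamma, stepped per iteration."""
    return torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda it: (1.0 - it / max_iters) ** gamma
    )

# Usage sketch: AdamW with an initial learning rate of 1e-4 and a weight decay of 1e-4.
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# scheduler = poly_lr_scheduler(optimizer, max_iters=80_000)  # iteration count is an assumption
```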

Experimental results.
The results are depicted in Table 3. We compare with typical CNN networks, i.e., ResNets [3] and ResNeXts [29], and transformer networks, i.e., Swin Transformer [18], PVT [19], PVTv2 [64], and Twins-SVT [67]. As can be observed, the proposed HAT-Net achieves significantly better performance than previous competitors. Specifically, HAT-Net-Tiny, HAT-Net-Small, HAT-Net-Medium, and HAT-Net-Large attain 1.9%, 0.4%, 1.9%, and 0.7% higher mIoU than the second-best results with similar numbers of parameters and FLOPs. This demonstrates the superiority of HAT-Net in learning effective feature representations for dense prediction tasks.

Object Detection and Instance Segmentation
Experimental setup. Since object detection and instance segmentation are also fundamental downstream vision tasks, we apply HAT-Net to both for further evaluating its effectiveness. Specifically, we utilize two well-known detectors, i.e., RetinaNet [70] for object detection and Mask R-CNN [5] for instance segmentation. HAT-Net is compared with some well-known CNN and transformer networks by only replacing the backbone of the above two detectors. Experiments are conducted on the large-scale MS-COCO dataset [71] by training on the train2017 set (∼118K images) and evaluating on the val2017 set (5K images).
Table 5 Ablation studies for the hierarchical attention in HAT-Net. The configuration of HAT-Net-Small is adopted for all experiments. "✔" indicates that we replace the window attention [18] with the hierarchical attention at the i-th stage. "Top-1 Acc" is the top-1 accuracy on the ImageNet validation dataset [24]. "mIoU" is the mean IoU for semantic segmentation on the ADE20K dataset [68].

We adopt the MMDetection toolbox [74] for experiments and follow the experimental settings of PVT [19] for a fair comparison. During training, we initialize the backbone weights with the ImageNet-pretrained models. The detectors are fine-tuned using the AdamW optimizer [62] with an initial learning rate of 1e-4 that is decreased by 10 times after the 8th and 11th epochs, respectively. The whole training lasts for 12 epochs with a batch size of 16. Each image is resized to a shorter side of 800 pixels, while the longer side is not allowed to exceed 1333 pixels. We set G_1 = {8, 8, 8} and G_2 = {16, 8, 4} for the t-th stage (t = {2, 3, 4}), respectively.
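The decay at the 8th and 11th epochs of this 12-epoch (1x) schedule corresponds to a standard multi-step policy; a minimal sketch, assuming the detector already wraps the ImageNet-pretrained backbone:

```python
import torch

def detection_lr_scheduler(optimizer):
    """Divide the learning rate by 10 after the 8th and 11th epochs of the 12-epoch schedule."""
    return torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

# Usage sketch: the detector (RetinaNet or Mask R-CNN) contains the pretrained backbone.
# optimizer = torch.optim.AdamW(detector.parameters(), lr=1e-4)
# scheduler = detection_lr_scheduler(optimizer)  # scheduler.step() is called once per epoch
```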

Experimental results.
The results are displayed in Table 4. As can be seen, HAT-Net substantially improves the accuracy over other network architectures with a similar number of parameters. Twins-SVT [67] combines the advantages of PVT [19] and Swin Transformer [18] by alternately stacking their basic blocks. When RetinaNet [70] is adopted as the detector, HAT-Net-Small attains 1.8%, 1.6%, and 1.8% higher results than Twins-SVT-S [67] in terms of AP, AP_50, and AP_75, respectively. Correspondingly, the larger HAT-Net variant gets 1.0%, 0.5%, and 1.5% higher results than Twins-SVT-B [67]. With Mask R-CNN [5] as the detector, HAT-Net-Large achieves 2.2%, 1.7%, and 2.8% higher results than Twins-SVT-B [67] in terms of the bounding box metrics AP^b, AP^b_50, and AP^b_75, respectively. HAT-Net-Large also achieves 1.6%, 2.0%, and 1.8% higher results than Twins-SVT-B [67] in terms of the mask metrics AP^m, AP^m_50, and AP^m_75, respectively. Such significant improvements in object detection and instance segmentation show the superiority of HAT-Net in learning effective feature representations.

Ablation Studies
In this part, we evaluate various design choices of the proposed HAT-Net. As discussed above, we only train all ablation models for 100 epochs to save training time. The batch size and learning rate are also reduced by half accordingly. HAT-Net-Small is adopted for these ablation studies.
Effect of the proposed H-MHSA. Starting from a transformer network based on window attention [18], we gradually replace the window attention with our proposed H-MHSA at different stages. The results are summarized in Table 5.
Since the feature map at the fifth stage is small enough for directly computing MHSA, the fifth stage is excluded from Table 5. Note that the first stage of HAT-Net consists only of convolutions, so it is also excluded. From Table 5, we can observe that the performance for both image classification and semantic segmentation is improved when more stages adopt H-MHSA. This verifies the effectiveness of the proposed H-MHSA in feature representation learning. It is interesting to find that the usage of H-MHSA at the fourth stage leads to a more significant improvement than at other stages. Intuitively, the fourth stage has the most transformer blocks, so changes at this stage would lead to more significant effects.
A pure transformer version of HAT-Net vs. PVT [19].
When we remove all depthwise separable convolutions from HAT-Net and train the resulting transformer network for 100 epochs, it achieves 77.7% top-1 accuracy on the ImageNet validation set [24]. In contrast, the well-known transformer network, PVT [19], attains 75.8% top-1 accuracy under the same condition. This suggests that our proposed H-MHSA is very effective in feature representation learning.
Settings of G 1 and G 2 .
In HAT-Net, the parameters G_1 and G_2 play pivotal roles, controlling the grid sizes for local attention calculation and the downsampling rates for global attention calculation, respectively. In this evaluation, we assess the model's performance under various configurations of G_1 and G_2. By default, for tasks such as object detection and instance segmentation, we employ G_1 = {8, 8, 8} and G_2 = {16, 8, 4} for the t-th stage (t = {2, 3, 4}), respectively. Subsequently, we systematically vary G_1 and G_2, evaluating the performance of Mask R-CNN [5] with HAT-Net-Small as the backbone. The evaluation results, conducted on the MS-COCO val2017 dataset [71], are presented in Table 6. The findings indicate that HAT-Net demonstrates robustness across different G_1 and G_2 settings. Notably, altering G_1 from its default {8, 8, 8} configuration has only a marginal impact, resulting in a slight performance reduction. Similarly, adjusting the values of G_2 yields a trade-off: decreasing them enhances performance at the expense of increased computational cost, while increasing them reduces computational cost at the cost of slightly degraded performance. Our default choice of G_1 = {8, 8, 8} and G_2 = {16, 8, 4} strikes a favorable balance between accuracy and efficiency, offering a practical configuration for general use.

Conclusion
This paper addresses the inefficiency inherent in vanilla vision transformers due to the elevated computational and space complexity associated with MHSA. In response to this challenge, we introduce a novel hierarchical framework for MHSA computation, denoted as H-MHSA, aiming to alleviate the computational and space demands. Compared to existing approaches in this domain, such as PVT [19] and Swin Transformer [18], H-MHSA distinguishes itself by directly capturing both global dependencies and local relationships. Integrating the proposed H-MHSA, we formulate the HAT-Net family, showcasing its prowess through comprehensive experiments spanning image classification, semantic segmentation, object detection, and instance segmentation. Our results affirm the efficacy and untapped potential of HAT-Net in advancing representation learning.
Applications of HAT-Net. The versatility of HAT-Net extends its utility across diverse real-world scenarios and downstream vision tasks. As a robust backbone network for feature extraction, HAT-Net seamlessly integrates with existing prediction heads and decoder networks, enabling proficient execution of various scene understanding tasks. Furthermore, HAT-Net's adaptability to different input resolutions and computational resource constraints is facilitated by the flexible adjustment of parameters, specifically G_1 and G_2. Users can tailor HAT-Net to their specific requirements, selecting from different HAT-Net versions to align with their objectives.
In conclusion, HAT-Net not only presents a pragmatic solution to the limitations of vanilla vision transformers but also opens avenues for innovation in the future design of such architectures. The simplicity of the proposed H-MHSA underscores its potential as a transformative element in the evolving landscape of vision transformer development.

Table 6 Ablation studies for the settings of G_1 and G_2 in HAT-Net. The performance assessment is conducted using Mask R-CNN [5] with HAT-Net-Small as the backbone. Evaluation results are reported on the MS-COCO val2017 dataset [71].