Abstract
In comparison to convolutional neural networks (CNNs), the recently introduced vision transformer (ViT) has demonstrated impressive results in human pose estimation (HPE). However, (1) its complexity grows quadratically with image size, which makes the traditional ViT unsuitable for scaling, and (2) the attention process in the transformer encoder and decoder adds substantial computational cost to the detector's overall processing time. Motivated by this, we propose Going shallow and deeper with vIsion Transformers for human Pose estimation (GITPose), which uses no CNN backbone for feature extraction. In particular, we introduce a hierarchical transformer in which multilayer perceptrons encode rich local feature tokens in the initial stages (i.e., shallow), self-attention modules encode long-term relationships in the deeper layers (i.e., deeper), and a decoder performs keypoint detection. In addition, we offer a learnable deformable token association (DTA) module that non-uniformly and dynamically combines informative keypoint tokens. Comprehensive evaluation on the COCO and MPII benchmark datasets shows that GITPose achieves competitive average precision (AP) on pose estimation compared with state-of-the-art approaches.
Introduction
The transformer architecture [1] has brought a major leap forward over convolutional neural networks in computer vision. Building on it, the detection transformer (DETR) [2] was presented as the first fully end-to-end object detection model. It uses transformers to produce the final set of predictions directly, without additional post-processing. Convergence, however, requires an extended training period. For instance, while the widely known Faster R-CNN model [3] needs only 30 epochs to converge, DETR demands 500 epochs, which usually takes 10 days on eight V100 GPUs. Such training costs become prohibitive in large applications. As a result, accelerating the learning process so that transformer-based detectors converge rapidly is a difficult research challenge. In contrast, the vision transformer (ViT) [4], which feeds a standard transformer with a sequence of embedded image patches, was the first convolution-free transformer to match the performance of convolutional networks. However, ViT requires extremely large training datasets, such as ImageNet21K [5] and JFT300M [6]. DeiT [7] then demonstrated how data augmentation and network regularization can be used to train wide ViT models with less data. Despite its capabilities, ViT has two significant shortcomings: (1) its complexity grows quadratically with image size, which makes the traditional ViT unsuitable for scaling, and (2) the attention process in the transformer encoder and decoder adds substantial computational cost to the detector's overall processing time. Since then, ViT has sparked numerous attempts to improve its efficiency in a variety of ways [8,9,10,11].
Such a high computational cost makes it challenging to apply transformers to the human pose estimation (HPE) [12,13,14] task.
In this work, we propose going shallow and deeper with vision transformers strictly for 2D human pose estimation (GITPose) to address this challenge. HPE aims to locate anatomical keypoints such as the shoulder, knee, neck, and head in images or video snippets, and it depends heavily on both visual information and the constraint relationships between keypoints. It is a critical topic in computer vision that has drawn considerable interest from the academic and business communities. Deep convolutional neural networks (DCNNs) have demonstrated their capability for human pose estimation. Roughly, there are two main paradigms: (1) regression of keypoint positions [15, 16] and (2) keypoint heatmap estimation followed by selection of the location with the greatest score [17,18,19]. The former treats pose prediction as a joint position regression problem and regresses the coordinates of each joint keypoint directly. The latter applies a 2D Gaussian kernel at every keypoint to create ground truth heatmaps and uses these heatmaps to supervise the prediction with an L2 loss. Because heatmap regression is easier to implement than keypoint regression and also delivers high performance, it is the basis for the majority of state-of-the-art (SOTA) approaches.
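The heatmap paradigm can be illustrated with a minimal NumPy sketch. It is not the training code of this paper: the heatmap size (64 × 48), the Gaussian width `sigma`, and the function names are assumptions chosen for illustration only.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Render one ground-truth heatmap: a 2D Gaussian centred on keypoint (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def l2_heatmap_loss(pred, target):
    """Mean squared (L2) error between a predicted and a ground-truth heatmap."""
    return float(np.mean((pred - target) ** 2))

def decode_peak(heatmap):
    """Return the (x, y) cell with the greatest score."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(x), int(y)

# Hypothetical keypoint at x=20, y=30 on a 64 x 48 heatmap.
target = gaussian_heatmap(64, 48, cx=20, cy=30)
peak = decode_peak(target)
loss_perfect = l2_heatmap_loss(target, target)
loss_blank = l2_heatmap_loss(np.zeros_like(target), target)
```

A network trained this way predicts one such map per keypoint, and the keypoint location is read off as the highest-scoring cell.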
Motivated by the efficacy of recent transformer methods for 2D human pose estimation [12, 20,21,22,23,24,25], the pyramid structure [26] of CNNs, and current hierarchical vision transformers (HVTs) [27,28,29], which split transformer blocks into several stages and shrink the feature maps as the network gets deeper, we propose GITPose, a purely convolution-free base model that integrates image patches of various sizes to generate more powerful visual keypoints for human pose estimation. Unlike previous counterparts, our technique processes keypoint tokens with shallow blocks in the initial stages. This limits the enormous computational cost and memory footprint that self-attention incurs on high-resolution feature maps. In addition, using self-attention (SA) in the subsequent stages to capture long-term dependencies (deeper) is efficient because the pyramid progressively compresses the token sequence. Our primary objective is to design a simple and straightforward feature-fusion hierarchical vision transformer (HVT) [28] architecture and to use it for effective keypoint representation, a problem that, to our knowledge, has not been addressed. Our extensive experiments indicate that such a simple network design strikes a balance between model performance and efficiency.
The most significant contributions are as follows:
- We propose GITPose for 2D HPE, which is one of the first studies to utilize a vision transformer to extract feature representations.
- GITPose introduces a hierarchical transformer in which we utilize MLPs to encode deep local feature tokens in the initial stages, whereas self-attention (SA) components are employed to encode longer relationships in the deeper stages, and a decoder layer for keypoint detection.
- We also propose a novel deformable token association (DTA) module that flexibly fuses more valuable and informative keypoint tokens to provide hierarchical representations with increased transformation modeling capacity.
- Based on several experiments, the GITPose architecture outperforms its current keypoint detection counterparts, with or without a convolutional neural network (CNN) backbone, and obtains new SOTA results on the two benchmark datasets, MS-COCO and MPII.
Related works
Transformers in vision
Transformers were first used for machine translation [30], and they have since significantly increased the performance of machine learning techniques on a variety of natural language processing problems [31,32,33]. The use of transformers in computer vision tasks, either in conjunction with or as an alternative to convolutional neural networks (CNNs), has been shown to be highly effective [29]. Notably, the vision transformer (ViT) [4], which applies a pure transformer model, achieved state-of-the-art accuracy on image classification. It is the first transformer-based technique that can compete with or even outperform CNNs on this task. Although this technique is very impressive, it still suffers from quadratic complexity, which results in high computational cost and memory consumption. Since then, [8] introduced a model that uses an asymmetric attention technique to repeatedly distil inputs into a tight latent bottleneck, which allows it to scale to extremely large inputs. Reference [11] proposes a layer-wise tokens-to-tokens (T2T) transformation in place of the basic tokenization employed in ViT [4] to encode the key local features of each token. Despite these achievements, ViT is rarely deployed in human pose estimation. Differently from these architectures, we present HVTs in which feed-forward networks (FFNs) encode rich local feature tokens in the initial stages (shallow), self-attention (SA) modules encode long-range information in the deeper layers (deeper), and a decoder layer generates richer keypoint detections.
Recently, the transformer has also attracted considerable attention in human pose estimation [20, 22, 34]. POET [35] introduced an encoder-decoder model that combines a CNN with a transformer and directly regresses the pose of every person using a bipartite matching scheme. Building on this idea, TFPose [36] performed direct HPE, addressing the feature-mismatch problem of prior regression approaches. TransPose [24] established a transformer framework that predicts keypoint positions from heatmaps and efficiently captures the spatial relations in images. TokenPose [12] followed ViT [4] by dividing the image into tokens to generate visual patches, merging visual and constraint cues into one cohesive model. Swin-Pose [37] presented a transformer framework that captures long-term relationships between pixels, using a pretrained Swin Transformer [27] as the backbone to extract image features. Furthermore, the feature pyramid was originally developed to fuse features from different stages and thereby enhance them. In contrast to existing transformer-based techniques, our solution integrates top-down and bottom-up pipelines using a transformer.
Human pose estimation
The heatmap-based paradigm for 2D pose estimation [13, 38, 39] estimates the per-pixel likelihood of each keypoint location and currently dominates the study of 2D human pose estimation (HPE). Each location on the heatmap is assigned a probability of being the ground truth point. Several publications [14, 40] have attempted to develop robust backbone networks capable of retaining high-resolution feature maps for heatmap supervision. Tompson et al. [41] proposed a hybrid architecture comprising a deep convolutional network and a Markov random field, one of the earliest applications of heatmaps. Newell et al. [42] established the use of an hourglass design in HPE. To increase localization accuracy, Papandreou et al. [43] aggregate heatmap and offset estimation. To predict heatmaps accurately, Xiao et al. [13] introduced a simple baseline that appends three deconvolutional layers to a backbone network. The authors of [14] proposed a network that maintains high-resolution representations during the entire process, resulting in a considerable gain in performance. Zhang et al. [19] provide a distribution-aware coordinate representation to reduce the quantization error in downsampled heatmaps. Also, to address the challenges in multi-frame human pose estimation, Liu et al. [44, 45] propose DCPose and FAMI-Pose to tackle the spatial misalignment between pose features.
Regression-based paradigm: Several studies have employed the regression-based paradigm [15, 16, 22, 46, 47] to estimate keypoints from person images. For instance, [48] proposed a cascaded deep neural network (C-DNN) dubbed DeepPose that regresses joint coordinates from images, adopting AlexNet as its backbone network. Following the success of this architecture, attention in human pose estimation shifted from classical approaches to deep learning, especially CNNs. [43] went on to propose a structure-aware regression model, compositional pose regression, that relies on ResNet-50. This architecture uses a bone-based, re-parametrized representation that encodes human body information and pose structure, rather than the primitive keypoint representation. [22] presented an end-to-end regression-based training architecture for HPE that uses soft-argmax to convert feature maps into keypoints in a fully differentiable network. A feature that encodes rich pose information is therefore crucial for enhancing the regression-based paradigm, and multi-task learning [49] has been considered a good strategy for learning such feature representations. For example, the framework can generalize better to the pose estimation task through the representations shared among related tasks. In this direction, [50] proposed a heterogeneous multi-task model made up of two tasks: (1) estimating keypoints from complete images by building a regressor, and (2) detecting body parts from image patches with a sliding window. Also, [51] introduced a dual-source integrated deep convolutional network for two tasks: (1) keypoint detection, which determines whether a patch contains a body joint, and (2) keypoint localization, which identifies the correct location of the keypoint in the patch. Each task corresponds to a cost function, and combining the two tasks leads to better results.
Proposed model
Revisiting transformers in vision
Let us begin by reexamining vision transformers, as this will aid us in developing an efficient model for human pose estimation. To apply transformers to computer vision, a vision transformer [4] first divides an image \(X\in {\mathbb{R}}^{\mathcal{H}\times \mathcal{W}\times \mathcal{C}}\) into non-overlapping patches of size \(p\) and embeds them into visual tokens \({X}_{t}\in {\mathbb{R}}^{N\times D}\) in a patch-to-token form, where \(\mathcal{H}\), \(\mathcal{W}\), and \(\mathcal{C}\) are the height, width, and channel dimension of the input image. \(N\) denotes the number of tokens, with \(N=(\mathcal{H}\times \mathcal{W})/{p}^{2}\), and \(D\) is the token dimension. An additional learnable embedding of dimension \(D\), the class token, is then appended to the visual tokens before positional embeddings are added element-wise to each token. These tokens are passed sequentially through several transformer layers to produce the final prediction. Each transformer layer has two components: (1) multi-head self-attention (MHSA) and (2) a feed-forward network (FFN). MHSA extends single-head self-attention (SHSA) by using separate projection matrices for every head; in simple terms, MHSA is \(h\) repetitions of SHSA, where \(h\) is the number of heads. Concretely, for SHSA, the input tokens \({\mathcalligra{x}}_{t}\) are first projected to queries \(\mathcal{Q}\), keys \(\mathcal{K}\), and values \(\mathcal{V}\) using three distinct matrices, \(\mathcal{Q},\mathcal{K},\mathcal{V}={\mathcalligra{x}}_{t}{W}_{Q}, {\mathcalligra{x}}_{t}{W}_{K}, {\mathcalligra{x}}_{t}{W}_{V}\), where \({W}_{Q/K/V}\in {\mathbb{R}}^{D\times \frac{D}{h}}\) are the query, key, and value projection matrices. The self-attention (SA) is then computed as follows:
\[\mathrm{SA}(\mathcal{Q},\mathcal{K},\mathcal{V})=\mathrm{softmax}\left(\frac{\mathcal{Q}{\mathcal{K}}^{\top }}{\sqrt{D/h}}\right)\mathcal{V},\]
where the output of each head is of size \({\mathbb{R}}^{N\times \frac{D}{h}}\). The output of the MHSA module is then formed by concatenating the features of all \(h\) heads along the channel dimension. An FFN is built on top of each MHSA component and applied to every token individually. It is composed of two linear transformations separated by an activation function. In addition, layer normalization [52] is applied before each MHSA and FFN, and a shortcut connection is added around each of them.
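The single-head and multi-head attention described here can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the sizes `N`, `D`, and `h`, the random weights, and the omission of the output projection and class token are assumptions made to keep the example short.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along one axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, w_q, w_k, w_v):
    """x: (N, D) tokens; w_q, w_k, w_v: (D, D/h). Returns the (N, D/h) head output and the (N, N) attention map."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # scale by sqrt(D/h)
    return attn @ v, attn

def multi_head_attention(x, heads):
    """heads: list of (w_q, w_k, w_v), one tuple per head. Head outputs are concatenated along channels."""
    return np.concatenate([single_head_attention(x, *w)[0] for w in heads], axis=-1)

rng = np.random.default_rng(0)
N, D, h = 12, 32, 4  # hypothetical token count, token dimension, number of heads
x = rng.standard_normal((N, D))
heads = [tuple(rng.standard_normal((D, D // h)) * 0.1 for _ in range(3)) for _ in range(h)]
head_out, attn = single_head_attention(x, *heads[0])
mhsa_out = multi_head_attention(x, heads)
```

The attention map has one row and one column per token, which is where the quadratic cost in the number of tokens comes from.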
Overall architecture of GITPose
Figure 1 shows the overall pipeline of GITPose. Let \(\mathcalligra{I}\in {\mathbb{R}}^{H\times W\times 3}\) denote an RGB input image, where \(H\) is the height and \(W\) the width. First, we divide \(\mathcalligra{I}\) into non-overlapping patches of size \(4\times 4\), so the original feature dimension of each patch is \(4\times 4\times 3=48\). Each patch is then projected to feature dimension \({C}_{1}\) by a linear embedding layer, which provides the input to the subsequent pipeline of GITPose. GITPose consists of four stages. Let \(s\in \{1, 2, 3, 4\}\) denote the index of a stage. We use \({L}_{s}\) blocks at stage \(s\): the first two stages employ MLP blocks to extract rich local features, and the final two stages employ conventional transformer blocks [4] to capture long-term dependencies. At every stage, the input feature maps are rescaled to \(\frac{{H}_{s-1}}{{P}_{s}}\times \frac{{W}_{s-1}}{{P}_{s}}\times {C}_{s}\), where \({P}_{s}\) is the patch size and \({C}_{s}\) the hidden feature dimension at stage \(s\). Each transformer block in the final two stages uses \({N}_{s}\) self-attention heads.
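The patch partition and the per-stage resolution rule can be checked with a short NumPy sketch. The \(4\times 4\) first-stage patch size and the \(256\times 192\) input come from the text; the later patch sizes `(2, 2, 2)` and the channel widths are placeholder assumptions, since the section does not list them.

```python
import numpy as np

def patchify(image, p):
    """Split an (H, W, C) image into non-overlapping p x p patches, each flattened to p*p*C values."""
    H, W, C = image.shape
    assert H % p == 0 and W % p == 0, "image size must be divisible by the patch size"
    x = image.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return x.reshape((H // p) * (W // p), p * p * C)

def stage_shapes(H, W, patch_sizes, channels):
    """Apply H_s = H_{s-1} / P_s and W_s = W_{s-1} / P_s stage by stage."""
    shapes = []
    for p, c in zip(patch_sizes, channels):
        H, W = H // p, W // p
        shapes.append((H, W, c))
    return shapes

image = np.zeros((256, 192, 3), dtype=np.float32)
tokens = patchify(image, 4)  # one 48-dimensional vector per 4 x 4 x 3 patch

# Assumed (not from the paper): P = (4, 2, 2, 2) and C = (64, 128, 256, 512).
shapes = stage_shapes(256, 192, patch_sizes=(4, 2, 2, 2), channels=(64, 128, 256, 512))
```

Under these assumed settings the token count shrinks by a factor of four at every stage after the first, which is what keeps attention affordable in the last two stages.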
Design of shallow and deeper blocks of GITPose
As illustrated in Fig. 1, GITPose uses two kinds of blocks: FFN blocks, hereafter referred to as shallow blocks, and transformer blocks, referred to as deeper blocks. Shallow blocks are used in the initial stages. A shallow block is an FFN comprising two fully connected (FC) layers with GELU [53] as the nonlinearity. Every FFN at stage \(s\) uses an expansion ratio \({\mathcal{E}}_{s}\): the first FC layer expands the token feature dimension from \({C}_{s}\) to \({\mathcal{E}}_{s}\times {C}_{s}\), and the second FC layer reduces it back to \({C}_{s}\). Concretely, let \({\Gamma \in {\mathbb{R}}}^{(\frac{{H}_{s-1}}{{P}_{s}} \times \frac{{W}_{s-1}}{{P}_{s}}) \times {C}_{s}}\) be the input of the \(s\)th stage and \(\mathcalligra{i}\) the block index; our shallow block can then be expressed as
\[{\Gamma }_{\mathcalligra{i}}={\Gamma }_{\mathcalligra{i}-1}+\Psi \left(\mathcal{L}\mathcal{N}\left({\Gamma }_{\mathcalligra{i}-1}\right)\right),\]
where \(\Psi (.)\) and \(\mathcal{L}\mathcal{N}(.)\) represent the FFN and layer norm, respectively. When the embedded features \({\Gamma }_{0}\) are projected to queries \({\mathcal{Q}}_{0}\), keys \({\mathcal{K}}_{0}\), and values \({\mathcal{V}}_{0}\), the encoder expands them with the projection matrix \({\Lambda }_{\mathcalligra{i}}\in {\mathbb{R}}^{C\cdot {2}^{\mathcalligra{i}-1}\times C\cdot {2}^{\mathcalligra{i}}}\), and \({\mathcal{Q}}_{\mathcalligra{i}}\), \({\mathcal{K}}_{\mathcalligra{i}}\), and \({\mathcal{V}}_{\mathcalligra{i}}\) are computed from these projections together with \({\Theta }_{\mathcalligra{i}}\), the positional embedding of index \(\mathcalligra{i}\).
As shown in Fig. 2, our deeper block, defined as in ViT [4], has an MHSA layer (\(\upgamma \)) followed by an FFN and is used in the final stages. It may be denoted as
\[{\Gamma }_{\mathcalligra{i}}^{\prime}={\Gamma }_{\mathcalligra{i}-1}+\upgamma \left(\mathcal{L}\mathcal{N}\left({\Gamma }_{\mathcalligra{i}-1}\right)\right),\quad {\Gamma }_{\mathcalligra{i}}={\Gamma }_{\mathcalligra{i}}^{\prime}+\Psi \left(\mathcal{L}\mathcal{N}\left({\Gamma }_{\mathcalligra{i}}^{\prime}\right)\right).\]
This design affords our approach two significant benefits: (1) we avoid the enormous computational cost and memory footprint generated by long token sequences in the initial stages, and (2) in contrast to prior efforts that shrink the attention maps via sub-windows [27] or reduce the spatial dimension of the key and value matrices, we keep standard MHSA layers in the final two stages, which retains the ability of GITPose to model long-term dependencies while keeping the FLOPs moderate owing to its hierarchical design.
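To make the first benefit concrete, the sketch below implements a pre-norm FFN (shallow) block in NumPy and compares rough multiply-accumulate counts of an FFN and a self-attention layer. It is a simplified illustration under stated assumptions: layer norm has no learnable scale or shift, GELU uses the tanh approximation, and the token counts and channel widths in the usage lines are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Layer normalization over the channel axis (learnable scale and shift omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of the GELU nonlinearity."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def shallow_block(x, w1, b1, w2, b2):
    """x + FC2(GELU(FC1(LN(x)))): FC1 expands C to E*C, FC2 reduces back to C."""
    hidden = gelu(layer_norm(x) @ w1 + b1)
    return x + hidden @ w2 + b2

def ffn_macs(n, c, e):
    """Multiply-accumulates of the two FC layers for n tokens (linear in n)."""
    return 2 * n * c * (e * c)

def attention_macs(n, c):
    """Q/K/V/output projections plus the two N x N matrix products (quadratic in n)."""
    return 4 * n * c * c + 2 * n * n * c

# Hypothetical early stage: 64 x 48 tokens with 64 channels, expansion ratio 4.
early_ffn, early_attn = ffn_macs(64 * 48, 64, 4), attention_macs(64 * 48, 64)
# Hypothetical late stage: 8 x 6 tokens with 512 channels.
late_ffn, late_attn = ffn_macs(8 * 6, 512, 4), attention_macs(8 * 6, 512)
```

With these assumed sizes, attention costs far more than the FFN on the long early sequence but less than the FFN on the short late sequence, which matches the motivation for placing attention only in the deeper stages.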
Deformable token association (DTA)
Prior studies on HVTs [27, 54] relied on patch (token) merging to produce pyramidal feature representations. However, they merge tokens from a regular grid and ignore the fact that not all patches contribute equally to an output unit [55]. Motivated by deformable convolutions [53, 56], we offer DTA as our token merging (TM) method, which learns a grid of offsets to dynamically sample relevant keypoint tokens. We denote the deformable convolution by \(\Phi \).
Unlike a conventional convolution, which samples its input at a fixed set of predetermined offsets \(g(k)\), \(\Phi (.)\) learns an additional offset \(\Delta g(k)\) for every predetermined offset \(g(k)\). To learn \(\Delta g\left(k\right)\), a separate convolutional layer is applied over the input feature map \(\Gamma \). To merge keypoint tokens adaptively, we employ one \(\Phi \) layer in the DTA module, as shown in Fig. 2, defined as follows:
\[\mathrm{DTA}(\Gamma )=\mu \left(\aleph \left(\Phi \left(\Gamma \right)\right)\right),\]
where \(\aleph (.)\): batch normalization [57] and \(\mu (.)\): GELU nonlinearity.
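The idea of merging tokens at learned, non-uniform positions can be sketched in plain NumPy. This is a deliberately simplified illustration, not the DTA module itself: the \(2\times 2\) kernel with stride 2, the bilinear sampling, and the helper names are assumptions, the offsets are passed in as an array instead of being predicted by a separate convolution, and batch normalization is replaced by per-channel statistics without learnable parameters.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def bilinear_sample(feat, y, x):
    """Sample feat (H, W, C) at a fractional position; points outside the map contribute zero."""
    H, W, C = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    out = np.zeros(C)
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < H and 0 <= xx < W:
                out += wy * wx * feat[yy, xx]
    return out

def deformable_merge(feat, weight, offsets, stride=2):
    """feat: (H, W, C_in); weight: (K*K, C_in, C_out); offsets: (H/stride, W/stride, K*K, 2) as (dy, dx)."""
    H, W, _ = feat.shape
    kk, _, c_out = weight.shape
    k = int(round(np.sqrt(kk)))
    grid = [(dy, dx) for dy in range(k) for dx in range(k)]  # fixed offsets g(k)
    out = np.zeros((H // stride, W // stride, c_out))
    for i in range(H // stride):
        for j in range(W // stride):
            for n, (dy, dx) in enumerate(grid):
                y = i * stride + dy + offsets[i, j, n, 0]  # g(k) + learned shift
                x = j * stride + dx + offsets[i, j, n, 1]
                out[i, j] += bilinear_sample(feat, y, x) @ weight[n]
    return out

def dta(feat, weight, offsets, eps=1e-5):
    """GELU(norm(deformable merge)); the norm is a stand-in for batch normalization."""
    merged = deformable_merge(feat, weight, offsets)
    mean = merged.mean(axis=(0, 1), keepdims=True)
    var = merged.var(axis=(0, 1), keepdims=True)
    return gelu((merged - mean) / np.sqrt(var + eps))

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 6, 3))
weight = rng.standard_normal((4, 3, 5))
offsets = rng.uniform(-0.5, 0.5, size=(2, 3, 4, 2))  # would be predicted by a conv layer
merged_tokens = dta(feat, weight, offsets)
```

With all offsets set to zero the merge reduces to ordinary \(2\times 2\) patch merging on a regular grid, so the learned offsets are the only thing that makes the sampling non-uniform.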
Transformer decoder
To process the feature tokens retrieved from the backbone and localize the keypoints, we employ a straightforward, lightweight decoder. This is the final stage of keypoint detection, where the extracted feature maps are converted into one heatmap per keypoint (i.e., 17 heatmaps in the case of the COCO dataset). We adopt the classic decoder of ViTPose [25], which consists of two deconvolution blocks, each with one deconvolution layer followed by batch normalization [57] and ReLU [58]. Following the standard configuration of earlier approaches [13, 59], each block upsamples the feature maps by a factor of two. The localization heatmaps for the keypoints are then generated by a convolution layer with a k × k (i.e., 1 \(\times \) 1) kernel. We denote by \({\mathcal{V}}_{\mathcal{D}}\) the heatmap values estimated by our decoder through this 1 \(\times \) 1 convolutional layer and by \({\mathfrak{D}}_{\varphi }\) the deconvolution layer.
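The shape bookkeeping of such a decoder can be sketched in NumPy. This is only an illustration of the data flow: nearest-neighbour upsampling stands in for the learned deconvolution, batch normalization is omitted, and the input feature size of \(16\times 12\times 256\) and the random weights are assumptions, not values from the paper.

```python
import numpy as np

def upsample2x(feat):
    """Double the spatial size of an (H, W, C) map (stand-in for a learned deconvolution)."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def conv1x1(feat, weight, bias):
    """A 1 x 1 convolution is a per-pixel linear map: (H, W, C) @ (C, K) + (K,)."""
    return feat @ weight + bias

def decode_keypoints(heatmaps):
    """heatmaps: (H, W, K). Returns (K, 2) integer (x, y) peaks and the (K,) peak scores."""
    H, W, K = heatmaps.shape
    flat = heatmaps.reshape(H * W, K)
    ys, xs = np.unravel_index(flat.argmax(axis=0), (H, W))
    return np.stack([xs, ys], axis=1), flat.max(axis=0)

rng = np.random.default_rng(0)
features = rng.standard_normal((16, 12, 256))  # hypothetical backbone output
x = np.maximum(upsample2x(features), 0.0)      # block 1: upsample by 2, then ReLU
x = np.maximum(upsample2x(x), 0.0)             # block 2: upsample by 2, then ReLU
w = rng.standard_normal((256, 17)) * 0.01
heatmaps = conv1x1(x, w, np.zeros(17))          # one heatmap per COCO keypoint
coords, scores = decode_keypoints(heatmaps)
```

Two factor-of-two upsampling steps turn the assumed \(16\times 12\) feature map into \(64\times 48\) heatmaps, from which each keypoint is read as the highest-scoring location.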
Experiments
Setup
Dataset: We evaluate GITPose on the challenging COCO benchmark [60]. The dataset comprises approximately 160 k images gathered from the internet and grouped into 80 main categories. It is split into three subsets: train2017 (118 k images), val2017 (5 k images), and test2017 (41 k images). For HPE, the COCO dataset contains over 200 k images of over 150 k individuals annotated with 17 keypoints. It is split into a train set (57 k images), a validation set (5 k images), and a test-dev set (20 k images). To facilitate comparison with SOTA architectures, we train on the training images and report results on both the validation set (val-set) and the test set. The standard mean average precision (mAP) is used to report the precision of GITPose. In addition, we use the COCO-standard object keypoint similarity (OKS), which is formulated as follows:
\[\mathrm{OKS}=\frac{\sum_{i}\exp \left(-{d}_{i}^{2}/2{s}^{2}{k}_{i}^{2}\right)\delta \left({v}_{i}>0\right)}{\sum_{i}\delta \left({v}_{i}>0\right)}\]
Given the 17 annotated keypoints \(i\in \{1, 2, \dots ,17\}\), \({d}_{i}\) is the Euclidean distance between the predicted keypoint and its corresponding ground truth, \({v}_{i}\) is the visibility flag of the ground truth, \(s\) is the object scale, \({k}_{i}\) is the per-keypoint COCO constant, and \(\delta (\cdot )\) equals 1 when the visibility is positive and 0 otherwise. Furthermore, following the standard COCO metrics, we compute the mean average precision and recall: AP50, AP75, APS, APM, and APL, together with the recall scores AR50, AR75, ARS, and ARL, where S, M, and L denote small, medium, and large objects, respectively. We mainly use average precision (AP), the primary challenge metric in COCO, and also FLOPs, to analyze the computing overhead in comparison with other methods.
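For reference, OKS for a single person can be computed as in the following NumPy sketch. It follows the convention of the public COCO evaluation code, where the per-keypoint constant is twice a tabulated sigma and the squared scale is the object area; the function name and the toy inputs are assumptions for illustration.

```python
import numpy as np

# Per-keypoint sigmas from the public COCO keypoint evaluation code (17 keypoints).
COCO_SIGMAS = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72,
                        .62, .62, 1.07, 1.07, .87, .87, .89, .89]) / 10.0

def oks(pred, gt, visibility, area, sigmas=COCO_SIGMAS):
    """pred, gt: (17, 2) keypoints; visibility: (17,) flags; area: object area, used as s**2."""
    d2 = np.sum((pred - gt) ** 2, axis=1)  # squared distances d_i^2
    k = 2.0 * sigmas                       # per-keypoint constants k_i
    visible = visibility > 0               # delta(v_i > 0)
    if not visible.any():
        return 0.0
    e = d2 / (2.0 * area * k ** 2)
    return float(np.exp(-e[visible]).sum() / visible.sum())

rng = np.random.default_rng(0)
gt = rng.uniform(0, 100, size=(17, 2))
vis = np.ones(17)
oks_perfect = oks(gt, gt, vis, area=100.0 * 100.0)
oks_shifted = oks(gt + 5.0, gt, vis, area=100.0 * 100.0)
```

AP at a threshold such as 0.50 or 0.75 then counts a prediction as correct when its OKS with the matched ground truth reaches that threshold.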
In addition, we conducted comprehensive experiments on MPII [61]. The MPII dataset contains around 25,000 images and 40,000 people with 16 joint labels. To provide fair comparisons, all images are cropped following the standard training settings [13, 40]. For training, we randomly split the data for the backbone network search into two parts: (1) 80% for training the weights, and (2) 20% for updating the network architecture parameters.
Implementation details
We train on a single NVIDIA GeForce RTX 2080 Ti GPU and implement our method in PyTorch [62] using the mmpose library [63]. GITPose follows the conventional top-down configuration for pose estimation: a detector first detects human instances, and GITPose then estimates the keypoints of each detected instance. The detection results of Simple Baseline [13] are used to evaluate GITPose on the popular COCO keypoint validation set. As backbones, we employ ViT-B, ViT-L, and ViT-H; the associated models are denoted as GITPose-B, GITPose-L, and GITPose-H. The backbones are initialized with pretrained MAE [64] weights. For training the GITPose networks, the default mmpose training settings are used, with an input resolution of \(256\times 192\), the AdamW [65] optimizer, and a learning rate of 5e-4. We use UDP [66] for post-processing. The models are trained for 210 epochs, and the learning rate is decayed by a factor of 10 at the 170th and 200th epochs.
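The step schedule can be written down as a small Python function. This is a sketch of the decay rule only: the function name is hypothetical, epochs are assumed to be 0-indexed, and any warm-up used by the default mmpose settings is left out.

```python
def step_lr(epoch, base_lr=5e-4, milestones=(170, 200), gamma=0.1):
    """Learning rate at a given epoch: base_lr is multiplied by gamma once per milestone reached."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops

# Learning rate for each of the 210 training epochs.
schedule = [step_lr(e) for e in range(210)]
```

Under this rule the rate stays at the base value for the first 170 epochs and is reduced tenfold at each of the two milestones.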
Comparison with the state-of-the-art models
MPII keypoint detection
Table 1 compares the accuracy of the state-of-the-art (SOTA) methods and GITPose in terms of PCKh@0.5. GITPose is more accurate than its predecessors with every backbone design. Notably, the MPII training set is significantly smaller than that of COCO, which suggests that our method generalizes well across different training data sizes. With an input resolution of 256 × 192 pixels, GITPose reaches an average of 93.7 PCKh with the ViT-B backbone. At the same resolution, GITPose obtains 94.8 average PCKh with the ViT-L backbone and 94.3 average PCKh with the ViT-H backbone. Overall, GITPose achieves a performance gain of +0.7 over the best SOTA method.
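PCKh@0.5 counts a joint as correct when its prediction lies within half the head size of the ground truth. The NumPy sketch below shows the computation on toy data; the function name, the array layout, and the precomputed `head_sizes` input are assumptions for illustration and do not reproduce the official MPII evaluation script.

```python
import numpy as np

def pckh(pred, gt, head_sizes, visible, thr=0.5):
    """pred, gt: (P, J, 2) joints for P people; head_sizes: (P,); visible: (P, J) bool.

    Returns the fraction of visible joints whose error is at most thr * head size.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)       # (P, J) Euclidean errors
    correct = dist <= thr * head_sizes[:, None]
    return float(correct[visible].mean())

# Toy example: one person, four joints, head size 10, so the threshold is 5 pixels.
gt = np.zeros((1, 4, 2))
pred = np.array([[[0.0, 0.0], [4.0, 0.0], [6.0, 0.0], [0.0, 5.0]]])
all_visible = np.ones((1, 4), dtype=bool)
score = pckh(pred, gt, np.array([10.0]), all_visible)
```

In the toy example three of the four joints fall within the 5-pixel threshold, so the score is the fraction of joints that pass.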
COCO keypoint detection
On validation set: Table 2 compares the SOTA methods and GITPose on the validation set. GITPose has several advantages: (1) the MLP blocks directly encode rich local feature tokens in the initial stages, which drastically decreases the computational cost; and (2) the DTA module flexibly fuses more informative keypoint tokens, which substantially improves convergence and accuracy. We find that SBL is the most efficient of the preceding networks that do not use any lightweight method; however, GITPose further increases efficiency while maintaining competitive accuracy. With an input size of 256 × 192, GITPose consistently produces a higher average precision (AP) than its counterparts across backbone designs. GITPose improves AP by 0.9 (76.7 vs. 75.8), 0.5 (78.8 vs. 78.3), and 0.9 (80.0 vs. 79.1) points over ViTPose with the ViT-B, ViT-M, and ViT-L backbones, respectively, at the same test resolution of 256 × 192. Additionally, our best model improves AP by 4.4 (80.0 vs. 75.6), 4.2 (80.0 vs. 75.8), and 5.6 (80.0 vs. 74.4) points over HRFormer [69], TokenPose-L/D24 [70], and HRNet [40] at the same resolution.
On test-dev set: We further evaluate the proposed approach on the COCO test-dev set and compare it with current state-of-the-art methods; the results are given in Table 3. Following standard procedure, we use the human bounding boxes provided by SBL [13]. We use ViTPose as a backbone to establish an additional version of GITPose, which achieves greater precision than its counterparts, as shown in Table 3. With a small model at a test resolution of 256 × 192, we attain 77.1% AP on MS-COCO, which is 0.2% better than the strong baseline ViTPose-B [25], and with a medium-sized model at the same test resolution, we achieve 79.4% AP, which is 0.8% better than ViTPose-M [25]. With our large model, which is our best model at the same resolution, GITPose obtains 81.1% AP on MS-COCO, which is 1.2% and 5.1% better than ViTPose-L [25] and TokenPose-L/D24 [70], respectively, while keeping a parameter count similar to that of ViTPose. Comparing GITPose-L with the best model of HRNet [40], we obtained 81.3 vs. 71.3, showing that our model performs 9.9% better than HRNet. During experiments, we also observed that GITPose performs better with a large backbone than with a smaller one.
Ablation study
To investigate the impact of self-attention on our model, we train GITPose on COCO and progressively remove the self-attention layers stage by stage. Tables 4 and 5 summarize the findings. First, after substituting the spatial reduction attention (SRA) layers in GITPose-B with standard MSA layers, the accuracy improves by 0.9%. This suggests that GITPose trades off performance against efficiency. Next, when the MSA layers are removed progressively from the first two stages, accuracy decreases by only 0.1% and 0.6%. This suggests that the self-attention (SA) layers in the initial stages contribute little to the final performance and behave similarly to pure MLP layers, because the shallow blocks are mainly concerned with encoding local patterns. Removing the self-attention (SA) layers in the final two stages, however, leads to a significant drop in performance. These results indicate that SA layers play a significant role in the later stages and that capturing long-term dependencies is crucial for well-performing ViTs.
PS: patch size; CD: channel dimension. LN: number of blocks at every stage. NS: number of SA heads. Notably, the initial two stages do not contain SA layers. The ratio of expansion of GITPose at every stage is denoted as [4, 4, 8, 8]. Again, GITPose utilized the ratio of expansion 4 throughout the MLP blocks or layers in GITPose-B, GITPose-L, and GITPose-H. The input resolution is 256 × 192
Figure 3 shows that GITPose achieves excellent results under occlusion, crowding, fast motion, nearby people, and truncation. GITPose handles a variety of poses. It also correctly infers occluded keypoints, whether self-occluded or occluded by other objects, as shown in row 2. The first and second rows show results for images containing more than one person, which are also promising. Our model can deal with small, blurry, and low-light images, as shown in row 1, column 2. We also show how GITPose performs on a single image and observe that it does extremely well. To gain insight into how GITPose achieves its performance, we visually examined the features learned by the network. Table 6 shows that, with the same training settings, our best model, GITPose-H, achieves a competitive accuracy/speed trade-off (Fig. 4).
Conclusion
In this paper, we improved on existing state-of-the-art human pose estimation models by introducing novel shallow and deeper vision transformer blocks that extract features without any convolution. Specifically, we utilize MLPs to encode rich local feature tokens in the early stages, whereas self-attention (SA) modules encode long-range relationships in the deeper layers, followed by a decoder for keypoint detection. In addition, our learnable deformable token association module (DTA) non-uniformly and dynamically combines informative tokens. Ablation experiments demonstrate the effectiveness of GITPose. Comprehensive evaluation on the COCO and MPII benchmark datasets shows that GITPose improves accuracy by +0.9 AP on the COCO val set, +1.1 AP on COCO test-dev, and +0.7 AP on MPII compared with its state-of-the-art counterparts. We hope our research inspires further study of transformer-based human pose estimation approaches and encourages their adoption in real-time applications.
In future work, we will investigate the architecture for multi-person pose estimation and perform temporal tracking across frames using the PoseTrack dataset. The findings indicate that transformer-based approaches can be used effectively for human pose estimation (HPE) tasks, yielding comparable performance. However, a significant obstacle is the substantial size of transformer models, which have about twice the number of parameters and six times the number of FLOPs of convolutional neural network (CNN) models. Additional research is required to reduce this computational expense while achieving superior performance in comparison to CNN models.
Data availability statement
Not applicable.
Acknowledgements
This work was supported by Zhejiang Provincial Natural Science Foundation (No. LZ23F020004), the “Pioneer” and “Leading Goose” R&D Program of Zhejiang (No. 2023C01042), the Zhejiang Gongshang University “Digital+” Disciplinary Construction Management Project (Project Number SZJ2022A02, SZJ2022C005).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest and no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Aidoo, E., Wang, X., Liu, Z. et al. GITPose: going shallow and deeper using vision transformers for human pose estimation. Complex Intell. Syst. 10, 4507–4520 (2024). https://doi.org/10.1007/s40747-024-01361-y
DOI: https://doi.org/10.1007/s40747-024-01361-y