Introduction

The innovative transformer design [1] has produced a major leap in capability over convolutional neural networks for computer vision problems. Building on it, the detection transformer (DETR) [2] was presented as the first fully end-to-end object detection model: it uses transformers to generate the final set of predictions directly, without additional post-processing. Its convergence, however, requires an extended training period. For instance, while the widely used Faster R-CNN model [3] needs only 30 epochs to converge, DETR demands 500 epochs, which typically takes 10 days on eight V100 GPUs. Such training costs quickly become prohibitive in large-scale applications. Consequently, how to accelerate learning and achieve rapid convergence for detection-transformer-based techniques remains a difficult research challenge. In contrast, the vision transformer (ViT) [4], which feeds a standard transformer with a sequence of embedded image patches, was the first convolution-free transformer to achieve performance comparable to convolutional networks. However, ViT requires extremely large datasets for training, such as ImageNet-21K [5] and JFT-300M [6]. DeiT [7] subsequently showed how data augmentation and network regularization can be used to train large ViT models with less data. Despite its capabilities, ViT has two significant shortcomings: (1) its complexity rises quadratically with image size, which makes the vanilla ViT unsuitable for scaling, and (2) the attention computation in the transformer encoder and decoder adds substantial cost to the detector's overall processing time. ViT has since sparked numerous attempts to improve its efficiency and effectiveness in a variety of ways [8,9,10,11]. Such high computational cost makes it challenging to apply transformers to the human pose estimation (HPE) [12,13,14] task.

In this work, we propose going shallow and deeper with vision transformers for 2D human pose estimation (GITPose) to address the aforementioned challenges. Conventionally, HPE aims to locate anatomical keypoints such as the shoulders, knees, neck, and head in images or video snippets, relying heavily on both visual information and the constraint relations between keypoints. It is a critical topic in computer vision that has drawn considerable interest from academia and industry. Deep convolutional neural networks (DCNNs) have demonstrated strong capability for human pose estimation. Roughly, there are two main paradigms: (1) regression of keypoint positions [15, 16] and (2) keypoint heatmap estimation followed by selection of the location with the highest score [17,18,19]. The former treats pose prediction as a joint-position regression problem and regresses the coordinates of each keypoint directly. The latter applies a 2D Gaussian kernel to every keypoint to create ground-truth heatmaps and supervises the prediction with an L2 loss on these heatmaps. Because heatmap regression is easier to optimize than direct keypoint regression and delivers high performance, it forms the basis of the majority of state-of-the-art (SOTA) approaches.
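To make the heatmap-based paradigm concrete, the short NumPy sketch below renders a 2D Gaussian target heatmap for a single keypoint; the heatmap resolution and σ are illustrative assumptions rather than the settings of any particular method cited above.

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Ground-truth target for one keypoint at (cx, cy): a 2D Gaussian peak."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# e.g., a 64x48 heatmap (1/4 of a 256x192 input) with a keypoint at (20, 30)
target = gaussian_heatmap(64, 48, cx=20, cy=30)
# an L2 (MSE) loss between predicted and target heatmaps then supervises training
```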

Motivated by the efficacy of recent transformer methods for 2D human pose estimation [12, 20,21,22,23,24,25], the pyramid structure [26] of CNNs, and current hierarchical vision transformers (HVTs) [27,28,29] that split transformer blocks into several stages and progressively shrink feature maps as the network gets deeper, we propose GITPose, a purely convolution-free base model that integrates image patches of various sizes to generate more powerful visual keypoint representations for human pose estimation. Unlike previous counterparts, our technique processes keypoint tokens with shallow blocks in the initial stages. This design limits the enormous computational cost and memory footprint that self-attention incurs on high-resolution feature maps. In addition, utilizing self-attention (SA) in the later (deeper) stages to capture long-range dependencies is highly effective thanks to the progressively compressing pyramid structure. Our primary objective in this study is to design simple and straightforward feature-fusion hierarchical vision transformer (HVT) [28] architectures suited to vision transformers and then use them for effective keypoint representation, a problem that to our knowledge has not been addressed. Our extensive experiments indicate that such a simple network design strikes a balance between model performance and efficiency.

The most significant contributions are as follows:

  • We propose GITPose for 2D HPE, which is one of the first studies to utilize a vision transformer to extract feature representations.

  • GITPose introduces a hierarchical transformer in which we utilize MLP(s) to encode deep local feature tokens in the initial stages, whereas self-attention (SA) components are employed to encode longer-range relationships in the deeper stages, together with a decoder layer for keypoint detection.

  • We also propose a novel deformable token association (DTA) module that flexibly fuses the most valuable and informative keypoint tokens, providing hierarchical representations with increased transformation modeling capacity.

  • Based on extensive experiments, the GITPose architecture outperforms its current keypoint-detection counterparts, with or without a convolutional neural network (CNN) backbone, and obtains new SOTA results on the two benchmark datasets MS-COCO and MPII.

Related works

Transformers in vision

Transformers were first used for machine translation [30] and have since substantially improved the performance of machine learning techniques on a variety of natural language processing problems [31,32,33]. Their use in computer vision tasks, either in conjunction with or as an alternative to convolutional neural networks (CNNs), has proven to be remarkably effective [29]. Notably, the vision transformer (ViT) [4], applying a pure transformer model, achieved state-of-the-art accuracy on image classification. Specifically, it is the first transformer-based technique that can compete with or even outperform CNNs on image recognition tasks. Although impressive, it still suffers from quadratic complexity, which results in high computational cost and memory consumption. Since then, [8] revealed a model that uses an asymmetric attention technique to repeatedly distil inputs into a tight latent bottleneck, allowing it to scale to extremely large amounts of data. Reference [11] proposes a layer-wise tokens-to-token (T2T) transformation in place of the basic tokenization employed in ViT [4] to encode the key local features of each token. Despite all of ViT's achievements, it is rarely deployed in human pose estimation. Differently from these architectures, we present an HVT in which feed-forward networks (FFNs) encode rich, deep local feature tokens in the early (shallow) stages, self-attention (SA) modules encode longer-range information in the deeper layers, and a decoder generates richer keypoint detections.

Recently, the transformer has also attracted considerable attention and application [20, 22, 34] in human pose estimation. POET [35] revealed an encoder–decoder model that combines a CNN with a transformer and directly regresses the pose of every person using a bipartite-matching technique. Building on this approach, TFPose [36] performed direct HPE regression, addressing the feature-mismatch problem of prior regression approaches. TransPose [24] established a transformer framework for predicting the positions of human keypoints from heatmaps while efficiently capturing the spatial relations in images. TokenPose [12] followed ViT [4] by splitting the given image into many tokens to generate visual patches, merging visual and constraint cues into a cohesive architecture. Swin-Pose [37] presented a transformer framework to capture the long-range dependencies between pixels, employing a pretrained Swin Transformer [27] as the backbone to extract input image features. Furthermore, the pyramid feature model was originally developed to join and enhance features from many different stages. In contrast to existing transformer-based techniques, our solution integrates top-down and bottom-up pipelines using a transformer.

Human pose estimation

Heatmap-based paradigm: The heatmap-based paradigm for 2D pose estimation [13, 38, 39] calculates a per-pixel likelihood for each keypoint location and currently dominates the study of 2D human pose estimation (HPE). Several publications [14, 40] have attempted to develop robust backbone networks capable of retaining high-resolution feature maps for heatmap supervision. Each position on a heatmap is assigned the probability that it is the ground-truth keypoint. Tompson et al. [41] suggested a hybrid architecture combining a deep convolutional network with a Markov random field, among the earliest applications of heatmaps. Newell et al. [42] established the use of the hourglass design in HPE. To increase localization accuracy, Papandreou et al. [43] proposed aggregating heatmap and offset estimation. To obtain correctly predicted heatmaps, Xiao et al. [13] introduced a simple baseline that employs three deconvolutional layers after the backbone network. The authors of [14] suggested an innovative network that maintains high-resolution representations throughout the entire process, resulting in a considerable gain in performance. Zhang et al. [19] also provided a distribution-aware coordinate representation to reduce the quantization error of downsampled heatmaps. In addition, to address the challenges of multi-frame human pose estimation, Liu et al. [44, 45] proposed DCPose and FAMI-Pose to tackle the spatial misalignment between pose features.


Regression-based paradigm: Several studies have employed the regression-based paradigm [15, 16, 22, 46, 47] to estimate keypoints from person images. For instance, [48] suggested a cascaded deep neural network dubbed DeepPose to regress joint coordinates from images, adopting AlexNet as its backbone. Owing to the success of this architecture, attention in human pose estimation has shifted from classical approaches to deep learning, especially CNNs. [43] then suggested a structure-aware regression model, compositional pose regression, based on ResNet-50. This architecture uses a bone-based, re-parameterized representation that encodes human body data and pose structure rather than relying on the primitive keypoint representation. [22] revealed an end-to-end, regression-based training architecture for HPE that uses soft-argmax to convert feature maps into keypoints, making the network fully differentiable. It has therefore been argued that features capable of encoding rich pose information are crucial for improving the regression-based paradigm; in view of that, multi-task learning [49] has been considered a good strategy for learning such feature representations. For example, a framework can generalize better to the pose estimation task through representations shared with related tasks. In this direction, [50] suggested a heterogeneous multi-task model consisting of two tasks: (1) estimating keypoints from complete images by building a regressor, and (2) detecting body parts from image patches using a sliding window. Also, [51] introduced a dual-source integrated deep convolutional network for two tasks: (1) keypoint detection, which indicates whether a patch contains a body joint, and (2) keypoint localization, which identifies the exact location of the keypoint in the patch. Each task corresponds to a cost function, and jointly training the two tasks leads to better results.

Proposed model

Revisiting transformers in vision

Let us begin by revisiting vision transformers, as this will aid us in developing an efficient model for human pose estimation. To utilize transformers in computer vision applications, a vision transformer [4] first divides an image \(X\in {\mathbb{R}}^{\mathcal{H}\times \mathcal{W}\times \mathcal{C}}\) into multiple non-overlapping patches of size \(p\) and embeds them into visual tokens (i.e., \({X}_{t}\in {\mathbb{R}}^{N\times D}\)) in a patch-to-token fashion, where \(\mathcal{H}\), \(\mathcal{W}\), and \(\mathcal{C}\) denote the height, width, and channel dimensions of the input image. \(N\) denotes the number of tokens, where \(N=(\mathcal{H}\times \mathcal{W})/{p}^{2}\), and \(D\) represents the token dimension. An additional learnable embedding of dimension \(D\), regarded as the class token, is then appended to the visual tokens before element-wise positional embeddings are added to each token. These tokens are passed sequentially through several transformer layers for the final prediction. Each transformer layer has two components: (1) multi-head self-attention (MHSA) and (2) a feed-forward network (FFN). MHSA extends single-head self-attention (SHSA) by employing a distinct projection matrix for every head; in simple terms, MHSA is the result of \(h\) repetitions of SHSA, where \(h\) is the number of heads. Concretely, for SHSA, the input tokens \({\mathcalligra{x}}_{t}\) are first converted to queries \(\mathcal{Q}\), keys \(\mathcal{K}\), and values \(\mathcal{V}\) using three distinct matrices, i.e., \(\mathcal{Q}={\mathcalligra{x}}_{t}{W}_{Q}\), \(\mathcal{K}={\mathcalligra{x}}_{t}{W}_{K}\), and \(\mathcal{V}={\mathcalligra{x}}_{t}{W}_{V}\), where \({W}_{Q/K/V}\in {\mathbb{R}}^{D\times \frac{D}{h}}\) are the query, key, and value projection matrices. The self-attention (SA) is then computed as follows:

$$Attention\left(\mathcal{Q},\mathcal{K},\mathcal{V}\right)=Softmax\left(\frac{{\mathcal{Q}\mathcal{K}}^{T}}{\sqrt{\mathcal{D}}}\right)\mathcal{V},$$
(1)

where each output head is of size \({\mathbb{R}}^{N\times \frac{D}{h}}\). The output of the MHSA module is then formed by concatenating the features of all \(h\) heads along the channel dimension. An FFN is built on top of each MHSA component and applied to each token individually; it is composed of two linear transformations separated by an activation function. In addition, a layer normalization [52] and a shortcut connection are added before the MHSA and FFN, respectively.
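To make the notation above concrete, here is a minimal, illustrative PyTorch sketch of Eq. (1) and of MHSA as the concatenation of \(h\) single-head outputs; the token count, dimension, and head count in the toy usage are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    """Eq. (1): softmax(Q K^T / sqrt(d)) V for a single head; q, k, v: (N, d)."""
    d = q.size(-1)
    return F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1) @ v

def mhsa(x, w_q, w_k, w_v, h):
    """MHSA as h repetitions of single-head attention, concatenated along the
    channel dimension. x: (N, D); w_q, w_k, w_v: (D, D), split across h heads."""
    N, D = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # each (N, D)
    heads = []
    for i in range(h):                                   # take each head's D/h channels
        sl = slice(i * D // h, (i + 1) * D // h)
        heads.append(self_attention(q[:, sl], k[:, sl], v[:, sl]))
    return torch.cat(heads, dim=-1)                      # (N, D)

# toy usage: h = 8 heads over N = 196 tokens of dimension D = 256
x = torch.randn(196, 256)
w_q, w_k, w_v = (torch.randn(256, 256) * 0.02 for _ in range(3))
out = mhsa(x, w_q, w_k, w_v, h=8)                        # (196, 256)
```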

Overall architecture of GITPose

Figure 1 shows the general architectural pipeline of GITPose. Let \(\mathcalligra{I}\in {\mathbb{R}}^{H\times W\times 3}\) denote an RGB input image with height \(H\) and width \(W\). First, we divide \(\mathcalligra{I}\) into non-overlapping patches of size \(4\times 4\), so the original feature dimension of each patch is \(4\times 4\times 3=48\). Each patch is then projected into feature dimension \({C}_{1}\) using a linear embedding layer, which serves as the input to the subsequent GITPose pipeline. GITPose consists of four stages. Let \(s\in \{1, 2, 3, 4\}\) denote the index of a stage; we utilize \({L}_{s}\) blocks at every stage, with the first two stages employing MLP blocks primarily to extract rich local features and the final two stages employing conventional transformer blocks [4] to resolve long-range dependencies. At every stage, the input feature maps are scaled to \(\frac{{H}_{s-1}}{{P}_{s}}\times \frac{{W}_{s-1}}{{P}_{s}}\times {C}_{s},\) where \({P}_{s}\) is the patch size and \({C}_{s}\) the hidden feature dimension at stage \(s\). We use \({N}_{s}\) self-attention heads within every transformer block in the final two stages (see the configuration sketch after Fig. 1).

Fig. 1
figure 1

The overall architecture of GITPose. We reveal a hierarchical transformer in which MLPs encode deep local feature tokens in the initial stages, whereas the SA structure encodes long-range relationships in the deeper stages (layers), followed by a decoder for human pose estimation
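For concreteness, the following illustrative Python sketch walks through the four-stage layout just described, computing per-stage feature-map sizes for a 256 × 192 input; the per-stage values (patch sizes \(P_s\), dimensions \(C_s\), depths \(L_s\), heads \(N_s\)) are placeholders, not the configuration reported in Tables 4 and 5.

```python
# Illustrative stage layout for a 256x192 input; the values below are
# placeholders, not the paper's actual configuration.
H, W = 256, 192
stages = [
    # (block type, P_s, C_s, L_s, N_s)
    ("mlp",         4,  64, 2, None),  # stage 1: shallow (FFN) blocks
    ("mlp",         2, 128, 2, None),  # stage 2: shallow (FFN) blocks
    ("transformer", 2, 256, 6, 8),     # stage 3: MHSA + FFN blocks
    ("transformer", 2, 512, 2, 16),    # stage 4: MHSA + FFN blocks
]

h, w = H, W
for s, (kind, p, c, depth, heads) in enumerate(stages, start=1):
    h, w = h // p, w // p                     # H_{s-1}/P_s x W_{s-1}/P_s
    print(f"stage {s}: {kind:11s} {depth} blocks, "
          f"feature map {h}x{w}x{c}" + (f", {heads} heads" if heads else ""))
```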

Design of shallow and deeper blocks of GITPose

As illustrated in Fig. 1, GITPose utilizes two distinct kinds of blocks: FFN blocks, hereafter referred to as shallow blocks, and transformer blocks, referred to as deeper blocks. Shallow blocks are used in the initial stages. Precisely, a shallow block is built upon an FFN comprising two fully connected (FC) layers with a GELU [53] nonlinearity between them. For every FFN at stage \(s\), an expansion ratio \({\mathcal{E}}_{s}\) is used: the first FC layer increases the token's feature dimension from \({C}_{s}\) to \({\mathcal{E}}_{s}\times {C}_{s}\), and the second FC layer reduces it back to \({C}_{s}\). Concretely, let \({\Gamma }_{\mathcalligra{i}-1}{\in {\mathbb{R}}}^{(\frac{{H}_{s-1}}{{P}_{s}} \times \frac{{W}_{s-1}}{{P}_{s}}) \times {C}_{s}}\) be the input to the \(\mathcalligra{i}\)-th block of the \(s\)-th stage, where \(\mathcalligra{i}\) is the block index; then our shallow block can be expressed as

$${\Gamma }_{\mathcalligra{i}}=\Psi \left(\mathcal{L}\mathcal{N}\left({\Gamma }_{\mathcalligra{i}-1}\right)\right){ +\Gamma }_{\mathcalligra{i}-1},$$
(2)

where \(\Psi (\cdot)\) and \(\mathcal{L}\mathcal{N}(\cdot)\) denote the FFN and layer normalization, respectively. When \({\Gamma }_{0}\), which symbolizes the embedded features, is projected to queries \({\mathcal{Q}}_{0}\), keys \({\mathcal{K}}_{0}\), and values \({\mathcal{V}}_{0}\), the queries, keys, and values can be expanded by the encoder using the projection matrix \({\Lambda }_{\mathcalligra{i}}\in {\mathbb{R}}^{C\cdot {2}^{\mathcalligra{i}-1}\times C\cdot {2}^{\mathcalligra{i}}}\). With this, \({\mathcal{Q}}_{\mathcalligra{i}}\), \({\mathcal{K}}_{\mathcalligra{i}}\), and \({\mathcal{V}}_{\mathcalligra{i}}\) can be computed as

$${\mathcal{Q}}_{\mathcalligra{i}}=\left({\Gamma }_{\mathcalligra{i}-1}\right){\Lambda }_{\mathcalligra{i}}+{\Theta }_{\mathcalligra{i}},$$
$${\mathcal{K}}_{\mathcalligra{i}}=\left({\Gamma }_{\mathcalligra{i}-1}\right){\Lambda }_{\mathcalligra{i}}+{\Theta }_{\mathcalligra{i}},$$
(3)
$${\mathcal{V}}_{\mathcalligra{i}}= \left({\Gamma }_{\mathcalligra{i}-1}\right){\Lambda }_{\mathcalligra{i}}+{\Theta }_{\mathcalligra{i}},$$

where \({\Theta }_{\mathcalligra{i}}\) denotes the positional embedding of index \(\mathcalligra{i}\). As shown in Fig. 2, our deeper block, as defined in ViT [4], consists of an MHSA layer (\(\upgamma \)) followed by an FFN and is used in the final stages; it may be written as

Fig. 2
figure 2

a Shallow block. b Deeper block. c Deformable token association (DTA)

$${\Gamma }_{\mathcalligra{i}}^{\prime}=\upgamma \left(\mathcal{L}\mathcal{N}\left({\Gamma }_{\mathcalligra{i}-1}\right)\right) +{\Gamma }_{\mathcalligra{i}-1},$$
(4)
$${\Gamma }_{\mathcalligra{i}}=\Psi \left(\mathcal{L}\mathcal{N}\left({\Gamma }_{\mathcalligra{i}}^{\prime}\right)\right){ +\Gamma }_{\mathcalligra{i}}^{\prime}.$$
(5)

This design affords our approach two significant benefits: (1) we avoid the enormous computational cost and memory footprint generated by long sequences in the initial stages, and (2) in contrast to prior efforts that compress the attention maps via sub-windows [27] or reduce the spatial dimension of the key and value matrices, we keep standard MHSA layers in the final two stages, retaining GITPose's ability to model long-range dependencies while maintaining moderate FLOPs owing to the pyramid design.
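A minimal PyTorch sketch of the two block types (Eqs. (2), (4), and (5)) is given below; it is illustrative only: the expansion ratios and the use of nn.MultiheadAttention are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class ShallowBlock(nn.Module):
    """Eq. (2): out = FFN(LN(x)) + x, with expansion ratio E_s."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * expansion),  # first FC: C_s -> E_s * C_s
            nn.GELU(),
            nn.Linear(dim * expansion, dim),  # second FC: back to C_s
        )

    def forward(self, x):                     # x: (B, tokens, C_s)
        return x + self.ffn(self.norm(x))


class DeeperBlock(nn.Module):
    """Eqs. (4)-(5): pre-norm MHSA, then an FFN, each with a shortcut."""
    def __init__(self, dim, heads, expansion=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, x):                                   # x: (B, tokens, C_s)
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]   # Eq. (4)
        return x + self.ffn(self.norm2(x))                  # Eq. (5)


# toy usage: shallow block on 768 tokens of dim 64; deeper block on 192 tokens of dim 256
out1 = ShallowBlock(64)(torch.randn(1, 768, 64))              # (1, 768, 64)
out2 = DeeperBlock(256, heads=8)(torch.randn(1, 192, 256))    # (1, 192, 256)
```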

Deformable token association (DTA)

Prior studies on HVTs [27, 54] relied on patch (token) merging to produce pyramidal feature representations. Nonetheless, they combine patches (tokens) from a regular grid, ignoring the fact that not all patches contribute equally to the output [55]. Motivated by deformable convolutions [53, 56], we offer a DTA module as our token merging (TM) step that learns a grid of keypoint offsets to dynamically sample relevant tokens. Letting \(\Phi \) denote a deformable convolution, we can express \(\Phi \) as

$$\Phi {(\Gamma )}_{\mathcalligra{t},:}={\sum }_{k\in \left(K\times K\right)}{\Gamma }_{\mathcalligra{t}+g\left(k\right)+\Delta g\left(k\right),:}\, {\Lambda }_{g\left(k\right)},$$
(6)

Unlike a conventional convolution, \(\Phi (\cdot)\) learns an offset \(\Delta g(k)\) for every predetermined offset \(g(k)\). To learn \(\Delta g\left(k\right)\), a separate convolutional layer is applied over the input feature map \(\Gamma \). To merge keypoint tokens adaptively, we employ one \(\Phi \) layer in the DTA module shown in Fig. 2, defined as follows:

$$\Xi \left(\Gamma \right)=\mu (\aleph (\Phi (\Gamma ))), $$
(7)

where \(\aleph (\cdot)\) denotes batch normalization [57] and \(\mu (\cdot)\) the GELU nonlinearity.
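The following hedged PyTorch sketch illustrates one possible realization of DTA (Eqs. (6) and (7)) using torchvision's DeformConv2d, with a separate convolution predicting the offsets \(\Delta g(k)\); the kernel size, stride-2 token merging, and channel widths are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DTA(nn.Module):
    """Sketch of deformable token association: a deformable conv whose offsets
    come from a separate conv over the input, followed by BN and GELU (Eq. 7).
    Stride-2 downsampling stands in for token merging (an assumption)."""
    def __init__(self, in_dim, out_dim, kernel_size=3, stride=2):
        super().__init__()
        pad = kernel_size // 2
        # separate conv layer that predicts the 2*K*K offsets Δg(k)
        self.offset = nn.Conv2d(in_dim, 2 * kernel_size * kernel_size,
                                kernel_size, stride=stride, padding=pad)
        self.dconv = DeformConv2d(in_dim, out_dim, kernel_size,
                                  stride=stride, padding=pad)
        self.bn = nn.BatchNorm2d(out_dim)
        self.act = nn.GELU()

    def forward(self, x):                      # x: (B, C_s, H, W)
        offsets = self.offset(x)               # Δg(k) for every output location
        return self.act(self.bn(self.dconv(x, offsets)))   # Eq. (7)

# toy usage: merge 64x48 tokens of dim 64 into 32x24 tokens of dim 128
y = DTA(64, 128)(torch.randn(1, 64, 64, 48))   # -> (1, 128, 32, 24)
```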

Transformer decoder

To process the feature tokens retrieved from the backbone and localize the keypoints, we employ a straightforward, lightweight decoder. This final stage of keypoint detection converts the extracted feature maps into the specified number of keypoint heatmaps (i.e., 17 for the COCO dataset). We adopt the classic decoder of ViTPose [25], which consists of two deconvolution blocks, each with one deconvolution layer followed by batch normalization [57] and ReLU [58]. Following the standard configuration of earlier approaches [13, 59], each block upsamples the image features by a factor of 2. Finally, the localization heatmaps for the keypoints are generated using a convolution layer with a \(1\times 1\) kernel, denoted as

$${\mathcal{V}}_{\mathcal{D}}={\mathrm{Conv}}_{1\times 1}\left({\mathfrak{D}}_{\varphi }\left({\mathfrak{D}}_{\varphi }\left(\chi \right)\right)\right),$$
(8)

where \({\mathcal{V}}_{\mathcal{D}}\) denotes the keypoint heatmaps estimated by our decoder, produced by the final \(1\times 1\) convolutional layer, \(\chi \) denotes the extracted feature maps, and \({\mathfrak{D}}_{\varphi }\) represents a deconvolution block.
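As an illustration of the decoder described above (two deconvolution blocks followed by a 1 × 1 convolution producing 17 heatmaps), here is a minimal PyTorch sketch; the hidden width, deconvolution kernel size, and input feature shape are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class KeypointDecoder(nn.Module):
    """Sketch: two deconvolution blocks (deconv + BN + ReLU), each upsampling
    by 2, followed by a 1x1 convolution producing K keypoint heatmaps."""
    def __init__(self, in_dim, hidden=256, num_keypoints=17):
        super().__init__()
        def deconv_block(cin, cout):
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        self.deconv1 = deconv_block(in_dim, hidden)
        self.deconv2 = deconv_block(hidden, hidden)
        self.head = nn.Conv2d(hidden, num_keypoints, kernel_size=1)  # Eq. (8)

    def forward(self, feats):                  # feats: (B, C, h, w) backbone output
        return self.head(self.deconv2(self.deconv1(feats)))

# toy usage: an 8x6 feature map -> 32x24 heatmaps for 17 COCO keypoints
hm = KeypointDecoder(768)(torch.randn(1, 768, 8, 6))   # -> (1, 17, 32, 24)
```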

Experiments

Setup

Dataset: We evaluate GITPose on the challenging COCO benchmark [60]. The dataset comprises approximately 160 k images gathered from the internet and grouped into 80 main categories, separated into three subsets: train2017 (118 k images), val2017 (5 k images), and test2017 (41 k images). For HPE, the COCO dataset contains over 200 k images of over 150 k individuals annotated with 17 keypoints, split into a train set (57 k images), a validation set (5 k images), and a test-dev set (20 k images). To facilitate comparison with SOTA architectures, we train on the training images and report results on both the validation set (val set) and the test set. The standard mean average precision (mAP) is used to report GITPose's precision. In addition, we adopt the COCO-standard object keypoint similarity (OKS), which is formulated as follows:

$$OKS= \frac{\sum_{i}\exp\left(-\frac{{d}_{i}^{2}}{2{s}^{2}{k}_{i}^{2}}\right)\delta \left({v}_{i}>0\right)}{\sum_{i}\delta \left({v}_{i}>0\right)}.$$
(9)

Given the 17 annotated keypoints \(i\in \{1, 2, 3, \ldots ,17\}\), \({d}_{i}\) is the Euclidean distance between the predicted keypoint and its corresponding ground truth, \({v}_{i}\) denotes the visibility of the ground truth, \(s\) is the object scale, \({k}_{i}\) is a per-keypoint COCO constant, and \(\delta (\cdot)\) equals 1 when the visibility is positive and 0 otherwise. Furthermore, in accordance with the standard COCO metrics, we report the mean average precision (mAP) and recall, i.e., the AP50, AP75, APS, APM, and APL precision scores and the AR50, AR75, ARS, and ARL recall scores, where "S", "M", and "L" denote small, medium, and large objects, respectively. We primarily use the average precision (AP) measure, the main challenge metric in COCO, and also report FLOPs to analyze the computational overhead relative to other methods.
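For reference, a small NumPy sketch of Eq. (9) for a single instance is given below; it follows the formula as written rather than any particular library implementation, and the toy \(k_i\) values are placeholders, not COCO's official per-keypoint constants.

```python
import numpy as np

def oks(pred, gt, vis, k, s2):
    """Eq. (9) for one instance. pred, gt: (17, 2) keypoint coordinates;
    vis: (17,) visibility flags; k: (17,) per-keypoint constants; s2: object scale s^2."""
    d2 = np.sum((pred - gt) ** 2, axis=-1)        # squared distances d_i^2
    mask = vis > 0                                # δ(v_i > 0)
    e = np.exp(-d2[mask] / (2.0 * s2 * k[mask] ** 2))
    return e.sum() / max(mask.sum(), 1)

# toy usage with placeholder values (the k_i below are NOT COCO's official constants)
rng = np.random.default_rng(0)
gt = rng.uniform(0, 100, (17, 2))
pred = gt + rng.normal(0, 2, (17, 2))
score = oks(pred, gt, vis=np.ones(17), k=np.full(17, 0.1), s2=900.0)
```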

In addition, we conducted a comprehensive experiment on MPII [61]. The MPII dataset contains around 25,000 images and 40,000 person instances with 16 joint labels. To provide fair comparisons, all images are cropped following the standard training settings [13, 40]. For the backbone network search, we randomly split the training data into two parts: (1) 80% for weight training and (2) 20% for updating the network architectural parameters.

Implementation details

We train on a single NVIDIA GeForce RTX 2080 Ti GPU and implement our method in PyTorch [62] using the mmpose library [63]. GITPose adheres to the conventional top-down configuration for pose estimation: a detector is first used to detect human instances, and GITPose then estimates the individual keypoints of the detected instances. SimpleBaseline's [13] detection results are used to evaluate GITPose's effectiveness on the popular COCO keypoint validation set. As backbones, we employ ViT-B, ViT-L, and ViT-H; the associated models are denoted GITPose-B, GITPose-L, and GITPose-H, respectively. The backbones are initialized with pretrained MAE [64] weights. For training the GITPose networks, the typical mmpose training parameters are used, with an input resolution of \(256\times 192\), the AdamW [65] optimizer, and a learning rate of 5e-4. We use the UDP [66] protocol for post-processing. The architectures are trained for 210 epochs, with the learning rate decayed by a factor of 10 at the 170th and 200th epochs.
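The optimization schedule described above can be sketched in plain PyTorch as follows; this is a simplified, hypothetical stand-in (with a dummy model and a placeholder training step), not the actual mmpose configuration used in our experiments.

```python
import torch

# A simplified stand-in for the training setup described above: AdamW with lr 5e-4,
# decayed by a factor of 10 at epochs 170 and 200, for 210 epochs in total.
model = torch.nn.Linear(10, 17)            # dummy model standing in for GITPose
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[170, 200], gamma=0.1)

for epoch in range(210):
    # train_one_epoch(model, optimizer)    # placeholder for the actual training step
    scheduler.step()
```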

Comparison with the state-of-the-art models

MPII keypoint detection

Table 1 compares the accuracy of state-of-the-art (SOTA) methods and GITPose in terms of PCKh@0.5. In particular, GITPose demonstrates greater accuracy than its predecessors with every backbone design. It is interesting to observe that the MPII training set is significantly smaller than COCO's, which suggests that our method generalizes well across different training data sizes. With an input resolution of 256 × 192 pixels, GITPose reaches an average of 93.7 PCKh with the ViT-B backbone. In addition, GITPose with the ViT-L backbone obtains 94.8 average PCKh at the same resolution, and GITPose with the ViT-H backbone obtains 94.3 average PCKh. Overall, GITPose achieves a performance gain of +0.7 over the best SOTA.

Table 1 Comparison with SOTA approaches on the MPII dataset (PCKh@0.5)

Coco keypoint detection

On the validation set: Table 2 presents the comparison between SOTA methods and GITPose on the COCO validation set. GITPose has several advantages: (1) the MLP blocks directly encode deep, rich local feature tokens in the initial stages, which drastically decreases computational cost; and (2) the DTA module, which flexibly fuses the more informative keypoint tokens, substantially improves convergence and accuracy. We find that SBL is the most efficient of the preceding networks without any lightweight design; however, GITPose further increases efficiency while maintaining competitive accuracy. With an input size of 256 × 192, GITPose consistently produces higher average precision (AP) than its counterparts across different backbone designs. GITPose obtains AP gains of 0.9% (76.7% vs. 75.8%), 0.5% (78.8% vs. 78.3%), and 0.9% (80.0% vs. 79.1%) compared with ViTPose (with ViT-B, ViT-M, and ViT-L backbones, respectively) at the same test resolution of 256 × 192. Additionally, our best model obtains AP gains of 4.4% (80.0% vs. 75.6%), 4.2% (80.0% vs. 75.8%), and 4.6% (80.0% vs. 74.4%) compared with HRFormer [69], TokenPose-L/D24 [70], and HRNet [40] at the same resolution.

Table 2 Comparison of GITPose and SOTA architectures on the COCO val

On the test-dev set: We further evaluate the proposed approach on the COCO test-dev set and compare it with the current best practices. Table 3 compares the results of GITPose and SOTA approaches. Following standard procedure, we employ the human bounding boxes provided by SBL [13]. Using ViTPose as a backbone, we establish additional versions of GITPose that achieve greater precision than their counterparts, as shown in Table 3. Precisely, with a small-sized model at a test resolution of 256 × 192, we attain 77.1% accuracy on MS-COCO, which is 0.2% better than the strong baseline ViTPose-B [25], and with a medium-sized model at the same resolution we achieve 79.4% accuracy, which is 0.8% better than ViTPose-M [25]. In addition, our large-sized model, which is our best model at the same resolution, obtains 81.1% accuracy on MS-COCO, which is 1.2% and 5.1% better than ViTPose-L [25] and TokenPose-L/D24 [70], respectively, while maintaining a similar number of parameters to ViTPose. Comparing GITPose-L with the best HRNet [40] model, we obtain 81.3 vs. 71.3, showing that our model performs 9.9% better than HRNet. In our experiments, we also observe that GITPose performs better with a larger backbone than with a smaller one.

Table 3 Comparison of GITPose and SOTA methods on the COCO test-dev

Ablation study

To investigate the impact of self-attention on our model, we train GITPose on COCO and gradually remove self-attention layers, layer by layer, at every stage. Tables 4 and 5 summarize the findings. First, after substituting the spatial-reduction attention (SRA) layers in GITPose-B with standard MSA layers, accuracy improves by 0.9%. This suggests that GITPose strikes a compromise between performance and efficiency. Next, by progressively removing MSA layers in the first two stages, we observe that accuracy decreases by only 0.1% and 0.6%. This suggests that the self-attention (SA) layers in the initial stages of our model play a less significant role in the final performance and behave similarly to pure MLP layers, because the shallow blocks are more concerned with encoding deep local patterns. Eliminating self-attention (SA) layers in the final two stages, however, leads to a significant performance decrease. These outcomes indicate that SA layers play a significant role in the later stages, and that capturing long-range dependencies is crucial for well-performing ViTs.

Table 4 Model details of GITPose MLP blocks
Table 5 Model details of GITPose transformer blocks. PS: patch size; CD: channel dimension; LN: number of blocks at every stage; NS: number of SA heads (N3 and N4 give the numbers for stages 3 and 4). The expansion ratio of GITPose at every stage is [4, 4, 8, 8]. The input resolution is 256 × 192

PS: patch size; CD: channel dimension; LN: number of blocks at every stage; NS: number of SA heads. Notably, the first two stages do not contain SA layers. The expansion ratio of GITPose at every stage is [4, 4, 8, 8]; GITPose uses an expansion ratio of 4 throughout the MLP blocks in GITPose-B, GITPose-L, and GITPose-H. The input resolution is 256 × 192

We also observe that GITPose achieves excellent results under occlusion, crowded scenes, fast motion, nearby people, and truncation, as depicted in Fig. 3. GITPose is capable of handling a variety of poses. In addition, it correctly infers occluded keypoints, whether self-occluded or occluded by other objects, as shown in row 2. The first and second rows show the results of detecting more than one person per image, which are also promising. Our model can deal with small, blurry, and low-light images, as shown in row 1, column 2. GITPose also performs extremely well when estimating single-person images. We carried out an experiment to visually examine the features learned by the network to gain insight into how GITPose achieves its performance. In Table 6, we show that with the same training settings, our best model, GITPose-H, achieves a competitive accuracy/speed trade-off (Fig. 4).

Fig. 3
figure 3

A visualization of illustrative keypoint detection results on the MS-COCO dataset

Table 6 Average precision and inference speed results on COCO with the same training settings
Fig. 4
figure 4

Visualization of keypoint detection heatmaps produced by GITPose on the COCO dataset. The first heatmap image shows that GITPose detects occluded keypoints more accurately

Conclusion

In this paper, we made a solid improvement over existing state-of-the-art human pose estimation models. This was made possible by the introduction of our novel shallow and deeper vision transformer blocks for feature extraction without any convolution. Specifically, we utilize MLPs to encode rich local feature tokens in the early stages, whereas self-attention (SA) modules encode longer-range relationships in the deeper layers, followed by a decoder for keypoint detection. In addition, our learnable deformable token association (DTA) module non-uniformly and dynamically combines informative tokens. Ablation experiments demonstrate the effectiveness of GITPose. Comprehensive evaluation on the COCO and MPII benchmark datasets reveals that GITPose achieves much improved accuracy: +0.9 AP on the val set, +1.1 AP on test-dev, and +0.7 PCKh on MPII compared with its state-of-the-art counterparts. We hope our research can inspire further study of transformer-based human pose estimation approaches, which may become more popular in real-time applications.

Future work will investigate the architecture for multi-person pose estimation and perform temporal tracking across frames using the PoseTrack dataset. Our findings indicate that transformer-based approaches can be effectively applied to human pose estimation (HPE) tasks, yielding comparable performance. However, a significant obstacle lies in the substantial size of transformer models, which possess about twice the number of parameters and six times (6×) the FLOPs of comparable CNN models. Additional research is required to reduce the computational expense while achieving superior performance in comparison to convolutional neural network (CNN) models.