1 Introduction

In recent years, research on humanoid robots has attracted increasing attention owing to their versatile applications in assisting elderly or physically challenged people, healthcare, public entertainment, personal care, education, search-and-rescue operations, and manufacturing. Many humanoid robots mimic human behaviors such as walking, talking, and grasping. Motivated by these applications, researchers began developing humanoid robots, with early humanoid robots appearing in the USA in the 1930s. Waseda University later developed humanoid robots such as WABIAN-II, and Honda Corporation developed the humanoid robot ASIMO. Owing to the usefulness of humanoid robots in various fields, organizations such as Toyota, Samsung, Hanson Robotics, NASA, Boston Dynamics, MIT, UBTECH, and Columbia University have developed various versions of humanoid robots [1, 2]. Many humanoid robots are now capable of intelligent behavior thanks to recent advances in artificial intelligence (AI), machine learning (ML), computer vision, cognitive computing, natural language processing, and accelerated hardware. These techniques help a robot extract useful information from its environment through sensors. Computer vision and AI provide humanoid robots with a new perspective on basic actions such as walking and grasp manipulation, and the ability to contextualize, visualize, and react to the environment through vision is becoming predominant. Computer vision techniques are the building blocks of image and video processing; they are mainly concerned with object detection, image processing, gesture recognition, image segmentation, object tracking, and pose estimation. One of the most important tasks is to estimate the human pose and track the various landmarks (joint locations). Human pose estimation (HPE) predicts and classifies the posture of the human body and its joint locations in an image or video. The 2D/3D joint coordinates of the shoulders, elbows, wrists, knees, ankles, arms, eyes, and ears are the key points that describe a human pose. Pose estimation techniques fall into two main categories: (i) 2D pose estimation, which extracts the x and y coordinates of all joint landmarks, and (ii) 3D pose estimation, which additionally extracts the z-coordinate (depth) along with the (x, y) coordinates. These techniques can be further categorized into kinematic-based, shape- or contour-based, and volume-based models [3,4,5,6,7,8]. Many researchers use skeleton-tracking algorithms based on classical approaches [9] as well as intelligent approaches [10,11,12,13,14], and deep CNN architectures are widely adopted for human pose estimation; some of them are listed in Table 1.

Bujalance and Moutarde [15] presented real-time control of a universal robot arm using a pose estimation framework. They adopted the OpenPose and human mesh recovery (HMR) frameworks and then used inverse kinematics (IK) and forward kinematics (FK) to obtain the joint angles from the detected pose key points. Chamorro et al. [16] proposed a lidar-based gesture recognition system for teleoperation of a mobile robot, adopting long short-term memory (LSTM) and CNN architectures for pose estimation.
Their approach uses static and dynamic lidar input, and the initial pose is extracted from a point cloud with the help of Euclidean clustering. Zimmermann et al. [17] presented a human pose estimation framework using the OpenPose library and VoxelPoseNet; the adopted VoxelPoseNet is inspired by U-Net, also known as an encoder-decoder neural network architecture. In that work, a PR2 robot imitates the actions of actors using the pose estimation framework, and the results are compared with marker-based estimation techniques. Gago et al. [18] discussed the application of an LSTM network to convert natural language into Spanish sign language. To understand sign language, the authors used human skeleton (pose) estimation, adopting OpenPose and a skeleton-retriever library to acquire the joints; the resulting signs were tested on the TEO humanoid robot. Amini et al. [19] proposed a novel deep-learning model for 2D pose estimation of a humanoid robot. They introduced a humanoid robot pose dataset, and their model is based on a bottom-up single-stage encoder-decoder architecture, which is more efficient than top-down approaches; a comparative study with other state-of-the-art models was also presented. Michel et al. [20] presented a markerless 3D human pose estimation method for tracking joint locations, adopting three different approaches, namely OpenNI, HYBRID, and FHBT. The 3D human pose estimate was used for positioning the arm of a NAO robot, and a comparative study of all three methods was conducted. Later, Liang et al. [21] proposed a vision-based markerless pose estimation framework for articulated construction robots, using a stacked hourglass deep neural network to estimate the joint locations of an articulated robot; the concept is similar to human pose estimation but is applied to extract the joint information of articulated machines. Cai et al. [22] discussed tracking a patient's upper-limb motion using a Kinect depth camera with VICON markers. A Barrett WAM manipulator tracked the patient's upper-limb movement during rehabilitation exercises; Kinect v2 with VICON markers was used to extract the pose information, and a qualitative analysis of the joint angles and velocities was performed.

Table 1 Human pose estimation frameworks

Gao et al. [23] proposed a parallel deep neural network model to estimate body pose and detect both hands. In this work, ResNet-Inception layers with a Single Shot MultiBox Detector (RI-SSD) are used in parallel to detect the two hands, while a VGG-19 architecture trained on the COCO dataset is used for human pose estimation; the left and right hands are then classified by combining the RI-SSD and VGG-19 outputs. This hand detection with pose estimation was further tested on a second-generation astronaut assistant robot. Hernandez et al. [24] presented a human pose estimation system using a double Kinect sensor setup to obtain the actual joint variables and locations. Kinect v2 depth sensors are used to extract the human skeleton information, which is treated as ground truth. Two state-of-the-art HPE frameworks, OpenPose and Detectron2, are then used to estimate the joint landmarks of the human pose, and the shoulder and elbow joint angles extracted from OpenPose and Detectron2 are compared with the ground truth. McNally et al. [12] proposed a neuro-evolution architecture based on a 2D convolutional neural network along with a weight transfer function; the efficiency of the proposed model was increased using a multi-optimization method on the validation loss. Jin et al. [25] developed a top-down approach called ZoomNet, which is based on Faster R-CNN and a new COCO whole-body dataset with manual annotation of four bounding boxes and 133 key points. Tu et al. [26] discussed a cuboid proposal network (CPN) with a pose regression network (PRN), which is built on a voxel-to-voxel network with 3D convolutions as its building block. Dai et al. [11] proposed a cascaded hierarchical CNN architecture known as 4CHNet for RGB image-based 3D hand pose estimation. Plantard et al. [27] presented a Kinect sensor-based ergonomic analysis of virtual mannequin postures, in which the joint landmarks are evaluated together with a rapid upper limb assessment (RULA) using a Kinect sensor. Bashirov et al. [28] developed real-time RGB-D-based 3D pose estimation, using a Kinect RGB-D camera to obtain the body pose, hand pose, and facial expression in real time. In addition, Zhang et al. [29] proposed a new method for pose estimation using a Kinect sensor with a perspective-n-point (PnP) algorithm; the PnP algorithm is used to obtain the relative positions of the cameras and to map real 3D points in space to the 2D camera image. Sarsfield et al. [30] introduced a clinical assessment of human posture using a Kinect sensor and performed a comprehensive analysis of pose estimation for rehabilitation applications, focusing on upper-body pose estimation for stroke rehabilitation; they concluded that pose estimation yields significant errors when comparing the joint variables of the shoulder, arm, and elbow. Saeed et al. [5] proposed a frame-based approach for head pose estimation using a Haar cascade algorithm, creating a frame from a 2D color image and a 3D depth point cloud using feature extraction. Wu et al. [31] discussed a model-based recursive matching algorithm for pose estimation; the algorithm takes a 2D image with 3D point-cloud data as input for model fitting, and the obtained results show higher accuracy than Kinect real-time pose estimation. Obdrzalek et al.
[32] assessed the accuracy of joint localization and the robustness of pose estimation with respect to orientation and occlusion using a Kinect sensor. They used an impulse motion capture system to track LED markers attached to various joint locations, providing a reference for the accuracy of Kinect pose estimation in the context of exercise training for elderly people. A more detailed and comprehensive review of the literature can be found in the article by Bazarevsky et al. [33].

Based on the above literature, it is observed that many researchers are contributing to deep learning-based pose-tracking algorithms, and Kinect v1 and v2 sensors are frequently used to create 3D point clouds and datasets for HPE. The main challenges of HPE algorithms are real-time implementation and the minimization of joint-angle errors; therefore, a real-time inverse kinematic solver is employed to calculate the joint angles for the given elbow-wrist coordinates. These methods are quite accurate and have been implemented on various robots. OpenPose, HMR, OpenNI, VoxelNet, PoseNet, etc., as discussed in the literature, are among the most popular pose estimation algorithms and have been adopted by many researchers. Apart from these state-of-the-art algorithms, the MediaPipe framework also yields low error and reliably classifies the various joint landmarks. To the best of the authors' knowledge, real-time positioning of a humanoid robot arm using the MediaPipe framework has not yet been reported, and the performance of the MediaPipe pose estimation framework in terms of standard error is also missing. The current research article therefore deals with Kinect sensor-based skeleton tracking and the MediaPipe HPE framework for extracting the joint angles of human pose landmarks and implementing them on a real-time humanoid robot prototype. The performance of the adopted algorithms is compared in terms of standard error. The main contributions of this work include a comprehensive study of various HPE algorithms and their implementation in real time. The authors also developed a 3D-printed robot prototype used to implement the HPE framework. Two different methods, Kinect-based skeleton tracking [49] and the MediaPipe framework [33, 50, 51], are considered for pose estimation. An inverse kinematic algorithm is then used to calculate the joint angles of the real-time robot as well as those of the adopted HPE frameworks, and comparisons are made between the frameworks and the real-time robot in terms of joint angles. Finally, the standard error for all joint landmarks and arm angles is calculated; it was found that the standard error of the MediaPipe-based solution is lower than that of the Kinect-based skeleton tracking method. In related work, Jong et al. [52] discussed combining a more sophisticated humanoid model with a fast optimization method to estimate the joint angles of a 3D pose based on a humanoid model, and Alberto et al. [53] proposed a systematic procedure for collaborative tasks in a dynamic environment, focusing mainly on the contribution and mapping of reference frames.

2 Mathematical formulation and its algorithms

Many researchers have developed pose estimation algorithms, which can be based on learning approaches or on human body models. These methods act as the building blocks for joint tracking and pose estimation. The most conventional approach is to calculate the joint angles using inverse kinematic (IK) algorithms, which yield fast and accurate results for a given end-effector position and orientation. To test the IK algorithm together with the HPE frameworks, a custom 3D-printed humanoid robot prototype is used, as shown in Fig. 1. The prototype humanoid robot is equipped with micro servo motors at all joints.

Fig. 1

Prototype 3D printed humanoid robot

2.1 Forward and inverse kinematics

The kinematics of the humanoid robot's upper arm is solved using an analytical approach, which involves both forward and inverse kinematic equations. First, the forward kinematics of the robotic manipulator is solved after assigning coordinate frames at each joint of the humanoid robotic arm to obtain the position and orientation of the end effector; Fig. 2 shows the coordinate frames assigned at each joint of the robotic arm. Once the forward kinematics is established, the inverse kinematics is used to obtain the joint angles from the desired end-effector position and orientation. The inverse kinematic equations are given in Eqs. (1) and (2).

$$\theta_{2} = \pm cos^{ - 1} \left( {\frac{{\left\| X \right\|^{2} - l_{1}^{2} - l_{2}^{2} }}{{2l_{1} l_{2} }}} \right)$$
(1)
$$\theta_{1}^{i} = \theta - \theta^{i}$$
(2)
Fig. 2

Coordinate frames assigned at each joint of the robotic arm

where $\theta = \mathrm{atan2}(Y_{1}, X_{1})$ and $\theta^{i} = \mathrm{atan2}(l_{2}\sin\theta_{2},\, l_{1} + l_{2}\cos\theta_{2})$.

The algorithm for the upper arm is given as follows:


IK Algorithm
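
A minimal Python sketch of this analytical IK, assuming a planar two-link arm with link lengths l1 and l2, is given below; the function name and the elbow-up/elbow-down selection flag are illustrative and not part of the original implementation.

```python
import math

def two_link_ik(x, y, l1, l2, elbow_up=True):
    """Analytical IK for a planar two-link arm (Eqs. (1) and (2)).

    Returns (theta1, theta2) in radians for a target position (x, y),
    or None if the target lies outside the reachable workspace.
    """
    r2 = x * x + y * y                                   # ||X||^2
    c2 = (r2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        return None                                      # target not reachable
    theta2 = math.acos(c2)                               # Eq. (1): +/- cos^-1(...)
    if not elbow_up:
        theta2 = -theta2
    theta = math.atan2(y, x)                             # theta = atan2(Y1, X1)
    theta_i = math.atan2(l2 * math.sin(theta2),
                         l1 + l2 * math.cos(theta2))     # theta^i
    theta1 = theta - theta_i                             # Eq. (2)
    return theta1, theta2

# Example: wrist target (0.12, 0.08) m with 0.10 m upper-arm and forearm links
print(two_link_ik(0.12, 0.08, 0.10, 0.10))
```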

2.2 MediaPipe-based HPE

The inverse kinematic algorithm discussed above is applied to the MediaPipe HPE framework to calculate the joint angles and position of the human arm. The joint angles are defined with respect to positive and negative planes, as shown in Fig. 3: if the hand falls in the positive plane, the joint angle is calculated from the arc tangent of the wrist coordinates, while negative angles are obtained when the hand falls in the negative plane. Based on this convention, the joint variables for position control of the robotic arm are communicated through the Python-Arduino pyserial library. The joint variables are transmitted every millisecond, and the robot arm mimics the human gestures based on the received joint information. The detected joint landmarks and the corresponding joint angles are computed using the inverse kinematic algorithm; a minimal code sketch of this angle computation and serial transmission follows Fig. 3.

Fig. 3

Coordinate planes for positive and negative joint angles
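
The following is a minimal sketch of the angle computation and serial transmission described above, assuming the elbow and wrist landmarks are already available as normalized (x, y) image coordinates and that an Arduino listens on a serial port; the port name, baud rate, packet format, and function names are illustrative assumptions rather than the original implementation.

```python
import math
import serial  # pyserial

# Hypothetical serial port and baud rate; the Arduino-side parser is assumed
arduino = serial.Serial("/dev/ttyUSB0", 115200, timeout=0.01)

def elbow_wrist_angle(elbow, wrist):
    """Signed elbow-wrist angle in degrees from two normalized (x, y) landmarks.

    The sign follows the positive/negative plane convention of Fig. 3:
    image y grows downwards, so it is flipped here to make "up" positive.
    """
    dx = wrist[0] - elbow[0]
    dy = elbow[1] - wrist[1]
    return math.degrees(math.atan2(dy, dx))

def send_joint_angles(angles_deg):
    """Stream joint angles as a comma-separated ASCII line, e.g. 90.0,45.0."""
    packet = ",".join(f"{a:.1f}" for a in angles_deg) + "\n"
    arduino.write(packet.encode("ascii"))

# Example with made-up normalized landmark coordinates
angle = elbow_wrist_angle(elbow=(0.45, 0.55), wrist=(0.60, 0.40))
send_joint_angles([angle])
```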

The MediaPipe graph used for pose estimation is shown in Fig. 4a, b. The flow chart shows the node connectivity of the framework: the input (audio or video) is processed and transformed by modular components, shown in yellow and light blue, which together form a pipeline. Each pipeline component is connected to specific input and output nodes, and each node in the flow chart is implemented as a calculator. Figures 5 and 6 show the pose detection and pose renderer subgraphs, respectively. Overall, MediaPipe consists of three major components:

  1. Input framework for sensory information (i.e., audio/video),

  2. Tools for performance evaluation, and

  3. Processing components known as calculators.

Fig. 4

Flow chart shows a MediaPipe main pipeline b pose tracking procedure

Fig. 5

Flow chart shows the pose detection procedure

Fig. 6

Flowchart shows the pose renderer procedure

These components are the backbone for pose detection, object tracking, image segmentation, motion tracking, box tracking, and related tasks. Although many HPE models have been proposed in recent years, MediaPipe is one of the most efficient frameworks for building machine learning-based solutions, with the flexibility to deploy to mobile, web, edge, or cloud-based applications. This framework is therefore leveraged here for controlling the humanoid robot arm in real time.
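
For reference, the same pose-tracking pipeline is also exposed through the MediaPipe Python solution API; the short sketch below only illustrates obtaining pose landmarks from a webcam and is not the exact graph configuration used in this work (the camera index and confidence thresholds are assumptions).

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
cap = cv2.VideoCapture(0)  # assumed webcam index

# Confidence thresholds below are illustrative values
with mp_pose.Pose(min_detection_confidence=0.5,
                  min_tracking_confidence=0.5) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV captures BGR frames
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            # Normalized (x, y) of the left wrist landmark, as an example
            lw = results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_WRIST]
            print(lw.x, lw.y)
cap.release()
```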

In Fig. 4a, "input_frames_gpu" specifies the graph input, which by default queues up to 100 frames for further processing. This node is connected to the "Image Transformation" calculator, which flips the input image horizontally. The "PoseTrackingSubgraph" node then performs pose tracking: it takes the flipped input frames as input and outputs the pose landmarks, the normalized rectangle information, and the pose detections. Finally, the "PoseRendererSubgraph" node renders the pose on the input frames; it takes the flipped images, pose landmarks, normalized rectangle, and pose detections as input streams and writes the final frames with the rendered poses to the "output_frames_gpu" stream.

In summary, the MediaPipe graph takes frames from "input_frames_gpu," performs the image transformation (horizontal flip), applies the pose tracking subgraph, and finally renders the poses on the input frames, producing the "output_frames_gpu" stream. Figure 4b depicts the pose tracking subgraph, which is referenced in the main graph as a node of type "PoseTrackingSubgraph." This subgraph takes an input video stream, performs pose detection and landmark localization, and outputs pose-related information such as landmarks, normalized rectangles, and pose detections. The flow is controlled by the presence of a pose in the previous frame, and feedback mechanisms ensure continuity in decision-making across frames.

Figure 5 depicts the MediaPipe subgraph used for pose detection, which is part of the larger pipeline described above. This subgraph takes the input video, transforms the images, runs a pose detection model, performs post-processing steps such as non-maximum suppression, adjusts the detections for letterboxing, and outputs the final pose-related information, including pose detections and normalized rectangles. Figure 6 depicts the subgraph used for rendering the results of the pose tracking pipeline, referenced in the main graph as a node of type "PoseRendererSubgraph." It takes input streams containing pose detections, landmarks, and normalized rectangles, calculates the necessary rendering information, and outputs a final image with annotations and overlays, which serves as the output of the overall pose-tracking pipeline.

2.3 RGB-D-based skeleton tracking

The main challenge of these computer vision-based algorithms is calculating depth in real time. Depth information is crucial for handling uncertainty in the environment and for grasping any object of interest. To address this, the Microsoft Kinect sensor is used to obtain depth information and control the position of the robot arm. With RGB cameras alone, mapping multiple joint coordinate frames to human pose landmarks is quite noisy and inaccurate, and the reference coordinates of the joint landmarks vary from frame to frame, whereas the Kinect infrared (IR) sensor provides accurate depth measurements. The IR sensor combined with the RGB sensor creates a 3D point cloud or depth profile of an object, which can then be used to derive the joint landmarks of the human pose [51,52,53,54,55]. The wrist coordinates are extracted from the skeleton and fed to the inverse kinematics solver to calculate the joint angles, which are then communicated through the pyserial module. In the current research work, Python 3.7.5 and pyserial 3.5 are used. Figure 7 shows the basic steps of the Kinect-based HPE approach [56]; a minimal sketch of the angle extraction step is given after Fig. 7.

Fig. 7

Flow chart shows the overall structure of the proposed framework
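
The wrist coordinates extracted here are fed to the IK solver of Sect. 2.1; as a complementary check, a joint angle can also be read directly from three tracked 3D skeleton joints, as in the minimal sketch below (the helper name, the use of NumPy, and the example coordinates are assumptions, not the original code).

```python
import numpy as np

def joint_angle_deg(a, b, c):
    """Angle in degrees at joint b formed by 3D points a-b-c,
    e.g. shoulder-elbow-wrist from the tracked skeleton."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    u, v = a - b, c - b
    cos_angle = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# Made-up joint coordinates in metres (Kinect camera frame)
shoulder, elbow, wrist = (0.20, 0.40, 2.00), (0.25, 0.15, 2.00), (0.45, 0.10, 2.00)
print(joint_angle_deg(shoulder, elbow, wrist))
```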

3 Results and discussion

The performance of each framework discussed in the previous section is analyzed individually. The accuracy of each framework is then evaluated against the analytical inverse kinematic solution, and real-time joint angles are recorded for calculating the error. A comparison of Kinect-based skeleton tracking and the MediaPipe framework in terms of standard error is shown in Fig. 8.

Fig. 8

Standard error-based comparison of all joint variables

Figure 9a–e shows the various joint angles, namely the left and right shoulder-elbow, the left and right elbow-wrist, and the head angles, obtained from the robot, MediaPipe, and the Kinect sensor. The joint angles obtained from the Kinect sensor exhibit multiple misclassifications compared with MediaPipe, whereas the MediaPipe joint angles, obtained from a regular webcam, yield better results than the Kinect sensor.

Fig. 9

Various joint angles obtained from the robot, MediaPipe, and the Kinect sensor. a Left shoulder-elbow, b right shoulder-elbow, c left elbow-wrist, d right elbow-wrist, and e head

The proposed system further demonstrates the robustness of the adopted pose estimation frameworks. The joint angles produced by all frameworks are collected and saved in a CSV file, and a sample of the collected data is represented as a boxplot in Fig. 10. All of the dynamic poses were accurately mimicked and mapped onto the real-time humanoid robot arm. Moreover, a sample of frames recorded from the Kinect-based skeleton tracking is shown in Fig. 11, which depicts 18 different dynamic poses captured by the Kinect sensor; all of these samples were collected from recorded video.
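
As an illustration of this logging step, the sketch below summarizes a joint-angle CSV as a boxplot; the file name and column names are hypothetical and do not reflect the actual log format.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical log file: one column per joint landmark
df = pd.read_csv("joint_angles.csv")
cols = ["LSE", "RSE", "LEW", "REW", "Head"]   # assumed column names

df[cols].plot(kind="box")                      # boxplot per joint angle
plt.ylabel("Joint angle (deg)")
plt.title("Distribution of recorded joint angles")
plt.tight_layout()
plt.show()
```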

Fig. 10

Various joint angles are shown in the boxplot

Fig. 11

Kinect sensor-based skeleton tracking and corresponding joint angles

Figure 12 shows the different postures and the corresponding angles obtained from the MediaPipe framework; as with the Kinect sensor, eighteen different dynamic poses were obtained. Figure 13 shows a few samples of the real-time 3D world coordinates of human postures; these 3D coordinates are measured in meters with the origin at the hip center, and the angles are signed according to the positive and negative plane convention. Finally, real-time dynamic control of the humanoid robotic arm using these frameworks is shown in Fig. 14. Because it is difficult to assess performance from these pictorial representations of poses and joint angles alone, the standard errors are discussed next.

Fig. 12

MediaPipe-based joint landmarks and angles

Fig. 13

Samples of MediaPipe pose estimation real-world 3D coordinates

Fig. 14

Real-time joint angles and positioning of the robotic hand

Figure 15 shows the error bar plot for each key point of the human pose as well as the corresponding humanoid robot joint angles. These plots were drawn from posture data collected in real time, and the Python matplotlib library was used to produce all the graphs. It is visible from the error bar plots that the Kinect-based joint angles deviate considerably from the real-time robot joint angles and, as discussed above, contain a number of outliers. The error bars are colored orange, red, and green for better visualization, and the cap size indicates the error between the actual angles and the angles predicted by each framework. The standard errors produced by each framework are given in Table 2. The Kinect-based joint angles produce the largest errors: relative to MediaPipe, the standard error for the right elbow-wrist (REW) angle is 3.72 and for the head angle 0.7.
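
The sketch below shows one way such standard errors and error bars could be computed with matplotlib; the CSV layout, the column-naming convention, and the plotting details are assumptions rather than the exact analysis script used here.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("joint_angles.csv")           # hypothetical log file
joints = ["LEW", "REW", "LSE", "RSE", "Head"]  # assumed joint labels
frameworks = ["robot", "kinect", "mediapipe"]  # assumed column-name suffixes

means, errs = {}, {}
for fw in frameworks:
    cols = [f"{j}_{fw}" for j in joints]       # e.g. 'REW_kinect' (assumed naming)
    means[fw] = df[cols].mean().to_numpy()
    # Standard error of the mean: std / sqrt(n)
    errs[fw] = df[cols].std().to_numpy() / np.sqrt(len(df))

x = np.arange(len(joints))
for i, fw in enumerate(frameworks):
    plt.errorbar(x + 0.1 * i, means[fw], yerr=errs[fw],
                 fmt="o", capsize=4, label=fw)
plt.xticks(x, joints)
plt.ylabel("Joint angle (deg)")
plt.legend()
plt.show()
```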

Fig. 15

Error bar plots of various joint angles a LEW, b REW, c LSE, d RSE, and e head

Table 2 Standard error comparison between Robot, Kinect, and MediaPipe joint angles

4 Conclusions

A human pose estimation framework for real-time position control of a humanoid robot arm has been presented in this work. Human and robot joint angles were captured in real time from an RGB-D sensor and a regular 2D webcam, using Kinect-based skeleton tracking and the MediaPipe framework, respectively. The results show that position control of the humanoid robot arm using the MediaPipe pose framework with a regular webcam is feasible. Although depth-based estimation is more popular, such platforms are not as widely available as regular webcams or USB cameras. It is evident from the results that the MediaPipe framework tends to outperform Kinect-based skeleton tracking across all considered joint movements. The robot can mimic a human pose in real time regardless of the surrounding illumination or the presence of an unknown user in the frame, and the comparison covers both static and dynamic human body movements. In future work, more complex robot configurations can be considered for the development of human-robot interaction. Since the proposed frameworks are efficient and produce low errors, they can also be applied to gesture-based control, medical rehabilitation, and assistance for the elderly.