Introduction

Perception is of paramount importance for robots to establish a model of their internal state as well as the external environment. These models allow the robot to perform its task safely, efficiently and accurately. Perception is facilitated by various types of sensors which gather both proprioceptive and exteroceptive information. Humanoid robots, especially mobile ones, pose a difficult challenge for the perception process: mounted sensors are susceptible to jerky and unstable motion owing to the many degrees of freedom afforded by the articulable joints on a humanoid's body, e.g., the legs, hip, manipulator arms and neck.

For the purposes of this survey, we organize perception in humanoid robots into three broad yet overlapping areas: state estimation for balance and joint configurations; environment understanding for navigation, mapping and manipulation; and human-robot interaction for successful integration into a shared human workspace (see Fig. 1). For each area we discuss the popular applications, the challenges, and recent methodologies used to surmount them.

Internal state estimation is a critical aspect of autonomous systems, particularly for humanoid robots, both to address low-level stability and dynamics and as an auxiliary to higher-level tasks such as localization, mapping and navigation. Legged-robot locomotion is particularly challenging given the inherent under-actuated dynamics and the intermittent contact switching with the ground during motion.

The application of external environment understanding has a very broad scope in humanoid robotics but can be roughly divided into navigation and manipulation. Navigation implies the movement of the mobile bipedal base from one location to another without collision, thereby leaving the configuration of the external environment unchanged. Manipulation, on the other hand, is where the humanoid changes the physical configuration of its environment using its end-effectors.

It could be argued that human-robot interaction (HRI) is a subset of environment understanding. However, we separate the two areas based on their ultimate goals: the goal of environment understanding is to interact with inanimate objects, while the goal of HRI is to interact with humans. The posed challenges differ, though similar principles may be reused. Human detection, gesture and activity recognition, teleoperation, object handover and collaborative actions, and social communication are some of the main areas where perception is used.

Fig. 1

Perception for humanoid robots split into three principal areas. Left: state estimation, used to estimate derived quantities such as the CoM and ZMP from sensors such as IMUs and joint encoders. Right: environment understanding, whose very broad scope ranges from localization and mapping to environment segmentation for planning and beyond; human-robot interaction is closely related but deals exclusively with human beings rather than inanimate objects. Center: some sensors which aid in perception for humanoid robots. Sources for labeled images: (a) [1], (b) [2] and (c) [3]

State Estimation

Recent works on humanoid and legged robot locomotion control have focused extensively on state-feedback approaches [4]. Legged robots have highly nonlinear dynamics and need high-frequency (\(1\,kHz\)) and low-latency (\(<1\,ms\)) feedback for robust and adaptive control, which adds complexity to the design and development of reliable estimators for the base and centroidal states and for contact detection.

Challenges in State Estimation

Perceived data are often noisy and biased, and these errors are magnified in derived quantities. For instance, joint velocities tend to be noisier than joint positions, as they are obtained by numerically differentiating joint encoder values. Rotella et al. [5] developed a method to determine the joint velocities and accelerations of a humanoid robot using link-mounted Inertial Measurement Units (IMUs), resulting in less noise and delay than filtered velocities from numerical differentiation. An effective approach to mitigate biased IMU measurements is to explicitly introduce the biases as estimated states in the estimation framework [6, 7].
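As a simple illustration of the noise-amplification point, the sketch below (with hypothetical encoder resolution and sampling rate, not values from [5]) compares the error of a quantized joint-angle signal with the error of the velocity obtained by naively differentiating it:

    import numpy as np

    # Hypothetical illustration: differentiating quantized encoder readings amplifies noise.
    dt = 1e-3                                            # assumed 1 kHz sampling period [s]
    t = np.arange(0.0, 1.0, dt)
    q_true = 0.5 * np.sin(2 * np.pi * t)                 # true joint angle [rad]
    resolution = 2 * np.pi / 4096                        # assumed 12-bit encoder
    q_meas = np.round(q_true / resolution) * resolution  # quantized measurement

    qd_true = 0.5 * 2 * np.pi * np.cos(2 * np.pi * t)    # true joint velocity [rad/s]
    qd_naive = np.gradient(q_meas, dt)                   # finite-difference velocity

    print("RMS position error [rad]:  ", np.sqrt(np.mean((q_meas - q_true) ** 2)))
    print("RMS velocity error [rad/s]:", np.sqrt(np.mean((qd_naive - qd_true) ** 2)))

The position error stays at the quantization level, while the differentiated velocity error scales roughly with the quantization step divided by the sampling period, which motivates filtering or direct IMU-based velocity sensing.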

The high dimensionality of humanoids makes it computationally expensive to formulate a single filter for the entire state. As an alternative, Xinjilefu et al. [8] proposed decoupling the full state into several independent state vectors and used separate filters to estimate the pelvis state and the joint dynamics.

To account for kinematic modeling errors such as joint backlash and link flexibility, Xinjilefu et al. [9] introduced a method using a Linear Inverted Pendulum Model (LIPM) with an offset representing the modeling error in the Center of Mass (CoM) position and/or external forces. Bae et al. [10] proposed a CoM kinematics estimator that includes a spring and damper in the LIPM to compensate for modeling errors. To address link flexibility in the humanoid exoskeleton Atalante, Vigne et al. [11] decomposed the full state estimation problem into several independent attitude estimation problems, each corresponding to a given flexibility and a specific IMU, relying only on dependable and easily accessible geometric parameters of the system rather than on the dynamic model.
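To make the model concrete, the following minimal sketch treats the CoM offset as an extra, slowly varying state of a discretized LIPM so that a downstream Kalman filter could absorb modeling error; it is our own illustrative Euler discretization with assumed parameters, not the formulation of [9] or [10]:

    import numpy as np

    # Minimal LIPM-with-offset sketch: state x = [com, com_vel, offset].
    # com_acc = (g / z_c) * (com + offset - zmp); the offset absorbs modeling error
    # and is modeled as a random walk (constant in this noiseless prediction).
    g, z_c, dt = 9.81, 0.85, 0.005          # gravity, assumed CoM height [m], time step [s]
    w2 = g / z_c

    A = np.array([[1.0,     dt,  0.0],
                  [w2 * dt, 1.0, w2 * dt],
                  [0.0,     0.0, 1.0]])
    B = np.array([0.0, -w2 * dt, 0.0])      # input: measured ZMP position

    def predict(x, zmp):
        """One Euler-discretized prediction step of the offset-augmented LIPM."""
        return A @ x + B * zmp

    x = np.array([0.02, 0.0, 0.0])          # CoM 2 cm away from the ZMP, at rest
    for _ in range(200):                    # 1 s of prediction
        x = predict(x, zmp=0.0)
    print("Predicted CoM position after 1 s:", round(x[0], 3))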

In the remainder of this section, we classify the recent related works on state estimation into three main categories [12]: proprioceptive state estimation, which primarily involves filtering methods that fuse high-frequency proprioceptive sensor data; multi-sensor fusion filtering, which integrates exteroceptive sensor modalities into the filtering process; and multi-sensor fusion with state smoothing, which employs advanced techniques that leverage the entire history of sensor measurements to refine the estimated states.

Finally, we present a list of available open-source software for state estimation from reviewed literature in Table 1.

Table 1 Open-source software for humanoid robot state estimation. All cited software is available as ROS packages

Proprioceptive State Estimation

Proprioceptive sensors provide measurements of the robot's internal state. They are commonly used to compute leg odometry, which yields a pose estimate that drifts over time. For a comprehensive review of the evolution of proprioceptive filters for leg odometry, refer to [22] and [23].

Base State Estimation

In humanoid robots, the focus is on estimating the position, velocity, and orientation of the “base” frame, typically located at the pelvis. Recent state estimation approaches in this field often fuse IMU and leg odometry.

The work by Bloesch [6] was a decisive step in introducing a base state estimator for legged robots using a quaternion-based Extended Kalman Filter (EKF). This method made no assumptions about the robot's gait, number of legs, or terrain structure, and included the absolute positions of the foot contact points and the IMU bias terms in the estimated state. Rotella et al. [7] extended it to humanoid platforms by considering the full foot plate and adding the foot orientation to the state vector. Both works showed that as long as at least one foot remains in contact with the ground, the absolute base velocity, the roll and pitch angles, and the IMU biases are observable. Other formulations of base state estimation using only proprioceptive sensing can be found in [16, 24], and [25].
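The structure of such filters can be sketched as follows. This is a heavily simplified, hypothetical prediction/correction cycle: covariance propagation is dropped, the orientation is kept as a rotation matrix rather than a quaternion, and a fixed blending gain replaces the Kalman gain, so it is not the estimator of [6] or [7]:

    import numpy as np

    # Hypothetical, stripped-down IMU + leg-odometry base estimator.
    def skew(w):
        return np.array([[0.0, -w[2], w[1]],
                         [w[2], 0.0, -w[0]],
                         [-w[1], w[0], 0.0]])

    def predict(p, v, R, acc_meas, gyro_meas, acc_bias, gyro_bias, dt):
        """Propagate base position p, velocity v and orientation R with bias-corrected IMU data."""
        g = np.array([0.0, 0.0, -9.81])
        acc_world = R @ (acc_meas - acc_bias) + g
        p = p + v * dt + 0.5 * acc_world * dt**2
        v = v + acc_world * dt
        R = R @ (np.eye(3) + skew((gyro_meas - gyro_bias) * dt))   # first-order orientation update
        return p, v, R

    def correct_with_contact_foot(p, R, foot_pos_world, fk_foot_in_base, gain=0.5):
        """Pull the base position toward the value implied by a stationary contact foot."""
        p_from_foot = foot_pos_world - R @ fk_foot_in_base
        return p + gain * (p_from_foot - p)

    p, v, R = np.zeros(3), np.zeros(3), np.eye(3)
    p, v, R = predict(p, v, R, acc_meas=np.array([0.0, 0.0, 9.81]), gyro_meas=np.zeros(3),
                      acc_bias=np.zeros(3), gyro_bias=np.zeros(3), dt=0.002)
    p = correct_with_contact_foot(p, R, foot_pos_world=np.array([0.1, 0.0, 0.0]),
                                  fk_foot_in_base=np.array([0.1, 0.0, -0.9]))
    print("Corrected base position:", np.round(p, 3))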

Centroidal State Estimation

Centroidal states in humanoid robots include the CoM position, linear and angular momentum, and their derivatives. The CoM serves as a vital control variable for stability and robust humanoid locomotion, making accurate estimation of centroidal states crucial in control system design for humanoid robots.

When the full 6-axis contact wrench is not directly available to the estimator, e.g., when the robot's gauge sensors measure only the contact normal force, some works have resorted to simplified models of the dynamics, such as the LIPM [26].

Piperakis et al. [27] presented an EKF to estimate the centroidal variables by fusing joint encoders, IMU, and foot-sensitive resistors, later also including visual odometry [13]. They formulated the estimator based on the nonlinear Zero Moment Point (ZMP) dynamics, which capture the coupling between the dynamic behavior in the frontal and lateral planes. Their results showed better performance than a Kalman filter formulation based on the LIPM.

Mori et al. [28] proposed a centroidal state estimation framework for a humanoid robot based on real-time inertial parameter identification, using only the robot’s proprioceptive sensors (IMU, foot Force/Torque (F/T) sensors, and joint encoders), and the sequential least squares method. They conducted successful experiments deliberately altering the robot’s mass properties to demonstrate the robustness of their framework against dynamic inertia changes.

With 6-axis F/T sensors on the feet, Rotella et al. [29] utilized the momentum dynamics of the robot to estimate the centroidal quantities. Their nonlinear observability analysis demonstrated the observability of either the biases or the external wrench. In a different approach, Carpentier et al. [30] proposed a frequency analysis of the information sources used to estimate the CoM position, and later the CoM acceleration and the derivative of the angular momentum [31]. They introduced a complementary filtering technique that fuses various measurements, including the ZMP position, sensed contact forces, and a geometry-based reconstruction of the CoM from joint encoders, according to their reliability in the respective spectral bandwidth.
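The spirit of such complementary filtering can be conveyed by the toy sketch below, which blends the low-frequency content of a kinematic CoM estimate with the high-frequency content of a dynamics-derived estimate; the crossover frequency and synthetic signals are illustrative assumptions, not the filter of [30, 31]:

    import numpy as np

    # Toy complementary filter: trust the kinematic CoM estimate at low frequency
    # and the dynamics-derived estimate at high frequency (crossover chosen arbitrarily).
    dt, f_c = 0.005, 1.0                              # sample time [s], crossover frequency [Hz]
    alpha = dt / (dt + 1.0 / (2 * np.pi * f_c))       # first-order low-pass coefficient

    def complementary_fuse(com_kinematic, com_dynamic):
        fused, low_kin, low_dyn = [], com_kinematic[0], com_dynamic[0]
        for c_kin, c_dyn in zip(com_kinematic, com_dynamic):
            low_kin += alpha * (c_kin - low_kin)      # low-pass of the kinematic estimate
            low_dyn += alpha * (c_dyn - low_dyn)
            fused.append(low_kin + (c_dyn - low_dyn)) # add the high-pass of the dynamic estimate
        return np.array(fused)

    # Synthetic signals: the kinematic estimate is unbiased but noisy,
    # the dynamic estimate is responsive but carries a constant offset.
    rng = np.random.default_rng(0)
    t = np.arange(0.0, 2.0, dt)
    truth = 0.02 * np.sin(2 * np.pi * 2.0 * t)
    com_kin = truth + 0.002 * rng.normal(size=t.size)
    com_dyn = truth + 0.01
    print("Final fused CoM estimate:", round(complementary_fuse(com_kin, com_dyn)[-1], 4))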

Fig. 2

State estimation with multi-sensor filtering, integrating LiDAR for drift correction and localization. Top row: filtering people from the raw point cloud. Bottom row: state estimation and localization with iterative closest point correction on the filtered point cloud. From [12]

Contact Detection and Estimation

Foot contact detection plays a crucial role in locomotion control, gait planning, and proprioceptive state estimation for humanoid robots. Recent approaches fall into two main groups: those directly utilizing measured ground reaction wrenches, and methods integrating kinematics and dynamics to infer the contact status by estimating the ground reaction forces. Fallon et al. [2] employed a Schmitt trigger with a 3-axis foot F/T sensor to classify contact forces and used a simple state machine to determine the most reliable foot for kinematic measurements. Piperakis et al. [13] adapted a similar approach using pressure sensors on the foot.
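A contact classifier of the first kind can be as simple as the hysteresis logic sketched below; the force thresholds are illustrative assumptions, not the values used in [2]:

    class SchmittContactDetector:
        """Hysteresis-based contact classification on the measured normal force."""

        def __init__(self, f_high=150.0, f_low=50.0):
            self.f_high = f_high          # force [N] above which contact is declared
            self.f_low = f_low            # force [N] below which contact is released
            self.in_contact = False

        def update(self, normal_force):
            if not self.in_contact and normal_force > self.f_high:
                self.in_contact = True
            elif self.in_contact and normal_force < self.f_low:
                self.in_contact = False
            return self.in_contact

    detector = SchmittContactDetector()
    for f in [10.0, 80.0, 200.0, 120.0, 60.0, 30.0]:
        print(f, detector.update(f))      # contact latches above 150 N, releases below 50 N

The hysteresis band prevents the contact flag from chattering when the normal force hovers near a single threshold at touchdown or lift-off.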

Rotella et al. [32] presented an unsupervised method for estimating contact states using fuzzy clustering on proprioceptive sensor data alone (foot F/T and IMU sensing), surpassing traditional approaches based on the measured normal force. Including the joint encoders in the proprioceptive sensing, Piperakis et al. [20] proposed an unsupervised learning framework for gait phase estimation, demonstrating effectiveness on uneven/rough-terrain walking gaits. They also developed a deep learning framework utilizing the F/T and IMU sensing in each leg to determine the contact state probabilities [33]; the generalizability and accuracy of their approach was demonstrated on different robotic platforms. Furthermore, Maravgakis et al. [34] introduced a probabilistic contact detection model using only IMU sensors mounted on the end effectors. Their approach estimated the contact state of the feet without requiring training data or ground truth labels.

Another active research field in humanoid robots is monitoring and identifying contact points on the robot's body. Common approaches focus on proprioceptive sensing for contact localization and identification. Flacco et al. [35] proposed using an internal residual of the external momentum to isolate and identify single contacts, along with detecting additional contacts with known locations. Manuelli et al. [36] introduced a contact particle filter for detecting and localizing external contacts using only proprioceptive sensing, such as 6-axis F/T sensors, capable of handling up to 3 contacts efficiently. Vorndamme et al. [37] developed a real-time method for multi-contact detection using 6-axis F/T sensors distributed along the kinematic chain, capable of handling up to 5 contacts. Vezzani et al. [38] proposed a memory unscented particle filter algorithm for real-time 6 Degrees-of-Freedom (DoF) tactile localization using contact point measurements made by tactile sensors.

Multi-Sensor Fusion Filtering

One drawback of base state estimation using proprioceptive sensing is the accumulation of drift over time due to sensor noise. This drift is not acceptable for controlling highly dynamic motions, and is therefore typically compensated by integrating other modalities from exteroceptive sensors, such as cameras, depth cameras, and LiDAR.

Fallon et al. [2] proposed a drift-free base pose estimation method by incorporating LiDAR sensing into a high-rate EKF estimator, using a Gaussian particle filter for laser-scan localization. Although their framework eliminated the drift, a pre-generated map was required as input. Piperakis et al. [39] introduced a robust Gaussian EKF to handle outlier detection in visual/LiDAR measurements for humanoid walking in dynamic environments. To address state estimation challenges in real-world scenarios, Camurri et al. [12] presented Pronto, a modular open-source state estimation framework for legged robots (Fig. 2). It combines proprioceptive and exteroceptive sensing, such as stereo vision and LiDAR, using a loosely-coupled EKF approach.

Multi-Sensor Fusion with State Smoothing

So far, we have explored Bayesian filtering methods for sensor fusion and state estimation. However, as the number of states and measurements increases, computational complexity becomes a limitation. Recent advancements in computing power and nonlinear solvers have popularized nonlinear iterative maximum a-posteriori (MAP) optimization techniques, such as factor graph optimization.
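Conceptually, MAP smoothing solves a single nonlinear least-squares problem over the whole measurement history. The 1-D toy problem below, with hypothetical odometry increments and one absolute measurement standing in for a loop closure or exteroceptive fix, illustrates the idea without any dedicated factor-graph library:

    import numpy as np
    from scipy.optimize import least_squares

    # Toy 1-D pose-graph MAP problem: estimate poses x_0..x_4 from noisy odometry
    # increments (between factors) and one absolute measurement of the last pose.
    odom = np.array([1.0, 1.1, 0.9, 1.0])          # measured increments
    z_abs, sigma_odom, sigma_abs = 4.0, 0.1, 0.05  # absolute fix and noise levels

    def residuals(x):
        r = [(x[0] - 0.0) / 1e-3]                                        # prior on the first pose
        r += [((x[i + 1] - x[i]) - odom[i]) / sigma_odom for i in range(4)]
        r += [(x[4] - z_abs) / sigma_abs]                                # absolute measurement factor
        return np.array(r)

    x0 = np.concatenate([[0.0], np.cumsum(odom)])  # initialize from integrated odometry
    sol = least_squares(residuals, x0)
    print("Odometry-only estimate:", np.round(x0, 3))
    print("MAP estimate:          ", np.round(sol.x, 3))

The whole trajectory is re-estimated jointly, so the absolute measurement corrects not only the last pose but every pose connected to it through the odometry factors.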

To address the issue of visual tracking loss in visual factor graphs, Hartley et al. [40] introduced a factor graph framework that integrated forward kinematic and pre-integrated contact factors. The work was extended by incorporating the influence of contact switches and associated uncertainties [41]. Both works showed that the fusion of contact information with IMU and vision data provides a reliable odometry system for legged robots.

Solà et al. [18] presented an open-source modular estimation framework for mobile robots based on factor graphs. Their approach offers systematic methods to handle the complexities arising from multi-sensory systems with asynchronous data sources running at different frequencies. This framework was evaluated on state estimation for legged robots and on landmark-based visual-inertial SLAM for humanoids by Fourmy et al. [26].

Environment Understanding

Environment understanding is a critical area of research for humanoid robots, enabling them to effectively navigate through and interact with complex and dynamic environments. This field can be broadly classified into two key categories: 1. localization, navigation and planning for the mobile base, and 2. object manipulation and grasping.

Perception in Localization, Navigation and Planning

Localization focuses on precisely and continuously estimating the robot’s position and orientation relative to its environment. Planning and navigation involve generating optimal paths and trajectories for the robot to reach its desired destination while avoiding obstacles and considering task-specific constraints.

Localization, Mapping and SLAM

Localization and SLAM (simultaneous localization and mapping) rely primarily on exteroceptive sensors such as cameras and lasers, but often additionally use encoders and IMUs to enhance estimation accuracy.

Localization

Indoor environments are usually considered structured, characterized by the presence of well-defined, repeatable and often geometrically consistent objects. Landmarks can be uniquely identified by encoded vectors obtained from visual sensors such as depth or RGB cameras, allowing the robot to build up a visual map of the environment and then compare newly observed landmarks against a database to localize via object or landmark identification. In recent years, the use of handcrafted image features such as SIFT and SURF and of feature dictionaries such as the Bag-of-Words (BoW) model for landmark representation has been superseded by feature representations learned from large example sets, usually by variants of artificial neural networks such as convolutional neural networks (CNNs). CNNs have also outperformed classifiers such as support vector machines (SVMs) in deriving inferences [42, 43]. However, several rapidly evolving CNN architectures exist; Ovalle-Magallanes et al. [44] performed a comparative study of four such networks while successfully localizing in a visual map.
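The retrieval step underlying such landmark-based localization can be sketched as nearest-neighbor matching of global descriptors; here random vectors stand in for learned CNN embeddings, and the similarity threshold is an arbitrary assumption:

    import numpy as np

    # Sketch of place/landmark retrieval with global descriptors: compare the query
    # embedding against a database of stored descriptors by cosine similarity.
    rng = np.random.default_rng(0)
    database = rng.normal(size=(100, 256))                     # 100 stored place descriptors
    database /= np.linalg.norm(database, axis=1, keepdims=True)

    def localize(query_descriptor, min_similarity=0.8):
        q = query_descriptor / np.linalg.norm(query_descriptor)
        scores = database @ q                                  # cosine similarities
        best = int(np.argmax(scores))
        return (best, scores[best]) if scores[best] > min_similarity else (None, scores[best])

    query = database[42] + 0.01 * rng.normal(size=256)         # a revisited place, slightly perturbed
    print(localize(query))                                     # -> (42, ~1.0)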

The RoboCup Soccer League is popular in humanoid research due to the visual identification and localization challenges it presents. [45, 46] and [47] are examples of real-time, CNN-based ball detection approaches using RGB cameras developed specifically for RoboCup. Cruz et al. [48] could additionally estimate player poses, goal locations and other key pitch features using intensity images alone. Due to the low on-board computational power of the humanoids, others have used fast, low-power external mobile GPU boards such as the Nvidia Jetson to aid inference [47, 49].

Unstructured and semi-structured environments are encountered outdoors or in hazardous and disaster rescue scenarios. They offer a dearth of reliably trackable features and unpredictable lighting conditions, and they make gathering training data challenging. Thus, instead of features, researchers have focused on raw point clouds or on combining different sensor modalities to navigate such environments. Starr et al. [50] presented a sensor fusion approach combining long-wavelength infrared stereo vision and a spinning LiDAR for accurate rangefinding in smoke-obscured environments. Nobili et al. [51] successfully localized robots constrained by a limited field-of-view LiDAR in a semi-structured environment. They proposed a novel strategy for tuning outlier filtering based on point cloud overlap, which achieved good localization results in the DARPA Robotics Challenge Finals. Raghavan et al. [52] presented simultaneous odometry and mapping by fusing LiDAR and kinematic-inertial data from IMU, joint encoders, and foot F/T sensors while navigating a disaster environment.

SLAM

SLAM subsumes localization through the additional map construction and loop closing aspects, whereby the robot has to re-identify a place visited sometime in the past, match it to its current surroundings, and adjust its pose history and recorded landmark locations accordingly. A humanoid robot intended to share human workspaces needs to deal with moving objects, both rapid and slow, which could disrupt its mapping and localizing capabilities. Thus, recent works on SLAM have focused on handling the presence of dynamic obstacles in visual scenes. While the most popular approach remains sensor fusion [53, 54], purely visual approaches have also been proposed, such as [55], which introduced a dense RGB-D SLAM solution using optical flow residuals to achieve accurate and efficient dynamic/static segmentation for camera tracking and background reconstruction. Zhang et al. [56] took a more direct approach, employing deep learning-based human detection and graph-based segmentation to separate moving humans from the static environment. They further presented a SLAM benchmark dedicated to dynamic-environment SLAM solutions [57]; it includes RGB-D data acquired from an on-board camera on the HRP-4 humanoid robot, along with other sensor data. Adapting publicly available SLAM solutions and tailoring them for humanoid use is not uncommon: Sewtz et al. [58] adapted ORB-SLAM [59] to a multi-camera setup on the DLR Rollin' Justin system, while Ginn et al. [60] did so for the iGus, a midsize humanoid platform, targeting low computational demands.

Navigation and Planning

Navigation and planning algorithms use perception information to generate a safe, optimal and reactive path, considering obstacles, terrain, and other constraints.

Local Planning

Local planning or reactive navigation is generally concerned with local real-time decision-making and control, allowing the robot to actively respond to perceived changes in the environment and adjust its movements accordingly. Especially in highly controlled applications, rule-based, perception-driven navigation is still popular and yields state-of-the-art performance both in terms of time demands and task accomplishment. Bista et al. [61] achieved real-time navigation in indoor environments by representing the environment with key RGB images and deriving a control law based on common line segments and feature points between the current image and nearby key images. Regier et al. [62] determined appropriate actions based on a pre-defined set of mappings between object class and action, using a CNN to classify objects from monocular RGB vision. Ferro et al. [63] integrated information from a monocular camera, joint encoders, and an IMU to generate a collision-free visual servo control scheme. Juang et al. [64] developed a line follower able to infer forward, lateral and angular velocity commands from monocular RGB images using path curvature estimation and PID control. Magassouba et al. [65] introduced an aural servo framework based on auditory perception, enabling robot motions to be directly linked to low-level auditory features through a feedback loop.

We also see the use of a diverse array of classifiers to learn navigation schemes from perception information; their generalization capability allows adaptation to unforeseen obstacles and events in the environment. Abiyev et al. [66] presented a vision-based path-finding algorithm which segregates captured images into free and occupied areas using an SVM. Lobos-Tsunekawa et al. [67] and Silva et al. [68] proposed deep-learned visual (RGB) navigation systems for humanoid robots that achieve real-time performance: the former used a reinforcement learning (RL) system with an actor-critic architecture, while the latter utilized a decision tree of deep neural networks deployed on a soccer-playing robot.

Global Planning

These algorithms operate globally, taking into account long-term objectives and optimizing movements to minimize costs, maximize efficiency, or achieve a specific outcome on the basis of a perceived environment model.

Footstep Planning is a crucial part of humanoid locomotion and has generated substantial research interest in its own right. Recent works exhibit two primary trends related to perception. The first is providing humanoids with the capability of rapidly perceiving changes in the environment and reacting through fast re-planning. The second endeavors to segment and/or classify uneven terrains to find stable 6-DoF footholds for highly versatile navigation.

Fig. 3

Footstep planning on the humanoid Lola from [69]. Top left: The robot’s vision system and a human causing disturbance. Bottom right: The collision model with geometric obstacle approximations

Tanguy et al. [54] proposed a model predictive control (MPC) scheme that fuses visual SLAM and proprioceptive F/T sensing for accurate state estimation. This allowed rapid reaction to external disturbances through adaptive stepping, leading to balance recovery and improved localization accuracy. Hildebrandt et al. [69] used the point cloud from an RGB-D camera to model obstacles as swept-sphere volumes (SSVs) and step-able surfaces as convex polygons for real-time reactive footstep planning with the Lola humanoid robot. Their system was capable of handling rough terrain as well as external disturbances such as pushes (see Fig. 3). Others have also used geometric primitives to aid footstep planning, such as surface patches for foothold representation [70, 71], segmentation of the environment into step-able regions such as 2D plane segments embedded in 3D space [72, 73], or polygonal ground projections of obstacles [74]. Suryamurthy et al. [75] assigned pixel-wise terrain labels and rugosity measures using a CNN consuming RGB images for footstep planning on the CENTAURO robot.
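A recurring geometric primitive in these planners is testing whether a candidate foothold lies inside a step-able convex region; the minimal 2-D sketch below (which treats the foothold as a point rather than a full foot polygon) is illustrative and not taken from any of the cited planners:

    import numpy as np

    def inside_convex_polygon(point, vertices):
        """True if a 2-D point lies inside a convex polygon given in counter-clockwise order."""
        p, v = np.asarray(point), np.asarray(vertices)
        for i in range(len(v)):
            edge = v[(i + 1) % len(v)] - v[i]
            to_point = p - v[i]
            if edge[0] * to_point[1] - edge[1] * to_point[0] < 0.0:   # point lies right of this edge
                return False
        return True

    stepable_region = [(0.0, 0.0), (0.6, 0.0), (0.6, 0.4), (0.0, 0.4)]  # e.g., a detected plane segment
    print(inside_convex_polygon((0.3, 0.2), stepable_region))   # True: valid foothold candidate
    print(inside_convex_polygon((0.7, 0.2), stepable_region))   # False: outside the region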

Whole Body Planning in humanoid robots involves the coordinated planning and control of the robot's entire body to achieve an objective. Coverage planning is a subset of whole-body planning in which a minimal sequence of whole-body robot poses is estimated to completely explore a 3D space via robot-mounted visual sensors [76, 77]. Target finding is a special case of coverage planning in which the exploration stops when the target is found [78, 79]. These concepts are related primarily to view planning in computer vision. In other applications, Wang et al. [80] presented a method for trajectory planning and formation building of a robot fleet using local positions estimated from onboard optical sensors, and Liu et al. [81] presented a temporal planning approach for choreographing dancing robots in response to microphone-sensed music.

Perception in Grasping and Manipulation

Manipulation and grasping in humanoid robots involve the ability to interact with objects of varying shapes, sizes, and weights, and to perform dexterous manipulation tasks using sensor-equipped end-effectors which provide visual or tactile feedback for grip adjustment.

Grasp Planning

Grasp planning is a lower-level task specifically focused on determining the optimal manipulator pose sequence to securely and effectively grasp an object. Visual information is used to find grasping locations and also as feedback to minimize the difference between the target grasp pose and the current end-effector pose.

Schmidt et al. [82] utilized a CNN trained on object depth images and pre-generated analytic grasp plans to synthesize grasp solutions. The solution produced full end-effector poses and was not limited to the camera's view direction. Vezzani et al. [83] modeled the shape and volume of the target object, captured from stereo vision, in real time using superquadric functions, allowing grasping even when parts of the object were occluded. Vicente et al. [84] and Nguyen [85] focused on achieving accurate hand-eye coordination in humanoids equipped with stereo vision: the former compensated for kinematic calibration errors between the robot's internal hand model and captured images using particle-based optimization, while the latter trained a deep neural network to estimate the robot arm's joint configuration. [86] proposed a combination of CNNs and dense conditional random fields (CRFs) to infer action possibilities on an object (affordances) from RGB images.
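The superquadric representation relies on a standard implicit inside-outside function; a sketch of that test with arbitrarily chosen shape parameters is shown below, while the fitting and grasp synthesis of [83] are omitted:

    import numpy as np

    def superquadric_F(points, a1, a2, a3, e1, e2):
        """Standard superquadric inside-outside function in the object frame:
        F < 1 inside, F = 1 on the surface, F > 1 outside."""
        x, y, z = np.abs(points).T
        term_xy = (x / a1) ** (2.0 / e2) + (y / a2) ** (2.0 / e2)
        return term_xy ** (e2 / e1) + (z / a3) ** (2.0 / e1)

    # A roughly box-shaped superquadric, 6 x 6 x 16 cm (arbitrary illustrative parameters).
    pts = np.array([[0.0, 0.0, 0.0], [0.02, 0.02, 0.05], [0.10, 0.0, 0.0]])
    print(superquadric_F(pts, a1=0.03, a2=0.03, a3=0.08, e1=0.3, e2=0.3))

Evaluating this function on the reconstructed point cloud is what allows occluded parts of the object to be completed by the fitted analytic surface.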

Fig. 4

Left: A Nao humanoid equipped with artificial skin cells on the chest, hand, forearm, and upper arm. Right: Visualization of the skin cell coordinate frames on the Nao. Figure taken from [87]

Tactile sensors, such as pressure-sensitive skins or fingertip sensors, provide feedback about the contact (surface normal) forces, slip detection, object texture, and shape information during object grasping. Kaboli et al. [87] extracted tactile descriptors for material and object classification agnostic to various sensor types such as dynamic pressure sensors, accelerometers, capacitive sensors, and impedance electrode arrays. A Nao with artificial skin used for their experiments is shown in Fig. 4. Hundhausen et al. [88] introduced a soft humanoid hand equipped with in-finger integrated cameras and an in-hand real-time image processing system based on CNNs for fast reactive grasping.

Manipulation Planning

Manipulation planning involves the higher-level decision-making process of determining how the robot should manipulate an object once it is grasped. It generates a sequence of motions or actions which is updated based on the continuously perceived state of the robot and the grasped object.

Deep recurrent neural networks (RNNs) are capable of predicting the next element of a sequence from the previous elements. This property is exploited in manipulation planning by breaking down a complex task into a series of manipulation commands generated by RNNs based on past commands. Such networks can map features extracted from a sequence of RGB images, usually by CNNs, to a sequence of motion commands [89, 90]. Inceoglu et al. [91] presented a multimodal failure monitoring and detection system for robots which integrates high-level proprioceptive, auditory, and visual information during manipulation tasks. Robot-assisted dressing is a challenging manipulation task that has been addressed by multiple authors. Zhang et al. [92] utilized a hierarchical multi-task control strategy to adapt the applied forces of the humanoid robot Baxter, measured using joint torques, to the user's movements during dressing. By tracking the subject's pose in real time using capacitive proximity sensing with low latency and high signal-to-noise ratio, Erickson et al. [93] developed a method to adapt to human motion and adjust for errors in pose estimation during dressing assistance by the PR2 robot. Zhang et al. [94] computed suitable grasping points on garments from depth images using a deep neural network to facilitate manipulation in robot-assisted dressing tasks.
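A schematic of the CNN-plus-RNN visuomotor mapping mentioned above is sketched below in PyTorch; the architecture, dimensions, and command space are illustrative assumptions rather than the networks of [89] or [90]:

    import torch
    import torch.nn as nn

    class VisuomotorRNN(nn.Module):
        """Schematic CNN encoder + LSTM mapping an RGB image sequence to motion commands."""

        def __init__(self, command_dim=7, hidden_dim=128):
            super().__init__()
            self.encoder = nn.Sequential(                     # small per-frame feature extractor
                nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
            )
            self.rnn = nn.LSTM(input_size=128, hidden_size=hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, command_dim)    # e.g., an end-effector velocity command

        def forward(self, images):                            # images: (batch, time, 3, H, W)
            b, t = images.shape[:2]
            feats = self.encoder(images.flatten(0, 1)).view(b, t, -1)
            out, _ = self.rnn(feats)                          # hidden state carries past context
            return self.head(out)                             # one command per time step

    model = VisuomotorRNN()
    commands = model(torch.randn(2, 10, 3, 64, 64))           # 2 sequences of 10 frames
    print(commands.shape)                                     # torch.Size([2, 10, 7])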

Human-Robot Interaction

Human-robot interaction is a subset of environment understanding which deals with interactions with humans as opposed to inanimate objects. To achieve this, a robot needs diverse capabilities, ranging from detecting humans and recognizing their pose, gestures, and emotions, to predicting their intent and even proactively performing actions to ensure a smooth and seamless interaction.

There are two main challenges to perception in HRI: perception of users, and inference, which involves making sense of the perceived data and making predictions.

Perception of Users

This involves identifying humans in the environment and detecting their pose, facial features, and the objects they interact with. This information is crucial for action prediction and emotion recognition [95]. Robots rely on vision-based, audio-based, tactile-based, and range sensor-based sensing techniques for detection, as explained in the survey on perception methods for social robots by [96].

Robinson et al. [97] showed how vision-based techniques have evolved from using facial features, motion features, and body appearance to deep learning-based approaches. Motion-based features separate moving objects from the background to detect humans. Body appearance-based algorithms use shape, curves, posture, and body parts to detect humans. Deep learning models like R-CNN, Faster R-CNN, and YOLO have also been applied for human detection [96].

Pose detection is essential for understanding human body movements and postures. Sensors such as RGB cameras, stereo cameras, depth sensors, and motion tracking systems are used to extract pose information, as explained in detail by Möller et al. [98] in their survey of human-aware robot navigation. Facial features play a significant role in pose detection as they provide additional points of interest and enable emotion recognition [99]. A notable demonstration of detecting pose and using it for bi-manual robot control with an RGB-D range sensor was given by Hwang et al. [100]. The system employed a CNN from the OpenPose package to extract human skeleton poses, which were then mapped to drive robotic hands; the method was implemented on the CENTAURO robot and performed box and lever manipulation tasks in real time. They also presented a real-time pose imitation method for a mid-size humanoid robot equipped with a servo-cradle-head RGB-D vision system, in which eight pre-trained neural networks captured and imitated 3D motions performed by a target human, enabling effective pose imitation and complex motion replication on the robot. Lv et al. [101] presented a motion synchronization method called GuLiM for the teleoperation of medical assistive robots, particularly in the context of combating the COVID-19 pandemic. Li et al. [102] presented a multimodal mobile teleoperation system that integrates a vision-based hand pose regression network and an IMU-based arm tracking method. The system allows real-time control of a robot hand-arm system using depth camera observations and IMU readings from the observed human hand, enabled by the Transteleop neural network, which generates robot hand poses from a depth image of a human hand.

Audio communication is vital for human interaction, and robots aim to mimic this ability. Microphones are used for audio detection, and speakers reproduce sound. Humanoid robots are usually designed to be binaural, i.e., they have two separate microphones, one on either side of the head, which receive transmitted sound independently. Several researchers have exploited this property to localize both the sound source and the robot in complex auditory environments. Such techniques are used for speaker localization as well as for other semantic understanding tasks such as automatic speech recognition (ASR), auditory scene analysis, emotion recognition, and rhythm recognition [96, 103].
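A basic building block of binaural localization is estimating the interaural time difference (ITD) and converting it to a direction of arrival; the sketch below uses plain cross-correlation and a free-field geometric model with an assumed head width, which is far simpler than the cited methods:

    import numpy as np

    # Sketch of binaural direction-of-arrival estimation: recover the interaural time
    # difference (ITD) by cross-correlating the two microphone signals, then convert it
    # to an azimuth with a free-field model (positive azimuth = source toward the left).
    fs, c, mic_distance = 16000, 343.0, 0.18     # sample rate [Hz], speed of sound [m/s], ear spacing [m]

    def estimate_azimuth_deg(left, right):
        corr = np.correlate(left, right, mode="full")
        lag = int(np.argmax(corr)) - (len(right) - 1)
        itd = -lag / fs                           # seconds by which the left channel leads
        sin_az = np.clip(itd * c / mic_distance, -1.0, 1.0)
        return np.degrees(np.arcsin(sin_az))

    # Synthetic test: the same noise burst reaches the left microphone 4 samples earlier.
    rng = np.random.default_rng(1)
    burst = rng.normal(size=512)
    left = np.concatenate([burst, np.zeros(4)])
    right = np.concatenate([np.zeros(4), burst])
    print(round(estimate_azimuth_deg(left, right), 1), "deg")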

Benaroya et al. [104] employed non-negative tensor factorization for binaural localization of multiple sound sources within unknown environments. Schymura et al. [105] focused on combined audio-visual speaker localization and proposed a closed-form solution to compute dynamic stream weighting between the audio and visual streams, improving state estimation in a reverberant environment. This study was later extended to incorporate dynamic stream weights into nonlinear dynamical systems, which improved speaker localization performance even further [106]. Dávila-Chacón et al. [107] used a spiking and a feed-forward neural network for sound source localization and ego-noise removal, respectively, to enhance ASR in challenging environments. Trowitzsch et al. [108] presented a joint solution for sound event identification and localization, utilizing spatial audio stream segregation in a binaural robotic system.

Ahmad et al. [109], in their survey on physiological signal-based emotion recognition, showed that physiological signals from the human body, such as heart rate, blood pressure, body temperature, brain activity, and muscle activation, can provide insights into emotions. Tactile interaction is an inherent part of natural interaction between humans, and the same holds true for robots interacting with humans. The type of touch can be used to infer many things, such as the human's state of mind, the nature of the object, and what is expected of the interaction [96]. Two main kinds of tactile sensors are used for this purpose: sensors embedded in the robot's arms and grippers, and cover-based sensors used to detect touch across entire regions or the whole body [96]. Khurshid et al. [110] investigated the impact of grip-force, contact, and acceleration feedback on human performance in a teleoperated pick-and-place task. Results indicated that grip-force feedback improved stability and delicate control, while contact feedback improved spatial movement but its benefit may vary depending on object stiffness.

Inference

An important aspect of inference over the detected data from the previous section is aligning the perspectives of the user and the robot. This allows the robot to better understand the user's intent regarding the objects or locations they are looking at. This skill is called perspective taking and requires the robot to consider and understand other individuals through their motivations, dispositions, and context. Paired with a shared knowledge base, it allows humans and robots to build a reliable theory of mind and collaborate effectively during various types of tasks [3].

Bera et al. [111] proposed an emotion-aware navigation algorithm for social robots which combined emotions learned from facial expressions and walking trajectories using an onboard and an overhead camera respectively. The approach achieved accurate emotion detection and enabled socially conscious robot navigation in low-to-medium-density environments.

Table 2 A non-exhaustive, indicative list of popular humanoid robot models used by different publications

Conclusion

Substantial progress has been made in all three principal areas discussed in this survey. In Table 2 we compile a list of the most commonly cited humanoids in the literature, organized according to the aforementioned categorization. We conclude with a summary of the trends and the possible areas of further research we observed in each of these areas.

State Estimation

Tightly-coupled formulations of state estimation based on MAP optimization seem promising for future work, as they offer several advantages: modularity, seamless integration of new sensor types, and the extension of generic estimators to accommodate a wider range of perception sources toward a whole-body estimation framework. By integrating high-rate estimation for control with drift-free localization based on SLAM, such a framework could provide real-time estimates for locomotion control and facilitate gait and contact planning.

Another important area of focus is the development of multi-contact detection and estimation methods for arbitrary, unknown contact locations. By moving beyond rigid-segment assumptions for the humanoid structure and augmenting robots with additional sensors, such as strain gauges that directly measure segment deflections, multi-contact detection and the compensation of modeling errors can lead to more accurate state estimation and improved human-robot interaction.

Environment Understanding

With the availability of improved inference hardware, learning techniques are increasingly being applied in localization, object identification, and mapping, replacing handcrafted feature descriptors. However, visual classifiers like CNNs struggle with unstructured "stuff" compared to regularly shaped objects, necessitating memory-intensive representations such as point clouds and enhanced classifier capabilities. In SLAM, which has robust solutions for static environments, research is focused on handling dynamic obstacles, favoring multi-sensor fusion for increased robustness. Scalability and real-time capability remain challenging, since wrangling multiple data streams over long sequences can overload a humanoid's onboard computer. Footstep planning shows a trend towards rapid environment modeling for quick responses, but consistent modeling of dynamic obstacles remains an open challenge. Manipulation and long-term global planning also rely on learning techniques to adapt to unforeseen constraints, requiring representations or embeddings of the high-dimensional interactions between perceived elements to reduce complexity. Finding more efficient, comprehensive, and accurate ways to express these relationships is an ongoing challenge.

Human Robot Interaction

Research in the field of HRI has focused on understanding human intent and emotion through various elements such as body pose, motions, expressions, audio cues, and behavior. Though this may seem natural and trivial from a human’s perspective, it is often a very challenging task to incorporate the same into robotic systems. Despite considerable progress in the above approaches, the ever-changing and unpredictable nature of human interaction necessitates additional steps that incorporate concepts like shared autonomy and shared perception. In this context, contextual information and memory play a crucial role in accurately perceiving the state and intentions of the humans with whom interaction is desired. Current research endeavors are actively focusing on these pivotal topics, striving to enhance the capabilities of humanoid robots in human-robot interactions while also considering trust, safety, explainability, and ethics during these interactions.