1 Introduction

In humans, vision is the most important source of information for object perception. However, haptic feedback is crucial, too. The challenges posed by the absence of vision can be easily experienced by anyone simply by trying to perform daily tasks blindfolded or in the dark. Experiencing the lack of haptic perception is less common. Frigid fingers, caused by either coldness (e.g., frostnip or frostbite) or specific health conditions (e.g., anaemia), are one example; wearing thick gloves is another, although the impairment is less evident. Early experiments by Westling and Johansson (1984) showed how simple manipulation tasks, such as lighting a match, become almost impossible if tactile feedback is removed by temporarily anaesthetizing the fingertips.

The situation is similar for robots. While vision is a primary source of information, some important object properties, such as weight, material, or texture, cannot be perceived using (only) vision. Imagine a robot sorting boxes based on whether they are empty or not, without inspecting their content. Such a robot can only do this job if it can perceive the weight of the boxes, both to adjust the grip force (in combination with the perceived friction coefficient, i.e., by feeling the texture) and to correctly classify the boxes. In addition, even for properties that are well detected by vision, such as the position or shape of an object, there are cases in which relying on this sensory modality alone is limiting, for example in settings with unpredictable changes in lighting conditions, or when dealing with translucent, reflective, or occluded objects. Relying on multiple sensory modalities can help resolve these perceptual ambiguities.

The idea of integrating vision and touch was first proposed by Allen (1984) to generate descriptions of object surfaces. Allen (1988) extended this idea to encompass the whole object recognition task. Since then, much work has been done on recognizing and manipulating objects based on a single modality, i.e., either vision or haptics alone (see, e.g., Zhao et al. (2019); Fanello et al. (2017); Guo et al. (2016); Du et al. (2021) for extensive overviews of visual object perception and Seminara et al. (2019); Luo et al. (2017); Kappassov et al. (2015) for extensive overviews of haptic object perception). Despite the significant progress achieved with either visual or haptic information alone, their combination has attracted comparatively less attention, e.g., Liu et al. (2017a); Yang et al. (2015).

Usually, in machine learning applications, visual and haptic perception are treated as two separate processes that converge at some point to a final classification result, e.g., Liu et al. (2020); Cui et al. (2020). However, in the brain, interactions between vision and touch take place in the cerebral cortex (Lacey & Sathian, 2016). These interactions can be crossmodal, meaning that haptic stimuli activate regions traditionally believed to be visual, or multimodal, in which case the visual and the haptic stimuli converge.

This article presents a holistic overview of multimodal object perception for robots from both a bio-inspired and a technical point of view. Firstly, the biological basis of visuo-haptic object perception is introduced. Secondly, a summary of tactile sensors and multimodal datasets is provided. Thirdly, the computational challenges of multimodal signal processing are presented. Fourthly, the main application areas are introduced and reviewed, including multimodal object recognition, peripersonal space representation, and object manipulation. Finally, challenges and future directions for research on artificial visuo-haptic object perception are discussed.

2 Neural basis of visuo-haptic object perception

The fact that no learning algorithm yet reaches the proficiency of the human brain at recognizing objects illustrates how complex this cognitive task actually is (Smith et al., 2018; Krüger et al., 2013; James et al., 2007). The human brain performs it both quickly and accurately, even when the available visual information is incomplete or ambiguous. One reason might be that the brain can complement that ‘picture’ with information from other sensory modalities at will; usually, it does this with haptics. Another is that the learning machinery in the human brain seems suited to learning from drastically different frequency distributions than those used in machine learning, as described by Smith et al. (2018). In particular, infants seem to use curriculum learning constrained by their developing sensorimotor abilities and actions. In strong contrast with machine learning algorithms, the learning machinery, at least in infants, is particularly effective at learning from extremely skewed frequency distributions, i.e., a very small number of instances are highly frequent while most other instances are encountered very rarely. For instance, in very young infants, more than 80% of the faces they are exposed to belong to 2-3 individuals (Smith et al., 2018).

We argue that taking inspiration from the complementary nature of the sensory modalities, as well as from the processes in the brain that fuse the information they provide during object perception, might help build better robotic systems. While this topic is an active area of research and considerable new insights have been gained, many aspects of the inner workings of the human brain during object perception are still not fully understood.

In this section, we present a short review of what is known about visuo-haptic object perception and recognition in the brain (or, more specifically, in the cerebral cortex). Given the complexity of this topic and the abundance of available research, we focus on the main organizational and functional principles that can serve as a basis for computational modelling.

2.1 Visual object perception

For every basic sense, a primary sensory area can be identified in the cerebral cortex, the earliest cortical area in the brain’s outer layer to process the sensory stimuli coming from the respective receptors. For vision, that area, the primary visual cortex (V1) (Krüger et al., 2013; Grill-Spector & Malach, 2004; Malach et al., 1995), is located at the back of the brain, in what is referred to as the occipital lobe.

The neurons here are organized in a way that allows neighbouring regions in the retina, and hence in the visual input, to be projected onto neighbouring areas in V1. Retinotopic maps emerge from this orderly arrangement in V1 and in the subsequent low-level visual areas, to which the output of this processing of very primitive visual features is forwarded.

The hierarchical organization of the visual cortical areas, together with the receptive field sizes of the neurons gradually increasing with each new area along this hierarchy, turns the visual information into increasingly complex and abstract representations (Ungerleider & Haxby, 1994; Krüger et al., 2013; Grill-Spector & Malach, 2004). This hierarchical organization is the computational inspiration for convolutional neural networks (CNNs) (Fukushima, 1980; LeCun et al., 2015).

Hierarchical organization aside, the processing of the visual stimuli following V1 has been found to diverge into two main pathways or streams (Ungerleider & Haxby, 1994; Mishkin et al., 1983), see Fig. 1. One stream runs ventrally, extending into the temporal lobe of the cortex, and is responsible for the visual identification of objects, while the other runs dorsally, reaching into the parietal lobe, and enables the visual location of and spatial relations among objects (Mishkin et al., 1983). The ventral and dorsal streams are, therefore, also called the “what” and “where” pathway, respectively. A modification to this model was later introduced to distinguish between “vision for perception” and “vision for action” and to emphasize that the dorsal stream also coordinates visually guided actions directed at objects (Goodale & Milner, 1992). Hence, these streams are alternatively referred to as “perception” and “action” pathways. The overall model became known as the two visual systems (TVS) model (Rossetti et al., 2017; Milner, 2017; de Haan et al., 2018; Goodale & Milner, 2018).

The idea that the neural substrates underlying each visual processing stream are distinct was initially proposed by Goodale et al. (1991); Goodale and Milner (1992) and has been widely accepted since. However, it has lately become the subject of controversy for being oversimplified (de Haan & Cowey, 2011; Sheth & Young, 2016; Rossetti et al., 2017; de Haan et al., 2018). There is evidence for cross-talk between the two streams: ventral to dorsal, when information about the object and its qualities is required to plan and fine-tune a grasping action (Perry & Fallah, 2014; van Polanen & Davare, 2015; Milner, 2017), and dorsal to ventral, when updated grasp-related information helps refine the 3D perception and possibly the internal representation of objects (van Polanen & Davare, 2015; Freud et al., 2016; Milner, 2017). Nevertheless, the TVS model has inspired a considerable amount of research in this area and hence remains influential (de Haan et al., 2018; Goodale & Milner, 2018).

Fig. 1
figure 1

The dorsal and ventral streams originate from the primary visual cortex (V1). The arrow from the right to the top left represents the dorsal stream, and the arrow from the right to the bottom left represents the ventral stream. Adapted from Young et al. (2013) CC BY 4.0

Zooming in on the perception pathway, the division into functional streams seems to be a recurring pattern in the cortex, as evidence suggests a further specialization into sub-streams here, one dedicated to object form and another to surface properties (Cant et al., 2009; Cant & Goodale, 2007). The posterior-lateral regions of the occipito-temporal part of the cerebral cortex, including the lateral occipital area (LO), were shown to contribute to the perception of object form. Meanwhile, the more medial parts of the ventral stream handle the perception of object surface properties like texture or colour. In particular, areas along the collateral sulcus (CoS) have been found to respond specifically to texture. In contrast, an analogous area for colour could not be identified: the processing of information related to surface colour is believed to occur relatively early along the ventral stream compared to surface texture. In general, areas showing form selectivity appear to overlap with those involved in object recognition and identification. Similarly, areas selective to object surface properties seem to overlap with the fusiform gyrus (FG), an area in the temporal lobe responsible for the perception of more complex stimulus categories like faces and places (Cant & Goodale, 2007).

Further studies have confirmed and added to these findings (Cavina-Pratesi et al., 2010a, b). Accordingly, there is not one single cortical area but multiple interacting foci in the medial ventral stream region that infer the material properties of perceived objects from extracted individual surface properties. A texture-selective area appears to be located posterior to a colour-selective one. Also, areas showing responsiveness to multiple object properties were detected next to areas of dedicated single-feature processing (Cavina-Pratesi et al., 2010a, b).

Overall, visual information is represented at three different levels of abstraction in the cerebral cortex along the ventral visual stream: between retinotopy and stimulus categories (objects, faces, places, etc.), there is an intermediate level of representation based on geometric and material properties (Cavina-Pratesi et al., 2010a). This hierarchical functional organization is advantageous (Krüger et al., 2013): using separate but highly interconnected channels for processing different types of visual information (colour, shape, etc.) allows for representations that are both robust against missing cues and efficient, because the combinatorial explosion and the resulting lack of generalization to new objects that an integrated representation would cause are prevented.

2.2 Prehension of objects

Object perception benefits greatly from performing exploratory procedures (EPs) on an object of interest, for instance, to observe its different sides or to perceive non-visual features. For that, we first reach towards the object, i.e., move our hand close to its location, and then grasp it, which involves pre-shaping our hand according to the object’s physical properties and selecting the optimal grip type. The capacity to reach and grasp objects is also more generally referred to as prehension (Turella & Lingnau, 2014).

Initially, it was thought that the detailed organization of the dorsal stream reflects these two components of prehension, again in the form of independent pathways, as in the case of the ventral stream (see Sect. 2.1). According to this classical model, one pathway comprises the more laterally located areas of the dorsal stream and controls grasping, whereas the medial areas form the other pathway, which is recruited during reaching. Hence, these two pathways are also called the dorsolateral and dorsomedial pathways, respectively (Fattori et al., 2010; Turella & Lingnau, 2014; Rizzolatti & Matelli, 2003).

Later on, it was shown that this initial model has limitations: Fattori et al. (2010), for instance, offer evidence that the dorsomedial pathway is not only for reaching and that it may play a central part in all phases of reach-to-grasp actions. In their review on the coding of prehension in the brain, Turella and Lingnau (2014) conclude that the coding of grasping, maybe even its integration with reaching, seems to happen in both pathways and that the temporal difference in the onset of processing suggests that the processing in the dorsomedial pathway is driven by the dorsolateral one. The authors argue that this aspect could yield a more fitting functional characterization of the pathways than grasping versus reaching: there is strong evidence that the dorsolateral pathway is in charge of creating an action plan, while the dorsomedial one follows with online adjustments.

More recent findings support that the role of the dorsomedial pathway goes beyond online control and adjustment during prehension: it has been suggested that the early dorsomedial areas are involved in the biomechanical selection of viable grasp postures during reach-to-grasp behaviours (Galletti & Fattori, 2018) and even earlier, that is, in preparation for action execution (Santandrea et al., 2018).

2.3 Importance of haptics for object perception

Although we primarily rely on our vision for object perception and recognition, we may occasionally use our other senses in the face of very ambiguous, and hence difficult, cases. The sensory modality that we then typically resort to is haptics, which is complementary to vision in many regards. With our vision, we are capable of perceiving multiple object properties at one glance, whereas haptic perception can involve a sequence of steps to accomplish the same (Lederman & Klatzky, 1987). Our eyes may sometimes provide access to only a limited perceptual space, be it due to visual impairments or the conditions in our environment. In such cases, our skin, as our largest sensory organ, combined with active touch and exploration, can help us enlarge that space and perceive what we otherwise would not be able to. That is because the sets of visually and haptically perceivable object properties are largely complementary.

Lederman and Klatzky (1987) have identified patterns for how objects are typically explored manually. These patterns are referred to as exploratory procedures (EPs) (Lederman & Klatzky, 1987, 2009). The EPs can be roughly divided into three categories: those related to the substance of an object (texture, hardness, temperature, and weight), those related to the structural properties of an object (global shape, exact shape, volume, and weight), and those for discovering the function of an object (finding the movable parts, deducing the potential function based on its form). Examples of the exploratory procedures for the first two categories are shown in Fig. 2.

Fig. 2
figure 2

Illustration of six exploratory procedures, as described by Lederman and Klatzky (2009). From left to right and top to bottom: Contour Following, Pressure, Enclosure, Unsupported Holding, Static Contact, and Lateral Motion. Adapted from Nelinger et al. (2015) CC BY 3.0

There are eight EPs in total (Lederman & Klatzky, 1987): an object’s texture can be explored using the lateral motion EP, where the fingers or other parts of the skin are moved along its surface. With the pressure EP, which can manifest itself in either a poking or tapping movement, the hardness of an object can be tested. The static contact EP is for feeling the object’s temperature by briefly and passively touching its surface. Using the unsupported holding EP, an object’s weight can be inferred from the effort needed to balance the object at a certain height. An object’s global shape and volume can be sensed with the help of the enclosure EP, which involves placing the hands around the object to cover as much of its surface as possible, repeatedly if needed, and positioning the hands differently each time. During the contour following EP, the object’s contours are traced, which allows the local shape or volume of an object to be perceived in more detail. The part motion test EP is used to detect to which extent object parts move when force is applied to them, while the function test EP examines what functions an object can potentially fulfil by randomly interacting with it.

2.4 Haptic object perception

We usually (and intuitively) think of haptic perception as anything we can perceive using our touch sense, i.e., our skin. The skin is innervated with receptors that can be divided into three groups based on their function (Purves et al. 2012, Chap. 9): mechanoreceptors react to mechanical pressure or vibration and thermoceptors to changes in temperature, whereas nociceptors create the sensation of pain in the case of powerful stimuli that could be damaging, see Fig. 3.

However, proprioception, the sense of self-movement and body position perceived from stimuli originating from receptors embedded in the muscles, joints, and tendons (Lederman & Klatzky, 2009; Dahiya & Valle, 2013), often also called kinesthesia, plays an essential role in the haptic perception of objects. An object property that shows the relevance of the kinesthetic sense is shape (Lederman & Klatzky, 2009): what helps us determine an object’s shape is the alignment of the bones and the stretching of our muscles when we enclose it with our hands. Similarly, when we are prompted to describe the shape of an object, we tend to demonstrate it with hand poses.

Fig. 3
figure 3

Primary mechanoreceptors in the human skin. Merkel’s cells respond to light touch. Meissner’s corpuscles respond to touch and low-frequency vibrations. Ruffini endings respond to deformations and warmth. Pacinian corpuscles respond to transient pressure and high-frequency vibrations. Krause end bulbs respond to cold. Image from Clark et al. (2020) CC BY 4.0

The primary sensory area for haptic perception is the primary somatosensory cortex (S1) (Purves et al. 2012, Chap. 9), (James et al., 2007). It is located in the parietal lobe, in the so-called postcentral gyrus, and is composed, from anterior to posterior, of Brodmann area 3 (further subdivided into 3a and 3b), area 1, and area 2, see Fig. 4. S1 is organized somatotopically across all Brodmann areas. Like retinotopy, somatotopy is a form of topographical organization, resulting in a map of the complete body in each Brodmann area, though not in actual proportion: the area dedicated to each body part in S1 directly reflects the density of receptors in it. The feet, legs, trunk, forelimbs, and face are represented from medial to lateral in these somatotopic maps, see Fig. 5.

Fig. 4
figure 4

Somatosensory cortex. The primary somatosensory cortex (S1) consists of Area 1 (blue), Area 2 (green), Area 3a (orange), and Area 3b (yellow). The secondary somatosensory cortex (S2) is depicted in red. Image derived from Selket under CC BY-SA 3.0 and based on Purves et al. (2012, p. 202)

Fig. 5
figure 5

The cortical sensory Homunculus. A representation of the human body based on the proportions of the cortical regions dedicated to processing sensory functions. Image from Young et al. (2013) CC BY 4.0

As in vision, the processing of the somatic sensations occurs hierarchically: each area receives information from the periphery, but areas 1 and 2 also receive input from 3a and 3b. Most of the initial processing of the somatosensory input happens in area 3, where area 3a is concerned explicitly with proprioceptive and area 3b with cutaneous stimuli. Because area 3b is densely connected to areas 1 and 2, the extracted cutaneous information is forwarded to these areas for higher-level processing. Here, area 1 seems to be in charge of texture discrimination, and area 2, involving proprioceptive stimuli, of size and shape discrimination.

The functional divergence into separate pathways might not be specific only to the visual system. The somatosensory system may be organized similarly, with two or potentially even more pathways (Sathian et al., 2011; James & Kim, 2010), though different views exist on this matter, see James and Kim (2010) for a review. Object-related haptic activation has been detected outside the somatosensory cortex in multiple areas along the ventral visual pathway. The lateral occipital complex (LOC) was found to respond selectively to object features in both vision and haptics (Malach et al., 1995). In particular, a subregion of the LOC called the lateral occipital tactile-visual region (LOtv) appears to be a bimodal convergence area concerned with the recovery of the geometric shape of objects (Amedi et al., 2001, 2002; Tal & Amedi, 2009). Haptic activation, although not bimodal in nature, was also detected in the medial occipitotemporal cortex in response to surface texture (Podrebarac et al., 2014; Whitaker et al., 2008). This area is close to the one along the CoS concerned with visual texture perception but still spatially distinguishable. The representation of texture information in the visual and haptic modalities thus differs from that of shape information. However, the processing might not be entirely independent: the proximity of both areas might, in fact, enable cross-modal interaction.

The representation of object weight is also located in the medial ventral visual pathway (Gallivan et al., 2014; Kentridge, 2014), which might explain our ability to associate a certain weight with an object based solely on what we perceive visually, without having actually explored it haptically. It also gives rise to the assumption that other properties, such as object hardness, are dealt with similarly.

2.5 Integration of visual and haptic experiences

The reliability of each sensory modality plays a crucial role in how our brain weighs and combines our visual and haptic experiences of an object into more abstract and meaningful concepts (Helbig & Ernst, 2007; Ernst & Banks, 2002). We are not born with this ability; it emerges and matures as we live and accumulate experiences of the world. While we do so, the neurons in our brain organize among themselves, a process which has been termed input-driven self-organization (Miikkulainen et al., 2005).
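
This reliability weighting is commonly formalized as maximum-likelihood (minimum-variance) cue combination (Ernst & Banks, 2002). In this scheme, the combined estimate \(\hat{S}\) of an object property is a weighted average of the visual estimate \(\hat{S}_V\) and the haptic estimate \(\hat{S}_H\), with weights inversely proportional to the variances \(\sigma_V^2\) and \(\sigma_H^2\) of the unimodal estimates:

\[
\hat{S} = w_V \hat{S}_V + w_H \hat{S}_H, \qquad w_V = \frac{1/\sigma_V^2}{1/\sigma_V^2 + 1/\sigma_H^2}, \qquad w_H = 1 - w_V .
\]

The variance of the combined estimate, \(\sigma^2 = \sigma_V^2 \sigma_H^2 / (\sigma_V^2 + \sigma_H^2)\), is never larger than that of either modality alone, so the less reliable modality receives a smaller weight but still contributes.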

The integration of multiple sensory modalities at the level of a single neuron has been studied in the cat superior colliculus (Stein et al., 2014). Newborn cats can already detect certain cross-modal correspondences, but the ability to integrate information from different senses develops after birth. The underlying neural circuitry adapts to the cross-modal experiences of the environment while optimizing the multisensory integration capabilities. This learning process does not wait for the contributing unisensory systems to fully mature; the unisensory perceptual skills and the ability to integrate information from multiple senses develop in parallel.

There is much to suggest that self-organization among neurons is a fundamental principle of how the brain functions. One example is the neurons in the primary visual cortex that learn selectivity for certain features like orientation and colour and form distinct cortical feature maps (Miikkulainen et al., 2005). The coarse structure of these feature maps is predetermined even before birth by retinotopy, while the more granular structure is shaped by visual experience after birth. The first few weeks seem especially critical: experiments have shown that depriving kittens of typical visual experience at this stage of their development can cause permanent physiological effects, even blindness (e.g., Hubel and Wiesel 1970; Blakemore and Cooper 1970; Blakemore and Van Sluyters 1975). The somatic sensory maps develop in a similar manner, possibly starting with the first body movements while still in the womb (Mountcastle, 2005).

A behavioural study performed by Gori et al. (2008) offers the most important evidence thus far on the role of input-driven self-organization in our acquisition of visuo-haptic integration capabilities. They found that a human’s ability to integrate visual and haptic inputs related to object form becomes statistically optimal between the ages of 8 and 10. The weight that children below that age range assign to either modality often does not correspond to its reliability in a particular situation. Further, perceptual illusions, such as the rubber hand illusion (RHI), indicate that the temporal co-occurrence of unimodal experiences is what triggers the creation of associative links between the sensory modalities (Botvinick & Cohen, 1998). The likelihood of stimuli from the two modalities being integrated increases if it is known that they originate from the same object or are otherwise spatially related (Helbig & Ernst, 2007).

2.6 Organizational principles

We do not have a complete picture of how object perception works in the brain and how visual and haptic cues are combined to accomplish object-related tasks. However, we can derive some basic principles from the evidence presented above that could help us build robots with human-like proficiency in object perception:

Hierarchical processing: Object recognition and identification are performed by the ventral visual pathway, which starts in the occipital lobe and reaches down to the temporal lobe in the cerebral cortex. The processing of the visual input occurs in a hierarchical fashion along this pathway, with increasingly complex and abstract features being extracted.

Separate substreams for object shape and material perception: Some areas along the ventral pathway are responsive to haptic stimuli. Bimodal activation has been detected in the LOC, which is in charge of perceiving the geometric shape of objects. Neighbouring and sometimes crossmodally interacting foci specialized in the processing of material properties were identified in more medial areas of the ventral pathway, specifically along the CoS. This evidence supports the idea that the ventral pathway is further organized into two substreams for object shape and material perception, stretching across the more lateral and more medial areas, respectively.

Input-driven self-organization: The ability to integrate the visual and haptic input in a statistically optimal way is not innate but emerges only after birth as we experience the world around us. Here, unimodal stimuli’s temporal and spatial co-occurrence serves as a trigger for multimodal integration.

3 Multimodal object perception in robots

The previous section presented some organizational and functional principles that enable visuo-haptic object perception and recognition in the brain. The following sections cover the sensory and computational aspects used for visuo-haptic object perception and recognition in robots and other artificial systems and indicate how they relate to their biological counterparts. We start with a brief overview of visual sensors, follow with tactile sensors, and continue with data collection and datasets.

3.1 Visual sensors

Visual sensors or cameras are ubiquitous nowadays and designed to create images that are interpretable by humans. Although their working principle has been perfected in the past two hundred years (Brady et al., 2018), the field continues to evolve. However, due to the abundance of material for visual sensors and their applications, we will provide only a short overview of the most common technologies used in robotic applications before moving on to the less established tactile sensing technologies.

Cameras capturing visible light (400-700nm) have become commodities. Most research and applications in robotics and computer vision rely on the greyscale or RGB images obtained with these cameras. However, such cameras have been optimized for human interpretation rather than for computer vision and robotics. Moreover, their performance is significantly impacted by environmental conditions such as illumination intensity and direction, fog, haze, and smoke (Gade & Moeslund, 2014). Thus, specialized solutions optimized for computation are needed. Alternatives include RGB-D cameras, thermal cameras (Gade & Moeslund, 2014), parallel cameras (Brady et al., 2018), and event cameras (Gallego et al., 2022).

Nowadays, some of the most common sensors used for visual perception in robotics are consumer-grade RGB or RGB-D cameras. RGB-D cameras provide a visible light (RGB) image and a depth image used for the 3D perception of a scene. These cameras produce depth images using near-infrared (NIR) light projection (750-1400nm) and different working principles, such as time-of-flight (ToF) for the Microsoft Kinect v2, structured-light (SL) for the Asus Xtion Pro Live, and active stereo vision (ASV) for the Intel Realsense R200 cameras (Kuan et al., 2019).

Thermal cameras capture infrared radiation. Although initially developed as a surveillance and night vision tool for the military, as the technology has matured and the price has dropped, their use has expanded to other fields of application such as robotics (Gade & Moeslund, 2014).

More recently, event cameras have also become popular in robotics research. They are bio-inspired sensors that asynchronously measure per-pixel changes and output a stream of events that encode the changes’ time, location and sign. This operation principle translates to high temporal resolution, very high dynamic range, low power consumption, and high pixel bandwidth, which are attractive properties for mobile robotics, augmented and virtual reality (AR/VR), and video game applications (Gallego et al., 2022).
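
To make this output format concrete, the sketch below accumulates a hypothetical stream of events, each a tuple of timestamp, pixel location and polarity, into a signed "event frame", one common way of interfacing event data with frame-based pipelines; the sensor resolution and timestamps are made up for illustration.

```python
import numpy as np

WIDTH, HEIGHT = 240, 180  # hypothetical sensor resolution

def events_to_frame(events, t_start, t_end):
    """Accumulate events with t_start <= t < t_end into a signed 2D histogram."""
    frame = np.zeros((HEIGHT, WIDTH), dtype=np.float32)
    for t, x, y, polarity in events:
        if t_start <= t < t_end:
            frame[y, x] += polarity  # +1 for a brightness increase, -1 for a decrease
    return frame

# Example with two synthetic events inside a 1 ms window
events = [(0.0002, 120, 64, +1), (0.0007, 121, 64, -1)]
frame = events_to_frame(events, 0.0, 0.001)
print(frame.sum())  # -> 0.0 (one positive and one negative event)
```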

3.2 Tactile sensors

Tactile sensors are mostly designed to mimic mechanoreceptors, particularly to detect mechanical pressure. The main objectives of tactile sensors are to determine the location, shape, and intensity of contacts. These properties are determined by measuring the instantaneous pressure or force applied to the sensor’s surface at multiple contact points. The after-effects of contact, i.e., body-borne vibrations, may also carry relevant information. Body-borne vibrations are not as commonly measured or exploited as part of haptic sensing; however, there are some examples, e.g., Syrymova et al. (2020); Toprak et al. (2018), including sensors that are inspired by hair follicle receptors or ciliary structures (Alfadhel & Kosel, 2015; Ribeiro et al., 2017; Kamat et al., 2019) and that have proven very effective in obtaining information about the texture of objects (Ribeiro et al., 2020b, a).

Thermoceptors, although an integral part of human haptic perception, are typically not considered tactile sensors in robotic applications. However, temperature sensing is sometimes included because it might help compensate for thermal effects (Tomo et al., 2016), thus helping to obtain a more robust electronic signal related to pressure or vibrations, or because it might help to classify the material of the object in contact (Wade et al., 2017). In contrast, nociceptors have not yet been developed as part of haptic or tactile sensing per se but can be, and have been, implemented in software based on the limitations of robots (e.g., Navarro-Guerrero et al. 2017b, a).

Technologies for tactile sensing have been developed since the early ’70s and have greatly improved in the past ten years (Dahiya et al., 2010; Dahiya & Valle, 2013; Kappassov et al., 2015), but the field is still young, and there are no widely accepted solutions. Several transduction methods have been explored, including capacitive (e.g., Larson et al. 2016), piezoelectric (e.g., Seminara et al. 2013), piezoresistive (e.g., Jung et al. 2015), optical (e.g., Ward-Cherrier et al. 2018; Kuppuswamy et al. 2020), fiber optics (e.g., Polygerinos et al. 2010), and magnetic (e.g., Jamone et al. 2015). Table 1 summarizes the advantages and disadvantages of the different transduction principles for detecting mechanical pressure. For additional information, please refer to Chi et al. (2018).

Table 1 Transduction mechanisms for detecting mechanical pressure. Tactile sensor design is benefiting from rapid nanomaterial and nanocomposite fabrication technology advancements. This table is based on Chi et al. (2018)

3.2.1 Commercial sensors

Although there are some commercial solutions, the costs are still relatively high, and the performance level is not always satisfactory. In the remainder of this section, we present some of the commercial solutions for tactile sensing. Although we are aware of other commercial sensors, such as the WTS-FT by Weiss Robotics GmbH & Co. KG, all but those presented here seem to have been discontinued at the time of writing.

The BioTac® sensor by SynTouch® was launched in 2008. The sensor’s design attempts to mimic some of the human fingertip’s physical properties and sensory capabilities. It consists of a rigid core surrounded by an elastic bladder filled with liquid. This construction provides a compliant surface, allowing it to sense force, vibration, and temperature. SynTouch® offers variations of the technology tailored to different applications. Examples for robotic applications are shown in Fig. 6.

Fig. 6
figure 6

From the left: SynTouch® BioTac®, BioTac® SP, and NumaTac® Tactile Sensors. Images used with permission from SynTouch®https://syntouchinc.com/

The DIGIT tactile sensor (Lambeta et al., 2020) by GelSight is an optical tactile sensor using a piece of elastomeric gel with a reflective membrane coating on top, which enables it to capture fine geometrical textures as deformations in the gel. RGB LEDs illuminate the gel such that a camera can record the deformation.

Seed Robotics’ FTS Tactile pressure sensors (see Fig. 7) are low-cost sensors that offer high-resolution contact force measurement (1mN/0.1g resolution up to 30N range). The sensor compensates for temperature, and it is immune to magnetic interference. The sensors are directly integrated into the robotic hands also offered by the company. However, there is a stand-alone version of the sensor for use in third-party user applications.

Fig. 7
figure 7

Left: the SINGLEX stand-alone tactile pressure sensor version. Right: FTS tactile pressure sensor mounted on a robot finger. Images used with permission from Seed Robotics https://www.seedrobotics.com/

The uSkin sensor by Xela Robotics is a magnetic tactile sensor composed of small magnets embedded in a thin layer of flexible rubber and placed above a matrix of magnetic Hall-effect sensor chips. Upon contact, the magnets are displaced and the magnetic field sensed by the Hall-effect chips changes; the contact forces can be estimated from these variations in the magnetic field. The uSkin sensor can measure the full 3D force vector (i.e., both normal and shear contact forces) at each tactel, with a good spatial resolution (about 1.6 tactels per square centimetre), high sensitivity (minimum detectable force of 1gf), and high frequency (\(>100\)Hz, depending on the configuration). Different versions of the sensor are available to cover both flat and curved surfaces, see Fig. 8 for an example.

Fig. 8
figure 8

Left: a flat version inspired by Tomo et al. (2018a). Right: a curved version inspired by Tomo et al. (2018b). Images with permission from Xela Robotics https://xelarobotics.com/

Finally, Contactile offers both a stand-alone sensor and tactile sensor arrays called the PapillArray sensor, see Fig. 9. These optical sensors consist of infrared LEDs, a diffuser, and four photodiodes encapsulated in a soft silicone membrane. The photodiodes measure the light intensity patterns to infer the displacement and force applied to the membrane. This strategy allows for the measurement of 3D deflections, 3D forces, and 3D vibrations, as well as the inference of emergent properties such as torque, incipient slip, and friction.

Fig. 9
figure 9

Left: Single 3D force tactile sensor. Right: A slim tactile sensor array (PapillArray Sensor) available in different configurations. Images from Contactile https://contactile.com/ licensed under CC BY-NC-ND 4.0

The need for such technologies is pushing research forward in the development of both new sensing technologies and applications such as robotic grasping, smart prostheses, and surgical robots. In particular, enhancements are still needed in a number of aspects (e.g., mechanical robustness, sensitivity and reliability of the measurements, ease of electromechanical integration and replacement) to deploy sensors in practical applications.

Of particular interest are solutions that: are flexible (Larson et al., 2016; Senthil Kumar et al., 2019), stretchable (Bhattacharjee et al., 2013; Büscher et al., 2015) and can cover sizeable (Dahiya et al., 2013) and multi-curved (Juiña Quilachamín & Navarro-Guerrero, 2023; Tomo et al., 2018b) surfaces (possibly with a small number of electrical connections (Juiña Quilachamín & Navarro-Guerrero, 2023)), can detect multiple contacts at the same time (Hellebrekers et al., 2020), can detect both normal and shear forces (Tomo et al., 2018a), can dynamically change the range and sensitivity of the measurements depending on the task (Holgado et al., 2018), are affordable and can be easily manufactured (Juiña Quilachamín & Navarro-Guerrero, 2023; Paulino et al., 2017). For more information on experimental tactile sensing technologies see Chi et al. (2018), and for a specialized review of printable, flexible and stretchable tactile sensors, see Senthil Kumar et al. (2019).

3.3 Data collection and datasets

Data acquisition from tactile sensors still lacks a unified theoretical framework. Besides the sensor itself, tactile data is affected by, among other factors, the sequence of exploratory procedures (EPs, see Sect. 2.3) and the application in which it is to be used. A single grasp can only perceive a portion of an object’s properties, and the perception is limited to the surface that comes in contact with the tactile sensors. Thus, it is difficult, if not impossible, to recognize all properties of an object using one single tactile EP. Unlike vision, tactile perception is intrinsically sequential.

Authors such as Kappassov et al. (2015) and Liu et al. (2017a) have divided tactile object recognition into subcategories in an attempt to create a unified framework for data collection. Kappassov et al. (2015) propose dividing tactile perception into tactile object identification, texture recognition, and contact pattern recognition, whereas Liu et al. (2017a) propose dividing it into perception for shape, perception for texture, and perception for deformable objects. However, there is still no consensus on how to collect and organize data for haptic or visuo-haptic object recognition datasets.

In this section, we provide examples of datasets for multimodal object recognition and grasping.

3.3.1 Datasets for multimodal object recognition

One example of such a dataset comes from Kroemer et al. (2011), who generated a small-scale multimodal dataset for dynamic tactile sensing. Tactile information was collected using a custom whisker-like tactile sensor whose data resembles the Lateral Motion EP. Data were collected for a total of 26 surfaces of 17 different materials. Visual information was collected by taking four grayscale pictures of those objects from different perspectives.

Sinapov et al. (2014) created a multimodal object recognition dataset comprising proprioceptive, auditory, and visual information, but no tactile information. The dataset consists of 100 objects from 20 different categories. All objects were explored five times, using nine haptic interactions, and photographed. The interactions were not extensively described and thus cannot be confidently mapped to Lederman’s EPs. They included press and poke (Pressure), grasp (Enclosure), lift, hold, and push (approximately Unsupported Holding), plus tap, drop, and shake, which seem to be primarily aimed at gathering auditory information. The dataset also contains the corresponding RGB images of the objects and RGB videos recorded while performing the interactions.

Chu et al. (2015) collected a small-scale multimodal dataset for haptic perception, known as the Penn Haptic Adjective Corpus 2 (PHAC-2). The PHAC-2 dataset consists of haptic data collected with a pair of SynTouch® BioTac® sensors, which were mounted on the grippers of a Willow Garage PR2 robot. The labels were collected in a human study, where 25 haptic adjectives were assigned to the objects. The PHAC-2 dataset contains haptic and visual data for 60 household objects. Given the physical constraints of the robot and the BioTac® sensors, the objects had to be between 15 and 80mm in width and at least 100mm in height. There were no restrictions regarding weight since the objects were not lifted. All objects included needed to be at room temperature, clean, dry, and durable. Furthermore, the objects could not be sharp or pointed. Haptic data were collected for four EPs, namely Pressure (squeeze), Enclosure and Static Contact (hold), and Lateral Motion (slide). The dataset includes two versions of the Lateral Motion EP: the first, referred to as the slow slide, is performed with low velocity and substantial contact force, while the second, called the fast slide, is of higher speed and half the contact force of the slow slide. Every EP was repeated ten times per object, and the objects were re-positioned each time. Meanwhile, the visual data consists of high-resolution images of each object from eight different viewpoints.

Another small-scale dataset for visuo-haptic object recognition comes from Toprak et al. (2018), who used a NAO robot (model T14: torso-only). Visual data was collected using one of the two RGB cameras in NAO’s head. For the kinesthetic properties, namely global shape and weight, the joint angles and the electric currents in the motors of both arms were measured while performing the respective EPs. For texture and hardness, inexpensive contact microphones were attached as sensors to NAO’s arm and to a custom-made table, on which the robot performed the corresponding EPs, to capture the resulting vibrations transmitted across the surfaces. A total of 11 everyday objects were carefully selected to cover both visually and haptically ambiguous objects. For each object, ten observations were collected under optimal lighting conditions (controlled and reproducible lab conditions) and another three under real-world lighting.

More recently, Bonner et al. (2021) created a public dataset for visuo-haptic object recognition containing information on 63 different objects. The visual information comes from high-resolution RGB images collected under near-ideal lighting conditions. The kinesthetic data was collected with the RH8D Robotic Hand by Seed Robotics using the Unsupported Holding and Enclosure EPs. The tactile information was captured using contact microphones mounted on the RH8D hand and on a NAO robot that was used to perform the Lateral Motion and Pressure EPs.

3.3.2 Datasets for multimodal object perception for manipulation

Calandra et al. (2017) provided a dataset for evaluating grasp success. Their hardware setup consisted of a 7-DoF Sawyer manipulator equipped with a WSG-50 gripper, one GelSight tactile sensor for each of the two gripper fingers and a Kinect V2 camera placed in front of the robot. First, using the Kinect’s depth information, the object’s position on a table in front of the robot was inferred. The gripper was randomly positioned above the object with its fingers opened. Next, a closing action was executed, and the gripper was lifted from the table. After the lifting action, the tactile and visual information was used to infer whether the object was still on the table or successfully grasped. A label indicating the grasp success was automatically generated. The dataset collected through this automated data collection procedure consists of a total of 9269 grasp samples for 106 different objects.

Another visuo-tactile dataset for grasping and related tasks, such as slip detection or visuo-tactile object classification, is presented by Wang et al. (2019). They used two Intel RealSense SR300 cameras and a UR5 robot arm equipped with an Eagle Shoal hand with piezoresistive tactile sensors. The objects to be grasped were 10 everyday grocery items like detergent bottles or soup cans, intentionally selected to be container-like and either full or empty to generate different tactile readings. The dataset includes 2550 grasping attempts, each containing RGB and depth images from different grasp stages, videos of the whole grasp, readings from the 16 tactile sensors included in the hand, and ground-truth information such as timestamps and grasp outcome.

In the same direction, Li et al. (2018) introduced a dataset for slip detection during manipulation. Their setup consisted of a 6-DoF UR5 robot arm and a WSG-50 parallel gripper, with one of the gripper’s fingers replaced by a GelSight sensor for tactile recordings, and a regular webcam mounted on the side of the gripper for visual recordings. The authors thresholded the relative displacement between the object’s texture and the markers of the GelSight sensor during a grasp attempt to detect whether a slip occurred. The dataset covers examples of translational, rotational, and incipient slips. The data acquisition was done by taking a sequence of consecutive tactile and corresponding camera image pairs at a frequency of 20 Hz. The dataset consists of 1102 grasp-and-lift attempts on 84 different household objects with varying sizes, shapes, surface textures, materials, and weights. The authors provide data from 152 grasp attempts on 10 additional objects for testing purposes.
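
As a rough illustration of this kind of labelling (a sketch under our own simplifying assumptions, not the authors' implementation), slip can be flagged whenever the tracked sensor markers and the tracked object texture move by noticeably different amounts between consecutive tactile frames:

```python
import numpy as np

def detect_slip(marker_disp, texture_disp, threshold=0.5):
    """Flag slip whenever the object texture moves relative to the sensor markers.

    marker_disp, texture_disp: (N, 2) arrays of per-frame 2D displacements (pixels).
    threshold: relative-displacement threshold in pixels (illustrative value).
    """
    relative = np.linalg.norm(texture_disp - marker_disp, axis=1)
    return relative > threshold  # boolean per frame: True where slip is detected

# Example with synthetic displacements for three consecutive tactile frames
markers = np.array([[0.1, 0.0], [0.1, 0.1], [0.2, 0.1]])
texture = np.array([[0.1, 0.0], [0.9, 0.1], [1.5, 0.2]])
print(detect_slip(markers, texture))  # -> [False  True  True]
```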

While the previously presented datasets use a robot to collect the information, some datasets of human grasping can also be used to train robotic grasping. For instance, Brahmbhatt et al. (2019) provide a multimodal dataset of human grasps of household objects. Participants were instructed to grasp 3D-printed objects with a specific post-grasp functional intent. Different post-grasp functional intents lead to different grasping approaches, even for the same object, e.g., when instructed to hand it off versus to use it. The contact surface of the hand with the object represents the haptic modality, which is captured by a FLIR Boson 640 thermal camera. The visual modality, in contrast, is represented by RGB-D images collected with a Kinect V2 camera. The dataset contains 375,000 synchronized RGB-D and thermal images collected while grasping 50 different household objects, giving rich information about human grasps through detailed contact maps.

4 Multimodal machine learning

Once the multimodal data from the sensors, such as those presented in Sect. 3.2, has been collected, it needs to be processed and integrated to make it useful. Relying on different sensory modalities offers several advantages, as discussed in Sect. 2.5. However, the heterogeneity of the data (cf. Sect. 3.3) creates multiple challenges. Understanding these challenges can help in applications and guide the development of new signal processing methodologies to deal with the complexities of multimodal information. In particular, Baltrušaitis et al. (2019) identifies five core challenges: representation, translation, alignment, fusion and co-learning.

In the rest of this section, we outline these general challenges and comment on how they relate to the concrete case of visuo-haptic perception in robotics to facilitate the understanding of architectural decisions and design choices for approaches presented in Sect. 5.

4.1 Representation

The first challenge refers to creating or learning a meaningful representation that allows for the preservation and exploitation of the complementarity or redundancy of the multiple modalities. A representation or feature vector/tensor can be an image, an audio sample, or discrete values such as open or closed. Some of the challenges in creating useful representations from multimodal data are:

  • How to deal with different levels of noise?

  • How to deal with missing data?

  • How to deal with out-of-phase signals or different sampling rates?

  • How to deal with different vector sizes?

Bengio et al. (2013) suggested some desirable properties for representations, including:

  • Smoothness: similarity of concepts should be preserved in the representation space.

  • Natural clustering: different concepts should lead to differentiated representations.

  • Temporal and spatial coherence: consecutive (for sequential data) or spatially close observations should be associated with relevant regions of the representation space.

  • Sparsity: most extracted features should be insensitive to minor variations of any given observation.

  • Expressive: should capture a large number of possible input configurations.

  • Distributed: to allow for reuse and recombination of the activation of parameters or subsets of features across concepts.

  • A hierarchical organization of explanatory factors: increasingly abstract features should be defined in terms of less abstract ones.

More recently, Baltrušaitis et al. (2018, 2019) proposed two categories of multimodal representation: joint and coordinated representations. Joint representations take all the available modalities as input and are used to create a single joint representation. In coordinated representations, each modality is used to create an independent representation. However, intermediate features across modalities are ‘coordinated’ using similarity or structure constraints. Similarity-based coordination could, for instance, minimize a distance metric between the features. In structure-constrained coordination, constraints such as order are used. Examples of structure-constrained coordination are hashing, cross-modal retrieval, and image captioning (Baltrušaitis et al., 2018, 2019).
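
As a minimal sketch of the two families (the visual and tactile feature sizes and the embedding size are illustrative assumptions, not taken from any particular system):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VIS_DIM, TAC_DIM, EMB_DIM = 512, 64, 128  # hypothetical feature/embedding sizes

class JointRepresentation(nn.Module):
    """Joint representation: concatenate both modalities, then project them together."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(VIS_DIM + TAC_DIM, EMB_DIM), nn.ReLU())

    def forward(self, vis, tac):
        return self.fc(torch.cat([vis, tac], dim=-1))

class CoordinatedRepresentation(nn.Module):
    """Coordinated representation: one encoder per modality, plus a similarity constraint."""
    def __init__(self):
        super().__init__()
        self.vis_enc = nn.Linear(VIS_DIM, EMB_DIM)
        self.tac_enc = nn.Linear(TAC_DIM, EMB_DIM)

    def forward(self, vis, tac):
        z_v, z_t = self.vis_enc(vis), self.tac_enc(tac)
        # similarity-based coordination: pull the two embeddings of the same
        # observation together (added to the task loss during training)
        coord_loss = 1.0 - F.cosine_similarity(z_v, z_t, dim=-1).mean()
        return z_v, z_t, coord_loss
```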

The modality representation also affects the fusion strategy (see Sect. 4.4): for example, while images from optical tactile sensors such as the GelSight, or vibration data represented as spectrograms, could allow for early integration with visual data, kinesthetic information likely would not.

4.2 Translation / mapping

A second challenge concerns the translation or mapping of data from one modality to another. In addition to the heterogeneity of the data, the mapping is often not unique and potentially subjective. Thus, the evaluation of the mapping becomes a challenge (Baltrušaitis et al., 2019, 2018).

Baltrušaitis et al. (2019, 2018) indicate that several machine learning applications correspond to translation between two modalities, such as automated text translation, image or video captioning, and speech transcription. In the context of multimodal object perception, translation could, for instance, serve as a mechanism to deal with the absence of a modality.

Baltrušaitis et al. (2019) further categorize multimodal translation into two categories: example-based and generative. Example-based models use a dictionary, which makes models large, task-specific and unwieldy. In contrast, generative approaches construct a model to perform the translation. However, generative models are challenging to build as they require understanding both the source and target modality (Baltrušaitis et al., 2019).

Three broad categories can be identified within generative models: rule-based, encoder-decoder, and continuous generation models (Baltrušaitis et al., 2019). Rule-based models rely on pre-defined rules to translate features. They are more likely to generate syntactically or logically correct translations. Typically, the representation of each modality should share similarities with the representations of the other modalities; for example, Falco et al. (2017) employ point clouds as a visuo-haptic common representation, and they combine data pre-processing, feature engineering and transfer learning techniques to realize an effective mapping. In fact, this category of approaches often requires complex pre-processing pipelines to create the features used for the translation (Baltrušaitis et al., 2019). Encoder-decoder models, on the other hand, encode the source modality to a latent representation which is then used by a decoder to generate the target modality (Keren et al., 2018), reducing the requirements of data pre-processing and feature engineering, although typically requiring larger amounts of data to obtain effective mappings.
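
As a minimal sketch of the encoder-decoder idea for visuo-haptic translation (the dimensions, e.g., a 512-d visual feature mapped to a 19-d tactile reading, and the network sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

VIS_DIM, TAC_DIM, LATENT_DIM = 512, 19, 32  # hypothetical feature sizes

class VisionToTouch(nn.Module):
    """Encoder-decoder translation: visual features -> latent code -> tactile features."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(VIS_DIM, LATENT_DIM), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, TAC_DIM))

    def forward(self, vis):
        return self.decoder(self.encoder(vis))

# Training on paired (visual, tactile) samples would minimize a reconstruction
# loss such as the mean squared error between predicted and measured touch.
model = VisionToTouch()
loss_fn = nn.MSELoss()
vis_batch, tac_batch = torch.randn(8, VIS_DIM), torch.randn(8, TAC_DIM)
loss = loss_fn(model(vis_batch), tac_batch)
```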

Continuous generation models generate the target modality continuously based on a stream of source modality inputs and are most suited for translating between temporal sequences. In general, these models require temporal consistency between modalities (Baltrušaitis et al., 2019); however, learning from weakly-paired training data has been recently attempted by Liu et al. (2019), using sparse dictionary learning.

4.3 Alignment

Determining the relationship between features across modalities is another challenge for multimodal machine learning (Baltrušaitis et al., 2019, 2018). As with the translation challenge, the evaluation metrics might be the primary difficulty here. However, other challenges exist, such as the limited availability of datasets for evaluation, long-range dependencies and ambiguities, and the lack of correspondence between modalities.

Baltrušaitis et al. (2019) identifies two types of alignment: explicit and implicit. In explicit alignment, the correspondence is directly observable and easier to measure, as in automatic video captioning or, in the context of visuo-haptics, the alignment between thermal and RGB-D images in the multimodal dataset of Brahmbhatt et al. (2019) presented in Sect. 3.3.2. In implicit alignment, a latent or intermediate representation is used, for instance, in image retrieval based on text descriptions, where words are associated with regions of an image, or in visuo-tactile fusion learning methods with self-attention mechanisms (Cui et al., 2020).
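
A very simple instance of explicit alignment in a visuo-haptic setting is the temporal alignment of two sensor streams running at different rates. The sketch below matches each camera frame to the nearest tactile sample by timestamp; the 30 Hz and 1 kHz rates are made-up examples:

```python
import numpy as np

image_t = np.arange(0.0, 1.0, 1 / 30)      # 30 Hz camera timestamps (seconds)
tactile_t = np.arange(0.0, 1.0, 1 / 1000)  # 1 kHz tactile timestamps (seconds)

def align_nearest(image_t, tactile_t):
    """For each image timestamp, return the index of the closest tactile sample."""
    idx = np.clip(np.searchsorted(tactile_t, image_t), 1, len(tactile_t) - 1)
    # choose between the neighbour before and after the insertion point
    left_closer = (image_t - tactile_t[idx - 1]) < (tactile_t[idx] - image_t)
    return np.where(left_closer, idx - 1, idx)

pairs = align_nearest(image_t, tactile_t)  # one tactile index per camera frame
print(pairs[:3])  # indices of the tactile samples closest to the first frames
```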

Aligning features across modalities could be necessary to exploit the complementarity of the different modalities.

4.4 Fusion

A fourth challenge is to join information from multiple modalities. Three approaches can be identified based on how the information from different modalities is combined: pre-mapping, midst-mapping and post-mapping fusion (Sanderson & Paliwal, 2004; Toprak et al., 2018). These strategies are also referred to as early, intermediate, and late integration (e.g., Keren et al. 2018).

In pre-mapping fusion, the feature descriptors from the different modalities are concatenated into a single vector prior to the mapping into the decision space. While this strategy is simple and hence easy to implement, the disadvantage is that each modality’s impact on the result is fixed, as it depends on the size of the respective feature vector rather than on its statistical relevance. In midst-mapping fusion, the feature descriptors are provided to the model separately. The model then processes these descriptors in separate streams and integrates them while performing the mapping. Lastly, in post-mapping fusion, each feature descriptor is first mapped into the decision space separately, after which the decisions are combined into a final result. Figure 10 illustrates the different fusion strategies.

Fig. 10 Information fusion strategies, illustrated with two modalities. Top: monolithic or pre-mapping fusion. Middle: midst-mapping fusion. Bottom: post-mapping fusion; the per-modality feature modules are not strictly necessary
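To make the three strategies concrete, the following minimal PyTorch sketch contrasts pre-, midst-, and post-mapping fusion for a two-modality classifier; all layer sizes, module names, and the decision-averaging rule are illustrative assumptions rather than implementations from the cited works.

```python
import torch
import torch.nn as nn

VIS_DIM, HAP_DIM, N_CLASSES = 128, 32, 10  # illustrative feature sizes

class PreMappingFusion(nn.Module):
    """Concatenate raw feature vectors, then map to the decision space."""
    def __init__(self):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(VIS_DIM + HAP_DIM, 64), nn.ReLU(),
            nn.Linear(64, N_CLASSES))

    def forward(self, vis, hap):
        return self.classifier(torch.cat([vis, hap], dim=-1))

class MidstMappingFusion(nn.Module):
    """Process each modality in its own stream, then fuse latent features."""
    def __init__(self):
        super().__init__()
        self.vis_stream = nn.Sequential(nn.Linear(VIS_DIM, 32), nn.ReLU())
        self.hap_stream = nn.Sequential(nn.Linear(HAP_DIM, 32), nn.ReLU())
        self.head = nn.Linear(32 + 32, N_CLASSES)

    def forward(self, vis, hap):
        return self.head(torch.cat([self.vis_stream(vis),
                                    self.hap_stream(hap)], dim=-1))

class PostMappingFusion(nn.Module):
    """Classify each modality separately, then combine the decisions."""
    def __init__(self):
        super().__init__()
        self.vis_clf = nn.Linear(VIS_DIM, N_CLASSES)
        self.hap_clf = nn.Linear(HAP_DIM, N_CLASSES)

    def forward(self, vis, hap):
        # simple decision-level combination: average the class scores
        return 0.5 * (self.vis_clf(vis) + self.hap_clf(hap))

if __name__ == "__main__":
    vis, hap = torch.randn(4, VIS_DIM), torch.randn(4, HAP_DIM)
    for model in (PreMappingFusion(), MidstMappingFusion(), PostMappingFusion()):
        print(model.__class__.__name__, model(vis, hap).shape)  # (4, 10)
```

In practice, the unimodal branches would typically be convolutional or recurrent networks rather than the small linear layers used here.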

Apart from being the most frequently used, midst-mapping fusion also appears to be the most promising of these three approaches as far as performance is concerned (Castellini et al., 2011). Moreover, this integration strategy would be the best choice considering the principles of how multimodal object recognition is organized in the brain, as outlined in Sect. 2.6, since it can model hierarchical processing in substreams that later converge to a decision. This kind of setup has been used extensively, with two substreams processing visual and haptic inputs separately. Nevertheless, to the best of our knowledge, only Toprak et al. (2018) have investigated all three principles simultaneously, also including the separate processing of object shape and material properties in two substreams as well as the use of self-organizing mechanisms for processing and integrating the information.

4.5 Co-learning or transfer learning

The final challenge described by Baltrušaitis et al. (2018) is co-learning. Co-learning is described as a more general form of transfer learning at the level of representation or inference. Co-learning is particularly useful when data for some modality is limited, and information from a different modality can be used to aid training by exploiting complementary information across modalities. Thus, it is particularly relevant in multimodal object perception, where visual data is ubiquitous and tactile data is scarce. Co-learning is task-independent and could be used in fusion, translation, and alignment models (Baltrušaitis et al., 2018).

Baltrušaitis et al. (2019) identified three types of co-learning approaches: parallel, non-parallel, and hybrid. Parallel-data approaches require observations from the same dataset and the same instances. In contrast, non-parallel-data approaches can use data from a different dataset with overlapping classification categories. Finally, hybrid-data approaches use a shared modality or dataset to achieve the transfer (Baltrušaitis et al., 2019). More recently, Rahate et al. (2022) extended this taxonomy to cover missing modalities, the presence of noise, annotations, domain adaptation, and interpretability and fairness. For a complete description of the taxonomy and examples, please see Rahate et al. (2022).

The small number and limited size of public datasets for multimodal object perception motivate the study of transfer learning from visual to tactile object recognition. Such initiatives would also help to cope with the diversity of robot embodiments, i.e., different sensors and actuators, which hinders progress on multimodal object perception. However, knowledge transfer from one modality to another is still a nascent field of research.

5 Applications of multimodal object perception

This section presents examples of multimodal object perception applications in object recognition, peripersonal space representation, and object manipulation. Due to the heterogeneity of the applications, experimental setups, and datasets, no cross-comparison is provided; instead, selected examples offer a glimpse into the state of the art of multimodal object perception applications.

5.1 Multimodal object recognition

Recognizing objects and their properties is crucial for effective interaction with them, in both biological and artificial systems. As such, an extensive body of work in this field exists. Here, we provide an overview of the techniques commonly used to address this problem.

5.1.1 Unsupervised learning

Toprak et al. (2018) presented a brain-inspired architecture for visuo-haptic object recognition, as outlined in Sect. 2.6. Toprak et al. implemented an architecture that incorporates the main principles identified in the brain’s processing of object-related stimuli: (1) hierarchical processing, (2) the processing of stimuli separated by object properties rather than by modality, and (3) experience-driven learning. They compared their brain-inspired architecture against a monolithic (pre-mapping fusion) architecture, where the features of all modalities were concatenated before processing, and a modality-based integration strategy, where visual and haptic features were processed in two separate streams before being integrated by a final object classifier. Both of these strategies are commonly used in multimodal learning. To explore whether the brain-inspired processing principles could be useful for artificial agents, Toprak et al. implemented all three processing architectures using Growing When Required (GWR) networks on the same dataset and preprocessed input vectors. The hyperparameters of each architecture were optimized separately using hyperopt. The results indicate that hierarchical processing was indeed beneficial. However, the results for the other two principles were not conclusive, and further research is needed. Toprak et al. further indicated that the size and quality of the dataset used might have played an essential role in exploring the value of processing object properties versus modalities in separate streams.
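For readers unfamiliar with GWR networks, the sketch below illustrates the core mechanism of growing a new node whenever the best-matching node represents the input poorly despite being well trained; the habituation rule and all thresholds are simplified, illustrative assumptions, and this is not a reproduction of Toprak et al.'s implementation.

```python
import numpy as np

class SimpleGWR:
    """Simplified Growing-When-Required network (illustrative sketch)."""

    def __init__(self, dim, act_thresh=0.85, hab_thresh=0.3,
                 eps_win=0.1, eps_nb=0.01, max_age=50, tau=0.1):
        self.W = np.random.rand(2, dim)   # start with two random nodes
        self.h = np.ones(2)               # firing (habituation) counters
        self.edges = {}                   # (i, j) with i < j -> age
        self.act_thresh, self.hab_thresh = act_thresh, hab_thresh
        self.eps_win, self.eps_nb = eps_win, eps_nb
        self.max_age, self.tau = max_age, tau

    def _neighbours(self, i):
        return [b if a == i else a for (a, b) in self.edges if i in (a, b)]

    def update(self, x):
        d = np.linalg.norm(self.W - x, axis=1)
        s, t = np.argsort(d)[:2]                  # best and second-best matching nodes
        self.edges[(min(s, t), max(s, t))] = 0    # (re)connect the winner pair
        if np.exp(-d[s]) < self.act_thresh and self.h[s] < self.hab_thresh:
            # poor match by an already well-trained node: grow a new node
            r = len(self.W)
            self.W = np.vstack([self.W, (self.W[s] + x) / 2.0])
            self.h = np.append(self.h, 1.0)
            del self.edges[(min(s, t), max(s, t))]
            self.edges[(min(s, r), max(s, r))] = 0
            self.edges[(min(t, r), max(t, r))] = 0
        else:
            # otherwise adapt the winner and its neighbours towards x
            self.W[s] += self.eps_win * self.h[s] * (x - self.W[s])
            for n in self._neighbours(s):
                self.W[n] += self.eps_nb * self.h[n] * (x - self.W[n])
        self.h[s] -= self.tau * self.h[s]         # habituate (reduce firing counter)
        for e in [e for e in self.edges if s in e]:
            self.edges[e] += 1                    # age edges incident to the winner
        self.edges = {e: a for e, a in self.edges.items() if a <= self.max_age}

if __name__ == "__main__":
    gwr = SimpleGWR(dim=4)
    for x in np.random.rand(500, 4):
        gwr.update(x)
    print(f"{len(gwr.W)} nodes after training")
```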

5.1.2 Supervised learning

Güler et al. (2014) used pre-mapping fusion to classify the content of containers. The containers were squeezed, and both pressure and perceived visual deformation were used for classification. A three-fingered Schunk Dextrous Hand with pressure-sensitive tactile sensors was used to collect the haptic information, and an RGB-D camera placed one meter away was used to collect the visual data, although only a small region of interest around the robot’s fingers was used for classification. The Tetra Pak containers were either empty or filled to 90% with water, yoghurt, flour, or rice. The data collected from multiple grasps was classified using k-means, quadratic discriminant analysis (QDA), k-nearest neighbours (kNN), and support vector machines (SVM). The results show that either modality is sufficient to perform the classification in this case, but classification accuracy improves by up to around 3% under the tested conditions when the modalities are combined.

Corradi et al. (2017) compared one pre-mapping fusion approach and two midst-mapping fusion approaches. They used an optical tactile sensor consisting of an illuminated balloon-like silicone membrane and an internal camera detecting the shadow patterns created on the membrane. The camera images were processed using Zernike moments, which provide rotational invariance, and PCA was then used for dimensionality reduction. The visual data was processed using a bag-of-words (BoW) model of SURF features. The visuo-tactile recognition was then performed in three ways: (1) in the pre-mapping fusion approach, the unimodal feature vectors were concatenated and kNN was used for classification; in the midst-mapping fusion approaches, the posterior probabilities (the probability of the label given the observation) were estimated for each modality, and classification was based on (2) the object label that maximizes their product or (3) the object label that maximizes the sum of the posteriors weighted by the number of training samples available for each modality. Corradi et al. showed that multimodal classification achieves higher accuracy than either modality alone, with the posterior-product approach achieving the highest accuracy among the tested approaches.
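The two midst-mapping variants reduce to a few lines of decision-level arithmetic; the posteriors and sample counts below are made up for illustration and are not Corradi et al.'s values.

```python
import numpy as np

# Hypothetical posteriors P(label | observation) for three object labels
p_tactile = np.array([0.6, 0.3, 0.1])
p_visual  = np.array([0.2, 0.5, 0.3])

# (2) product rule: pick the label maximizing the product of the posteriors
label_product = np.argmax(p_tactile * p_visual)

# (3) weighted sum: weight each posterior by the number of training samples
#     available for that modality (illustrative counts)
n_tactile, n_visual = 40, 400
w = np.array([n_tactile, n_visual]) / (n_tactile + n_visual)
label_weighted = np.argmax(w[0] * p_tactile + w[1] * p_visual)

print(label_product, label_weighted)
```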

Bhattacharjee et al. (2018) combined haptic information (i.e., force and motion) with thermal sensing to recognise objects in daily living environments. Several machine learning techniques were compared to train and test classifiers on a dataset of more than 60 objects. The data were collected with different robot movements (e.g., speed, direction) and at different times of the day (e.g., morning, afternoon, night) to reproduce the variability encountered in real-world conditions, generating significant differences in the haptic and thermal information. The results highlighted the importance of using multimodal information, especially in very unstructured environments characterised by high variability of the sensing conditions.

Liu et al. (2017b); Liu and Sun (2018) implemented a midst-mapping fusion approach using a kernel sparse coding method. Liu et al. used a three-fingered BarrettHand with capacitive tactile sensors in all three fingers and the palm; the tactile sensors have 24 taxels per finger with a spatial resolution of 5 mm. The tactile information was processed using the canonical time-warping (CTW) method, while the covariance descriptor was used to characterize the visual information. The dataset consisted of 18 household objects. In general, kernel sparse coding (KSC) builds on the idea that a signal can be reconstructed as a linear combination of atoms from a dictionary, with which the data can then be encoded sparsely. However, this method fails to capture the intrinsic relations between the different data sources, and thus it can only be applied to each modality separately. To address this problem, Liu et al. proposed joint group kernel sparse coding (JGKSC). Their results showed that fusing the visual and tactile information using the JGKSC method led to a higher classification accuracy than applying kNN or KSC to each modality separately.

More recently, deep learning methods have also started to be used in multimodal object recognition. For instance, Gao et al. (2016) implemented a deep learning-based midst-mapping fusion approach and tested it on the PHAC-2 dataset. The haptic data from the two BioTac® sensors were normalized and downsampled to match the lowest sampling rate, and four of the 19 electrode impedance channels were selected using PCA. Data augmentation was performed in two ways: firstly, the readings from the two sensors were used as two distinct instances; secondly, when downsampling the data, five different starting points were selected. Gao et al. suggest that the signals from the two sensors and from the different downsampling strategies were highly similar, which resulted in overfitting of the CNN model used. The visual CNN model was based on the GoogLeNet architecture pre-trained on the Materials in Context Database (MINC). The preprocessing of the visual data consisted of subtracting the mean values from the RGB image and resizing it using a central crop. Finally, the feature vectors resulting from the haptic and visual networks were concatenated and fed into a fully-connected (FC) layer trained with a hinge loss. The performance was evaluated using the area under the curve (AUC) metric. The multimodal architecture performed ca. 3% better than the best unimodal network. Moreover, Gao et al. noted that the haptic classifier tends to have a high recall, predicting many adjectives for each class, whereas the visual classifier had higher precision. Finally, the multimodal classifier had higher precision and recall than the haptic classifier and higher recall than the visual classifier.

Tatiya and Sinapov (2019) implemented a post-mapping fusion approach on the dataset by Sinapov et al. (2014) described in Sect. 3.3. Tatiya and Sinapov applied a tensor-train gated recurrent unit (TT-GRU) to process the visual information available in the dataset, while both the acoustic and haptic data were processed using CNNs. For the acoustic data, the audio was preprocessed into two channels, the first consisting of the log-scaled Mel-spectrogram and the second of the spectrogram’s derivative. The haptic data was downsampled from 500 Hz to 50 Hz to align with the video and acoustic data. The multimodal fusion network consisted of the concatenated output vectors of each unimodal network, a fusion layer, and an output layer. Each unimodal network was optimized to recognize the category of the objects, so these networks can be used as stand-alone classifiers or integrated into a multimodal network. Overall, the results were comparable to the earlier work by Sinapov et al. (2014). However, the baseline and the proposed approach show their respective strengths on data from different exploratory procedures (EPs), and whether this complementary performance can be attributed to the dataset or to the architectures used remains unclear.

Abderrahmane et al. (2018) applied zero-shot learning to an object classification task, in which a multimodal CNN trained on a set of objects was used to recognize novel objects that had never been seen or touched before; relevant semantic attributes (e.g., round, soft, bumpy) were encoded from visuo-tactile data during training and then used to recognize the novel objects, with an accuracy of 72%.

Taunyazov et al. (2020) proposed a Visual-Tactile Spiking Neural Network (VT-SNN) that combines information coming from two event-driven sensors: a novel neuromorphic tactile sensor, NeuTouch, and a Prophesee event camera. The network was trained on two tasks: container classification and rotational slip detection. A comparative experimental analysis showed that the combination of vision and touch performed better than vision or touch alone.

5.1.3 Transfer learning

One of the challenges of transfer learning (co-learning) is that machine learning models are built on the assumption that training and test data are drawn from the same distribution. However, this assumption does not hold when transferring knowledge between different robots or sensor modalities. A possible solution is domain adaptation, a.k.a. transfer learning (e.g., Daumé III and Marcu 2006; Wang and Deng 2018), in which training samples from a source dataset are adapted to fit a target distribution.

One example of domain adaptation applied to multimodal object recognition was recently presented by Tatiya et al. (2020a). Tatiya et al. used a probabilistic variational auto-encoder network (\(\beta \)-VAE) to cope with missing or defective sensors or with new behavioural modalities, such as those related to a new exploration procedure. They also implemented a probabilistic variational encoder-decoder network (\(\beta \)-VED) to transfer knowledge from one or multiple robots to another. In both cases, the \(\beta \)-VAE and \(\beta \)-VED were implemented using multi-layer perceptrons, and object classification was performed using an SVM. For testing, the dataset of Sinapov et al. (2014) described in Sect. 3.3 was used; in particular, 15 of the 20 object categories were randomly selected for training, and the remaining five were used to test transfer learning between sensory modalities or different behaviours. Tatiya et al. report that such an approach based on \(\beta \)-VAE and \(\beta \)-VED can effectively transfer feature representations from one or more sensory modalities to another, with a performance comparable to learning those representations from scratch.
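For reference, a minimal \(\beta \)-VAE objective is sketched below in PyTorch; the encoder/decoder sizes are illustrative assumptions and this is not Tatiya et al.'s exact network. A \(\beta \)-VED follows the same objective but decodes a target modality (or robot) different from the one that is encoded.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    """Minimal beta-VAE with MLP encoder/decoder (illustrative sizes)."""
    def __init__(self, in_dim=64, latent_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU())
        self.mu = nn.Linear(32, latent_dim)
        self.logvar = nn.Linear(32, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                 nn.Linear(32, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def beta_vae_loss(x, x_rec, mu, logvar, beta=4.0):
    rec = F.mse_loss(x_rec, x, reduction="sum")                  # reconstruction term
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL divergence
    return rec + beta * kld  # beta > 1 encourages a more disentangled latent space

model = BetaVAE()
x = torch.randn(16, 64)          # stand-in multisensory feature vectors
x_rec, mu, logvar = model(x)
loss = beta_vae_loss(x, x_rec, mu, logvar)
```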

Falco et al. (2019) presented a four-step visual-to-tactile transfer architecture for object recognition. Firstly, a visuo-tactile common representation based on point clouds was preprocessed to obtain similarly sized representations: an equalizing-partiality step filtered out noise and reconstructed missing portions of the surface, while a uniforming-density step downsampled the point cloud to a more uniform point density.

Secondly, since the representations of both modalities remain imperfect despite preprocessing, the redundancy of the information was increased to create a more robust feature set, which was later compressed using singular value decomposition (SVD).

Thirdly, three transfer learning methods based on dimensionality reduction were tested, namely transfer component analysis (TCA), subspace alignment (SA), and geodesic flow kernel (GFK). TCA and SA learn feature representations that are invariant across domains, whereas GFK focuses on geometric and statistical changes from the source domain to the target domain.

Finally, for object classification, kNN and SVM were compared. The architecture was tested with a dataset of 15 objects, including 40 visual and five tactile samples per object. The version using transfer learning based on GFK and an SVM achieved an accuracy of up to 94.7%, comparable to classification results for unimodal object recognition on this dataset. Moreover, Falco et al. (2019) reported that the preprocessing step contributes about 13% to the performance, while the GFK transfer learning accounts for about 20%; the other transfer learning methods tested achieved a very low accuracy. A possible disadvantage of the proposed method is the need for both the source data and (a portion of) the target data.
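Among the three transfer methods of the third step, subspace alignment is the simplest to illustrate: the source and target PCA bases are computed and the source subspace is aligned to the target one before training a classifier. The sketch below uses random stand-in features and is only a generic illustration of the idea, not Falco et al.'s pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def subspace_alignment(X_src, y_src, X_tgt, n_components=10):
    """Align the source PCA subspace to the target PCA subspace,
    then train a classifier on the aligned source features."""
    Xs_c = X_src - X_src.mean(axis=0)
    Xt_c = X_tgt - X_tgt.mean(axis=0)
    Ps = PCA(n_components).fit(Xs_c).components_.T   # (d, k) source basis
    Pt = PCA(n_components).fit(Xt_c).components_.T   # (d, k) target basis
    M = Ps.T @ Pt                                    # alignment matrix
    src_aligned = Xs_c @ Ps @ M                      # source, projected and aligned
    tgt_projected = Xt_c @ Pt                        # target in its own subspace
    clf = SVC().fit(src_aligned, y_src)
    return clf, tgt_projected

# toy example with random "visual" (source) and "tactile" (target) features
rng = np.random.default_rng(0)
X_vis, y_vis = rng.normal(size=(200, 50)), rng.integers(0, 3, 200)
X_tac = rng.normal(size=(30, 50))
clf, X_tac_aligned = subspace_alignment(X_vis, y_vis, X_tac)
pred = clf.predict(X_tac_aligned)
```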

Tatiya et al. (2020b) proposed a framework for knowledge transfer using kernel manifold alignment (KEMA). Manifold alignment aligns datasets and projects them into a common latent space. The local geometry of each manifold is preserved while the correlations between manifolds are extracted. In KEMA, the common latent space was used for training instead of each robot’s raw sensory data, allowing knowledge transfer.

Then two multi-class SVM models with the RBF kernel were trained, one dedicated to speeding up object recognition and the other to recognising novel objects. For the first case of speeding up recognition, Tatiya et al. (2020b) used two source robots with extensive experience of the objects and a novice robot with limited experience. The sensory experience of all three robots was used to build the latent space and train the model. The results showed a delicate balance between the amount of source data used and performance. However, when that balance was met, the target robot performed consistently better than a robot trained only using its own sensory data.

For the novel object recognition case, Tatiya et al. (2020b) used two expert robots and a novice (target) robot having extensive experience with a few objects and no experience with other objects. The sensory data of all three robots were used to train the model. The results showed that KEMA could transfer existing knowledge to the target robot, accurately classifying all unseen objects. Different variations of the experiments showed that the target robot consistently achieved better than chance accuracy. Some of the limitations of this approach were the need to use the target robot’s sensory data for training the model and the need for all robots to perform the same actions on the same objects. Another limitation was that all experiments were performed with simulated robots, and the only haptic difference was the objects’ weight.

Luo et al. (2018) applied maximum covariance analysis (MCA) to crossmodal texture recognition. They introduced the ViTac dataset, consisting of 100 different cloth textures collected with an RGB camera and a GelSight sensor. For MCA, both modalities were preprocessed independently; these features were then used to create a covariance matrix, and finally singular value decomposition (SVD) was applied to reduce the dimensionality. MCA is typically used with handcrafted features to create the covariance matrix; however, Luo et al. used a pre-trained AlexNet, replaced the fully-connected layers, and called their method DMCA. Both visual and tactile data were presented during the learning phase, whereas only one modality was used for testing. Luo et al. showed that the classification performance of DMCA improves as the output dimension increases, reaching a maximum at approximately 25 output dimensions. The classification performance for tactile data was ca. 90%, while that for visual data was ca. 92.6%. In both cases, these results were ca. 7% better than unimodal classification on this data using a pre-trained AlexNet.
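The core of MCA can be written compactly as an SVD of the cross-covariance between paired (centred) visual and tactile features; the sketch below uses random stand-in data and is not DMCA itself, which replaces handcrafted features with CNN activations.

```python
import numpy as np

def mca(X, Y, k=25):
    """Maximum covariance analysis: project two paired feature sets onto
    the directions of maximal cross-covariance (illustrative sketch)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / (len(X) - 1)          # cross-covariance matrix (p x q)
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return Xc @ U[:, :k], Yc @ Vt[:k].T   # k-dimensional latent features

# toy paired "visual" and "tactile" features
rng = np.random.default_rng(0)
X_vis, X_tac = rng.normal(size=(100, 64)), rng.normal(size=(100, 48))
Z_vis, Z_tac = mca(X_vis, X_tac, k=25)
print(Z_vis.shape, Z_tac.shape)           # (100, 25) (100, 25)
```

At test time, either projection can be used on its own, which is what enables crossmodal recognition from a single modality.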

Lee et al. (2019a) presented conditional generative adversarial nets (cGANs) to generate visual data from tactile sensory input and vice versa. They used the ViTac dataset of cloth textures, which consists of 100 different pieces of fabric, with RGB macro images of the fabrics and tactile readings from a GelSight sensor. The results showed that visual-to-tactile generation achieves a similarity of around 90%, whereas tactile-to-visual generation achieved similarities ranging from 50% to 90%. Finally, the classification of both generated and original visual and tactile images achieved an accuracy of ca. 90%. Data augmentation of this kind seems to be a promising direction for some modalities, particularly when translating from a higher-dimensional modality like vision to a lower-dimensional one like tactile images.
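A heavily simplified sketch of such a conditional generator-discriminator pair for visual-to-tactile generation is shown below; the layer choices, image sizes, and loss terms are illustrative assumptions and do not reproduce Lee et al.'s architecture. A full implementation would typically also add a reconstruction loss and a proper training loop.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a conditioning visual image to a tactile-like image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),           # 64 -> 32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),          # 32 -> 16
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 32
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Tanh())  # 32 -> 64

    def forward(self, visual):
        return self.net(visual)

class Discriminator(nn.Module):
    """Scores (visual, tactile) image pairs as real or generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Flatten(), nn.Linear(32 * 16 * 16, 1))

    def forward(self, visual, tactile):
        return self.net(torch.cat([visual, tactile], dim=1))

G, D = Generator(), Discriminator()
bce = nn.BCEWithLogitsLoss()
visual, tactile_real = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
tactile_fake = G(visual)

# discriminator step: real pairs -> 1, generated pairs -> 0
d_loss = bce(D(visual, tactile_real), torch.ones(4, 1)) + \
         bce(D(visual, tactile_fake.detach()), torch.zeros(4, 1))
# generator step: try to fool the discriminator
g_loss = bce(D(visual, tactile_fake), torch.ones(4, 1))
```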

5.2 Multimodal peripersonal space representation

The peripersonal space (the space immediately surrounding the body) is crucial for effective interaction with the environment. An example of work in this area is presented by Bhattacharjee et al. (2015), in which an iterative algorithm is used to extrapolate haptic labels (force data) to regions of an RGB-D image with a similar colour and depth to those for which the haptic data was explicitly measured. The algorithm operates under the assumption that visible surfaces that look similar to one another are likely to have similar haptic properties, and it can reach an average performance of 76.02% employing 40 contact points in simulation. For haptic categorization, a Hidden Markov Model (HMM) based classification method was employed, which takes force data as input and outputs sparse haptic labels, each with a 2D colour image coordinate. Later, Shenoi et al. (2016) used a dense Conditional Random Field (CRF) to produce a haptic map based on the HMM classification and a vision-based haptic label estimation using a CNN. This approach improved the average performance to 93% for 40 contact points in simulation; when tested in a foliage environment, the algorithm achieves 82.52% performance after ten reaches.

A cognitively inspired model for peripersonal space learning presented by Roncone et al. (2016) was implemented on the iCub robot. The model is used to learn approach/avoidance behaviour with the closest body part based on the distance and velocity of the stimuli. The model is fast to learn (a single interaction can already produce a functional representation, which can be refined over time), capable of learning distributed representations incrementally, and stimulus-agnostic. Thus, the algorithm can be used online and in real time without pretraining. The use of a distributed representation, although beneficial overall, imposes high computational and memory requirements. The current implementation assumes that the robot’s kinematics and the transformations between the different reference frames are given; other assumptions include the motor primitives used for learning (i.e., double-touch behaviour). The model’s implementation follows a developmental timeline divided into three phases: data collection through self-exploration or self-touch (motor-tactile stimulation); data collection from externally approaching objects, considering time to contact (visuo-tactile stimulation); and, finally, learning approach/avoidance behaviours irrespective of whether the perceived stimulus is of motor or visual origin.

Building upon Roncone et al. (2016), Straka and Hoffmann (2017) proposed a model using a Restricted Boltzmann Machine and a feedforward neural network. The stimulus’s position and velocity are estimated visually and represented as a normal distribution to account for uncertainties. The resulting representation is then fed into a feedforward neural network that learns to predict the location of a contact. The model was tested in a simulated 2D scenario and can expand the peripersonal space representation when confronted with fast stimuli; it can also confidently predict contact based only on visual estimates of position and velocity.

5.3 Multimodal object perception for manipulation

Robotic manipulation has a huge impact on many industrial and service applications, and visuo-tactile perception has been actively studied to improve the performance of robots, for instance by allowing more secure object grasping and handling with a lower risk of damaging delicate objects. In the multimodal setting, visual perception is predominantly used for planning reaching trajectories and identifying grasp type and orientation, while haptic perception is typically used for slippage prevention and compliant grasping. The classical way of tackling grasping has been with model-based, i.e., analytical, approaches, and examples of such multimodal perception for grasping and manipulation are abundant in the literature. However, as in other fields, there has recently been a tendency to move from model-based approaches to data-driven ones. In this section, we outline the importance of using both the visual and the haptic modality for grasping and manipulation tasks by presenting several recent approaches whose results show that multimodal variants outperform unimodal ones; see Bohg et al. (2014) for an in-depth survey of older data-driven grasping approaches.

5.3.1 Reaching

Nguyen et al. (2019) proposed a visuo-proprioceptive-tactile integration model for a humanoid robot based on how infants learn to reach for an object. The authors used the iCub robot in simulation, with emulated tactile sensor regions distributed along the left arm and forearm representing the haptic modality, images from the robot’s two eye-cameras representing the visual modality, and the configuration of the head, arm, and torso joints representing proprioception. The proposed model takes the images from the eye-cameras and the head joint configuration as input and predicts a list of torso and arm joint configurations for reaching the object. Convolutional feature extractors were used to obtain feature descriptors from the visual input, after which the descriptors from both visual streams were concatenated with the head joint values. The concatenated descriptors were fed to a two-layer MLP, from which a third layer branched out to provide region-specific weights for mapping each of the 22 tactile regions to an input-specific arm-torso joint configuration. The trained model could successfully infer arm-torso configurations to perform region-specific reaching of the object with the arm or the forearm.

5.3.2 Grasping

Once an object is reached, the robot can grip it and lift it. At this stage, it is crucial to find a good gripper configuration and to apply an adequate force such that the grasp is successful. Calandra et al. (2018) presented a data-driven, action-conditioned approach for predicting grasp success that can be used to determine the most promising grasping action based on raw visuo-tactile information. Given an action consisting of a 3D motion, an in-plane rotation, and a change of the force applied by the gripper, the proposed model uses a midst-mapping fusion strategy to combine the different modalities and predict the grasp outcome. First, the visual input from a Kinect v2 camera and the tactile input from two GelSight sensors attached to the fingers of a Weiss WSG-50 gripper are separately processed by CNNs, while an MLP processes the action channel. The latent features are then concatenated and fed to an MLP that outputs the probability of a successful grasp. The results show that the multimodal variant outperformed unimodal or hard-coded baselines when grasping previously unseen objects. Furthermore, the qualitative analysis shows that the model learned meaningful strategies for positioning the gripper and for choosing how much force to apply for successful grasping.
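The action-conditioned evaluation can be sketched as follows: a network scores candidate actions given the current visual and tactile observations, and the most promising one is executed. The architecture sizes, the action parameterization, and the candidate-sampling loop below are illustrative assumptions, not Calandra et al.'s implementation.

```python
import torch
import torch.nn as nn

class GraspOutcomePredictor(nn.Module):
    """Predicts P(success) from a visual image, a tactile image, and a candidate action."""
    def __init__(self, action_dim=5):
        super().__init__()
        def small_cnn():
            return nn.Sequential(
                nn.Conv2d(3, 8, 5, stride=2), nn.ReLU(),
                nn.Conv2d(8, 16, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.vis_cnn, self.tac_cnn = small_cnn(), small_cnn()
        self.act_mlp = nn.Sequential(nn.Linear(action_dim, 16), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(16 + 16 + 16, 32), nn.ReLU(),
                                  nn.Linear(32, 1))

    def forward(self, vis, tac, action):
        z = torch.cat([self.vis_cnn(vis), self.tac_cnn(tac),
                       self.act_mlp(action)], dim=-1)
        return torch.sigmoid(self.head(z))  # probability of a successful grasp

model = GraspOutcomePredictor()
vis, tac = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
candidates = torch.randn(10, 5)                        # 10 candidate actions
scores = torch.stack([model(vis, tac, a.unsqueeze(0)) for a in candidates])
best_action = candidates[scores.argmax()]              # execute the most promising one
```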

In the same direction, Cui et al. (2020) suggested a visuo-tactile fusion learning method with a self-attention mechanism for determining the grasp outcome. Their model’s architecture consists of three modules: a feature extraction module, a module incorporating visual-tactile fusion and self-attention, and a classification module predicting whether a grasp would be successful. The feature extraction modules for the vision and tactile channel are based on CNNs. The feature fusion module performs a slice-concatenation of the visual and tactile features of particular positions in the corresponding feature maps. Then the self-attention mechanism generates a weighted feature map that learns to determine the importance of different spatial locations. In this way, the overall architecture could learn some aspects of the cross-modal position-dependent features. Finally, the classification module, composed of two fully-connected layers, maps the extracted visuo-tactile features to either a successful or unsuccessful grasp. The authors ran experiments and ablation studies considering different model input variants and tactile signal types, reporting state-of-the-art results on two publicly available datasets.
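The sketch below shows how a standard scaled dot-product self-attention layer can re-weight the spatial positions of a fused visuo-tactile feature map; it is a generic module that only loosely mirrors the fusion and self-attention design of Cui et al. (2020).

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Scaled dot-product self-attention over the spatial positions
    of a fused (visual + tactile) feature map."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, fmap):                              # fmap: (B, C, H, W)
        B, C, H, W = fmap.shape
        q = self.q(fmap).flatten(2).transpose(1, 2)       # (B, HW, C)
        k = self.k(fmap).flatten(2)                       # (B, C, HW)
        v = self.v(fmap).flatten(2).transpose(1, 2)       # (B, HW, C)
        attn = torch.softmax(q @ k / C ** 0.5, dim=-1)    # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        return out + fmap                                 # residual connection

# fuse visual and tactile feature maps by channel concatenation, then attend
vis_fmap, tac_fmap = torch.randn(2, 32, 8, 8), torch.randn(2, 32, 8, 8)
fused = torch.cat([vis_fmap, tac_fmap], dim=1)            # (2, 64, 8, 8)
weighted = SpatialSelfAttention(64)(fused)
```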

5.3.3 Maintaining grasping

Once the object is grasped and lifted, slip detection is essential for maintaining a successful grasp; for instance, the gripper force can be adjusted to prevent objects from dropping when a slip is detected. In this direction, Li et al. (2018) proposed a data-driven visuo-tactile model for slip detection of grasped objects based on a DNN architecture. Their model uses a sequence of eight consecutive GelSight and corresponding camera image pairs recorded during a grasp-and-lift attempt. Each modality undergoes a separate feature extraction step through a pre-trained CNN, after which the latent features of both modalities are concatenated (midst-mapping) and passed through an additional FC layer. LSTM layers are used on top of the FC layer, and a final FC layer provides the probability that a slip occurred during the period covered by the image sequence. During the experimental evaluation, several conditions were tested, such as the type of image input (raw vs. difference images), the type of feature extractor (different off-the-shelf CNN models), and the type of information (visual, tactile, or visuo-tactile). The best-performing model used combined visuo-tactile information, significantly outperforming the unimodal approaches and achieving 88% accuracy in detecting slips on a test dataset of unseen objects.
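A minimal sketch of such a sequence model is shown below; the feature dimensions, sequence length, and layer sizes are illustrative assumptions rather than Li et al.'s exact configuration, and the per-frame CNN features are replaced by random vectors.

```python
import torch
import torch.nn as nn

class SlipDetector(nn.Module):
    """Classifies an 8-frame sequence of concatenated visuo-tactile
    feature vectors as slip / no-slip."""
    def __init__(self, vis_dim=128, tac_dim=128, hidden=64):
        super().__init__()
        self.fc = nn.Linear(vis_dim + tac_dim, hidden)    # fuse per time step
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)                   # P(slip) for the sequence

    def forward(self, vis_seq, tac_seq):                  # (B, T, vis_dim), (B, T, tac_dim)
        fused = torch.relu(self.fc(torch.cat([vis_seq, tac_seq], dim=-1)))
        _, (h_n, _) = self.lstm(fused)
        return torch.sigmoid(self.out(h_n[-1]))

# in practice, vis_seq/tac_seq would come from a pre-trained CNN applied to each
# camera and GelSight frame; random stand-in features are used here
vis_seq, tac_seq = torch.randn(4, 8, 128), torch.randn(4, 8, 128)
p_slip = SlipDetector()(vis_seq, tac_seq)                 # shape (4, 1)
```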

5.3.4 Multi-stage grasping pipelines

Unlike the previously mentioned end-to-end learning approaches, Ottenhaus et al. (2019) proposed a multi-stage pipeline that combines vision and haptic information to find the most suitable grasp pose. Depth information from the object’s front side and touch information from its back side were fused to construct a precise voxel representation of unknown objects. Next, planners proposed grasp hypotheses, for which a neural network provided scores to decide on the most suitable grasp. Finally, the approach and grasp actions to lift the object of interest were executed. While the authors used existing methods for different parts of the pipeline, their main contribution was the neural network that proposes grasp scores from the voxel representation of the object and the rotation matrix of a grasp pose candidate. The network architecture is an example of midst-mapping fusion, where the outputs of a CNN feature extractor for the voxel input and an MLP feature extractor for the pose input are concatenated and fed into a final MLP that predicts the probability of a successful grasp. The neural network was trained in simulation, but its performance was validated on a real ARMAR-6 humanoid robot, with a head-mounted PrimeSense RGB-D camera and a force-torque sensor in the wrist of the robot’s arm used for haptics.

Another multi-stage pipeline was recently proposed by Siddiqui et al. (2021). Firstly, RGB-D sensing from a Kinect V2 camera was used to identify an approximate object pose with a 3D bounding box; then, the motion of a UR5 robot arm was planned to bring a multi-fingered Allegro robot hand equipped with OptoForce fingertip force sensors close to the located object. Finally, a haptic exploration procedure was performed, in which the hand touched the object several times with different tentative grasps, without lifting it, while evaluating a force-closure grasp metric at each attempt. The haptic exploration was realized with unscented Bayesian optimization to reduce the number of exploration steps (Nogueira et al., 2016; Castanheira et al., 2018); unscented Bayesian optimization outperformed both standard Bayesian optimization and random exploration, i.e., uniform grid search. Overall, this method made it possible to find safe and robust grasps for unknown objects without any previous learning, but at the cost of requiring considerable time (on the order of minutes) to haptically explore the object before lifting it.

5.3.5 Contact-rich manipulation

While traditional robotic manipulation is largely about avoiding physical contact with the environment that surrounds the objects, human manipulation is to a large extent about exploiting such contacts, as noted by Deimel et al. (2016). Inspired by this observation, and by the presence of several applied examples in industry, such as peg-in-hole insertion tasks (Jiang et al., 2020), the robotics community is showing increased interest in developing robotic solutions for contact-rich manipulation tasks, as summarised by Suomalainen et al. (2022). Clearly, visual perception is not enough for these tasks, and visuo-haptic integration becomes crucial. As a notable example, Lee et al. (2019b) recently proposed a system in which a robotic manipulator learns, via deep reinforcement learning, a control policy that includes sensory feedback from visual (RGB camera), haptic (force/torque sensor), and proprioceptive (motor encoders) sensing. A shared and compact representation of the high-dimensional and heterogeneous multimodal data is learned with a neural network, which is trained to predict optical flow, the presence of contact, and the concurrency of visual and haptic data; this network then provides the sensory feedback used to learn a control policy for a peg insertion task, directly on the real robot. The experiments compare four models: no sensory feedback, vision only, haptics only, and vision and haptics. Interestingly, while the haptics-only model performs as badly as the one with no feedback, because the robot cannot even pick up the peg in most trials, the vision-only model performs the insertion successfully only about 50% of the time, whereas the model with both vision and haptics brings the success rate to about 75%.

6 Discussion and outlook

Visuo-haptic object perception is a vibrant and dynamic field whose development is crucial for new sensing technologies and applications such as robotic grasping, smart prostheses, and surgical robots. This article highlights many foci of ongoing research, from theoretical and biologically inspired approaches, through sensor technologies and data collection, to data processing and applications. However, numerous crucial challenges still need to be overcome. This section summarizes and discusses some of these challenges.

6.1 Biologically-inspired approaches

Regarding biological inspiration, the question for robotics is which bio-inspired sensory and data-processing principles, and in what proportion, can help improve multimodal object recognition across its multiple application areas. Sensor technologies are largely bio-inspired, and there are efforts to incorporate further capabilities, such as measuring humidity, hardness, and viscosity, as well as mimicking other skin properties such as self-healing (Oh et al., 2019). In contrast, perception models in artificial agents are still largely detached from their biological counterparts. While some biological principles, such as integration strategies, have been explicitly studied (e.g., Toprak et al. 2018), others, such as hierarchical processing, input-driven self-organization, and the processing of object properties rather than sensory modalities, are promising directions that should be explored further.

6.2 Sensor technologies

Tactile sensing technologies require advancements in several respects before they can be deployed as easily as cameras. Advancements are needed in areas including, but not limited to: mechanical robustness; flexibility and compliance; a reduction in the number of electrical connections; sensitivity and reliability of the measurements; the capability to detect multiple simultaneous contacts as well as both normal and shear forces; affordability and ease of manufacturing; and ease of electromechanical integration and replacement.

6.3 Data collection and datasets

Collecting tactile data during grasping on a real robot, or correctly simulating tactile sensors for synthetic data generation, are resource-intensive tasks, which is reflected in the availability and size of datasets. While there are many large-scale vision-only datasets for grasping in real-world scenarios or simulation (e.g., Jiang et al. 2011; Levine et al. 2018; Depierre et al. 2018), only a few small-scale visuo-tactile datasets exist. Thus, large-scale multimodal datasets should be created, considering a variety of objects, grasping scenarios, and tactile sensor types. However, data acquisition from tactile sensors still lacks a unified theoretical framework. The challenges here stem from the fact that haptic perception is an intrinsically sequential process; moreover, haptic perception is highly dependent on the robot’s embodiment, which makes generalization to other robots or tasks difficult. In addition to a unified theoretical framework for data acquisition, solving other standing computational challenges, such as representation learning, mapping, and co-learning, appears to be a key enabler for coping with the resource-intensive nature of data acquisition.

Real-world tactile data collection will continue to be the most relevant, and it will also continue to be the most resource-intensive to obtain. In light of recent improvements in simulation approaches (e.g., Wang et al. (2022); Lin et al. (2022)) that allow generating synthetic data from different tactile sensors, and in sim2real transfer (e.g., Josifovski et al. (2018); Jianu et al. (2022); Gao et al. (2022); Josifovski et al. (2022)) for visual, tactile, or proprioceptive sensing, synthetic data is expected to gain popularity. Although synthetic data might not be sufficient on its own, it can be a valuable and effective way to move the field forward when combined with small-scale real-world datasets.

6.4 Multimodal signal processing and applications

With regard to signal processing and applications, even though multimodal visuo-haptic approaches for grasping show better results and have the potential to handle use cases where visual information alone is insufficient, vision-only grasping approaches (e.g., Levine et al. 2018; Mahler et al. 2017; Bousmalis et al. 2018; James et al. 2019) are still more popular. Some reasons for this popularity are that the availability, durability, and understanding of vision sensors are better than those of tactile sensors; moreover, the simulation of vision sensors is easier and more realistic, and the collection, processing, and interpretation of visual information are easier than for tactile sensor readings. At the other end of the spectrum, there are also recent grasping approaches (e.g., Murali et al. 2020; Hogan et al. 2018) that only use tactile information, but such approaches are usually suitable only for limited scenarios or for parts of the grasping process.

Thus, future efforts should concentrate on multimodal approaches. However, as discussed by Xia et al. (2022), the main challenge is ensuring safety during the physical contact between the object and the robot that is necessary for tactile sensing. To avoid hardware dependencies and safety risks, simulations are a promising alternative to real-world training and data collection for learning-based grasping approaches. However, due to their inaccurate nature, simulations cannot completely replace real-world data, although they can significantly reduce the amount needed. Finally, fine-tuning on the real system or sim2real techniques (e.g., Ding et al. 2020; Narang et al. 2021) can help to bridge the simulation-to-reality gap.

Another major problem of data-driven and end-to-end learning grasping approaches is that they require a vast amount of training data, in contrast to humans, who learn and generalize from very few examples. In this regard, future work should concentrate on improving the sample efficiency of the algorithms. One option is to include priors in the learning process; for example, meaningful relations between tactile sensing regions can be incorporated into the model through graph-like structures (e.g., Garcia-Garcia et al. 2019). Another option is combining model-based and model-free techniques for grasping or developing hierarchical and multi-stage approaches. An added benefit of such approaches is that they provide better control over the grasping process and increased interpretability of the model’s behaviour, which is crucial for applications in industrial or collaborative environments alongside humans. Safety is of utmost importance in such environments, and integrating tactile sensors such as robotic skin (Pang et al., 2021) can help improve tasks like grasping, prevent injuries, and enable compliant robot control.

7 Conclusion

This article provides a holistic overview of the current state of visuo-haptic object perception for robotic applications. First, it covers the biological basis of multimodal object perception in humans. Second, it summarizes sensor technologies, data collection strategies, and datasets. Third, it introduces the main challenges of multimodal machine learning, focusing on visuo-haptics. Fourth, it presents an overview of different applications. Finally, it presents a detailed discussion of the above points and future research directions for each of them.

Despite substantial advancements in understanding and development in all these areas, many open challenges remain, from the role of biological inspiration in multimodal object perception, to the material and mechatronic advances required for better tactile sensing technologies, to better multimodal signal processing methodologies.

Covering the entire field of visuo-haptics for both biological and artificial agents in a single article is difficult. Thus, despite not being exhaustive, the holistic approach presented in this article should provide the reader with a unique perspective on the current state and the most pressing challenges that need to be addressed to keep moving the field of visuo-haptic object perception in robotics and its different applications forward.