FuseBot: mechanical search of rigid and deformable objects via multi-modal perception

Mechanical search is a robotic problem where a robot needs to retrieve a target item that is partially or fully occluded from its camera. State-of-the-art approaches for mechanical search either require an expensive search process to find the target item, or they require the item to be tagged with a radio frequency identification (RFID) tag, making their approach beneficial only to tagged items in the environment. We present FuseBot, the first robotic system for RF-Visual mechanical search that enables efficient retrieval of both RF-tagged and untagged items in a pile. Rather than requiring all target items in a pile to be RF-tagged, FuseBot leverages the mere existence of an RF-tagged item in the pile to benefit both tagged and untagged items. Our design introduces two key innovations. The first is RF-Visual Mapping, a technique that identifies and locates RF-tagged items in a pile and uses this information to construct an RF-Visual occupancy distribution map. The second is RF-Visual Extraction, a policy formulated as an optimization problem that minimizes the number of actions required to extract the target object by accounting for the probabilistic occupancy distribution, the expected grasp quality, and the expected information gain from future actions. We built a real-time end-to-end prototype of our system on a UR5e robotic arm with in-hand vision and RF perception modules. We conducted over 200 real-world experimental trials to evaluate FuseBot and compare its performance to a state-of-the-art vision-based system named X-Ray (Danielczuk et al., in: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS), IEEE, 2020). Our experimental results demonstrate that FuseBot outperforms X-Ray's efficiency by more than 40% in terms of the number of actions required for successful mechanical search.
Furthermore, in comparison to X-Ray’s success rate of 84%, FuseBot achieves a success rate of 95% in retrieving untagged items, demonstrating for the first time that the benefits of RF perception extend beyond tagged objects in the mechanical search problem.


Introduction
There has been increasing interest in robotic systems that can find and retrieve occluded items in unstructured environments such as warehouses, retail stores, homes, and manufacturing (Danielczuk et al., 2019, 2020; Boroushaki et al., 2021a, b; Huang et al., 2020). For example, in Fig. 1, the robot is tasked with retrieving a target item that is fully hidden under a pile. Prior approaches to this problem fall into two categories. The first relies entirely on vision-based perception, iteratively manipulating the pile to uncover the target. While this category of systems can perform well on relatively small piles, they become inefficient in complex scenarios with larger or multiple piles. The second category of systems leverages radio frequency (RF) perception in addition to vision-based perception (Boroushaki et al., 2021a, b; Wang et al., 2013). Unlike visible light and infrared, RF signals can go through standard materials like cardboard, wood, and plastic. Thus, recent systems have leveraged RF signals to locate fully occluded objects tagged with widely-deployed, passive, 3-cent RF stickers (called RFIDs). By identifying and locating the RFID-tagged target items through occlusions, these systems can make the mechanical search process much more efficient. However, the benefits of existing systems in this category are restricted to scenarios where all target items are tagged, thus providing limited benefit in more common scenarios where only a subset of items is tagged with RFIDs.

[Fig. 1 caption: RF-visual mechanical search. FuseBot uses RF and visual sensor data (from a wrist-mounted camera and antenna) to perform mechanical search and extract occluded target items from piles of both RFID-tagged and non-tagged items.]
In this paper, we ask the following question: Can we design a robotic system that performs efficient RF-Visual mechanical search for both RF-tagged and non-tagged target objects? Specifically, rather than requiring all items to be RF-tagged, we consider more realistic and practical scenarios where only a subset of items is tagged, and ask whether one can improve the efficiency of retrieving non-tagged target items by leveraging RF perception. A positive answer to this question would extend the benefits of RF perception to new application scenarios, such as those where the target item cannot be tagged with inexpensive RFIDs (e.g., metal tools and liquid bottles), and instances where the robot is presented with piles of items that are not fully tagged.
We present FuseBot, a robotic system that can efficiently find and extract tagged and non-tagged items in line-of-sight, non-line-of-sight, and fully occluded settings. Similar to past work that leverages RF perception, FuseBot uses RF signals to identify and locate RFID tags in the environment with centimeter-scale precision. Unlike past systems, it can efficiently extract both non-tagged and tagged items that are fully occluded. As shown in Fig. 1, FuseBot integrates a camera and an antenna into its robotic arm and leverages the robot's movements to locate RFIDs, model unknown/occluded regions in the environment, and efficiently extract target items from under a pile, independent of whether or not they are tagged with RFIDs.
The key intuition underlying FuseBot's operation is that knowing where an RFID-tagged item is within a pile provides useful information about the pile's occupancy distribution and allows the robot to significantly narrow down the candidate locations of non-tagged items. In its simplest form, knowledge of where an RFID-tagged item is within a pile negates the possibility of another item occupying the same location. Since the in-hand antenna allows the robot to localize all RFID tags in a pile, the robot can leverage this knowledge to narrow down the likely locations of a non-tagged target item, and thus plan efficient retrieval policies for these items.
Translating this high-level idea into a practical system is challenging. While the in-hand antenna can locate each RFID as a single point in 3D space, it cannot recover the 3D volumetric occupancy map of the object the RFID is attached to. Since an RFID is attached to the object's surface and not at its center, there is uncertainty about both the position and orientation of the tagged item. The problem is further complicated by the fact that retrieving an occluded item involves manipulating the environment (e.g., by removing occluding objects to uncover the target). Here, uncertainty about the target object's location makes it difficult to identify the manipulation actions that most efficiently reveal and extract the target.
FuseBot introduces two key components that together allow it to overcome the above challenges:

(a) RF-Visual Mapping FuseBot's first component constructs a probabilistic occupancy map of the target item's location in the pile by fusing information from the robot's in-hand camera and RF antenna, as shown in Fig. 2a. This component localizes the RFIDs in the pile and applies a conditional (shape-aware) RF kernel to construct a negative 3D probability mask, as shown in the red regions of Fig. 2b. By combining this information with its visual observation of the 3D pile geometry (shown in Fig. 2c), as well as prior knowledge of the target object's geometry, FuseBot creates a 3D occupancy distribution, shown as a heatmap in Fig. 2d, where red indicates high probability and blue indicates low probability for the target item's location. In this example, it is worth noting how the probability of the occluded target item is lower near the locations of RFID-tagged objects. Section 4 describes this component in detail, including how it also leverages the geometry of the tagged items and the pile.

[Fig. 2 caption, partial: (d) the occupancy distribution map, visualized as a heat map; (e) FuseBot performs instance segmentation of the objects in the environment using the depth information from the camera; (f) FuseBot optimizes its extraction strategy by integrating the 3D occupancy distribution over each of the object segments and efficiently retrieves the target.]

(b) RF-Visual Extraction Policy After computing the 3D occupancy distribution, FuseBot needs an efficient extraction policy to retrieve the target item. Extraction is a multi-step process that involves removing occluding items and iteratively updating the occupancy distribution map. To optimize this process, we formulate extraction as a minimization problem over the expected number of actions that takes into account the expected information gain, the expected grasp success, and the probability distribution map. To efficiently solve this problem, FuseBot performs depth-based instance segmentation, as shown in Fig. 2e. The segmentation allows it to integrate the 3D occupancy distribution over each of the object segments and identify the optimal next-best-grasp, as we describe in detail in Sect. 5.
We implemented a real-time end-to-end prototype of FuseBot with a Universal Robots UR5e (Universal Robots, 2021) and a Robotiq 2F-85 gripper (Robotiq, 2019). As shown in Fig. 1, we mount an Intel RealSense D415 depth camera (Intel RealSense, 2019) and log-periodic antennas on the wrist of the robotic arm. Our implementation localizes the RFIDs by processing measurements obtained from the log-periodic antennas using BladeRF software radios (Nuand, 2021).
We ran over 200 real-world experimental trials to evaluate FuseBot. We compared our system to a state-of-the-art system called X-Ray (Danielczuk et al., 2020), which computes a 2D occupancy distribution based on an RGB-D image. Our evaluation demonstrates the following:

• FuseBot can efficiently retrieve complex, non-tagged items in line-of-sight and fully occluded settings, across different target objects and numbers of RFID tags. It succeeds in 95% of trials across a variety of scenarios, while X-Ray was able to extract the target item in 84% of the scenarios.

• In scenarios where both FuseBot and X-Ray succeed in mechanical search, FuseBot improves the efficiency of extraction by more than 40%. Specifically, it reduces the median number of actions needed for successful retrieval from 5 to 3, and the 90th percentile from 11 to 6.

• Our results also demonstrate that the efficiency gains from FuseBot's RF-Visual mechanical search increase with the number of tagged items in the environment, reaching as much as a 2.5× improvement over X-Ray in environments where 25% of (non-target) items are RF-tagged, and a 4× improvement when the target item itself is tagged.
Contributions FuseBot is the first system that enables mechanical search and extraction of both tagged and non-tagged items in non-line-of-sight and fully-occluded settings. The system introduces two new primitives, RF-Visual Mapping and RF-Visual Extraction, to enable RF-Visual scene understanding and efficient retrieval of target items. The paper also contributes a real-time end-to-end prototype implementation of FuseBot, and an evaluation that demonstrates the system's practicality, efficiency, and success rate in challenging real-world environments.

Related work
Interest in the problem of mechanical search dates back to research that recognizes objects through or around partial occlusions via active and interactive perception. Researchers explored the use of perceptual completion to identify partially occluded objects (Huang et al., 2012; Price et al., 2019), and developed systems that perform active perception, whereby a robot moves a camera around the environment in order to search for items that are partially visible (Aydemir et al., 2011; Bajcsy, 1988; Bohg et al., 2017). Other areas of research focused on efficiently grasping partially occluded objects using physics-based planners (Dogar et al., 2012). While these works made significant progress on the task of finding and retrieving partially occluded objects, they do not extend to mechanical search scenarios where the target object is fully occluded.

Over the past few years, there has been rising interest in the mechanical search problem for fully occluded objects, whereby the robot actively manipulates the environment to uncover target objects. The majority of systems for mechanical search rely entirely on vision, and employ heuristics or knowledge of the pile structure to inform the search process. For example, recognizing that mechanical search is a multi-step retrieval process, pioneering research in this space used a heuristic-based approach that removes larger items in the environment to uncover the largest area and maximize information gain at each step (Danielczuk et al., 2019). More recent work considers the structure of the pile and constrains the potential target item locations by leveraging the geometry of both the pile and the target object (Danielczuk et al., 2020). Other work has also looked at lateral search, where objects are retrieved from the side rather than from a pile (Huang et al., 2020; Avigal et al., 2021). One of the main challenges of this vision-based approach to mechanical search is that as piles become larger and more complex, the uncertainty grows and the systems become more inefficient. FuseBot builds on this line of research to perform efficient mechanical search of fully-occluded objects, and outperforms state-of-the-art vision-based systems (as we demonstrate empirically in Sect. 7), especially in the presence of RFID-tagged items.
Most recently, researchers have explored the use of RF perception to address the mechanical search problem (Boroushaki et al., 2021a, b; Wang et al., 2013). This research was motivated by recent advances in RF localization, which have enabled locating cheap, passive, widely-deployed RF tags (called RFID tags) with centimeter-scale accuracy, even through occlusions (Ma et al., 2017; Wang & Katabi, 2013; Luo et al., 2019). Thus, by tagging the target object with an RFID, researchers have demonstrated the potential to perform efficient mechanical search by directly locating the target RFID-tagged item in a pile, bypassing the exhaustive search altogether. However, these past systems require the target item to be tagged with an RFID to enable efficient mechanical search and retrieval. Our work is motivated by this line of work, and is the first to bring the benefits of RF perception to non-tagged target items by leveraging the mere existence of RFID-tagged items in the pile.

System overview
We consider a general mechanical search problem where a robot is tasked with retrieving a target item from a pile. The target item may be unoccluded, partially occluded, or fully occluded from the robot's camera.
We focus on scenarios where one or more items in the pile are tagged with UHF RFID (Radio Frequency IDentification) tags, but where the target item itself does not need to be tagged. We assume that the robot has a database with the shapes of all RFID-tagged items; such a database may be provided by the items' manufacturers. The robot is a 6-DOF manipulator with a camera and an antenna mounted on its wrist, and we assume that the target item is kinematically reachable by the robotic arm from its fixed base.
FuseBot's objective is to extract the target(s) from the environment using the smallest number of actions. It starts by using its wrist-mounted antenna to wirelessly identify and locate all RFIDs in the pile, even if they are in non-line-of-sight. Using the RFID locations and its visual observation of the pile geometry, it performs RF-Visual mechanical search in two key steps. The first is RF-Visual Pile Mapping, where FuseBot creates a 3D probability distribution of the target object's location within the pile. The second is RF-Visual Extraction, where the robot uses the probability distribution and its scene understanding to perform the next-best grasp. The next two sections describe these steps in detail.

RF-visual pile mapping
In this section, we explain how FuseBot creates a 3D occupancy distribution of a target item's location in a pile. The process of RF-Visual mapping consists of four key steps: the robot first constructs separate visual and RF maps, then fuses them together, and finally folds in information about the target object's geometry. For clarity of exposition, we focus our discussion on scenarios where the target item is both occluded and non-tagged, and discuss at the end of the section how this technique generalizes to unoccluded and/or tagged items.

Visual uncertainty map
The first step of RF-Visual pile mapping involves constructing a 3D visual uncertainty map of the environment. This map is important for identifying all candidate locations of an occluded object. To create the visual uncertainty map, the robot moves its downward-pointing wrist-mounted camera above the pile to cover the workspace. It follows a simple square-based trajectory in a plane parallel to the table that holds the pile, similar to past work that constructs point clouds of piles (Boroushaki et al., 2021b).
FuseBot combines the visual information obtained during its trajectory using an Octomap structure (Hornung et al., 2013). The structure represents the 3D workspace as a voxel grid. Using depth information and the position of the camera, FuseBot can determine whether each voxel in the environment is visible to the camera (the surface of the pile and table), free space (the air), or occluded (e.g., under the pile or table). Formally, it creates a 3D uncertainty matrix C(x, y, z) as follows:

$$C(x, y, z) = \begin{cases} 1 & \text{if voxel } (x, y, z) \text{ is occluded or unexplored} \\ 0 & \text{if voxel } (x, y, z) \text{ is visible or free space} \end{cases}$$

Here, the higher value (i.e., 1) represents more uncertainty. It is worth noting that, in this representation, both unexplored and occluded regions are considered uncertain.
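As a minimal sketch of this step, the voxel labeling can be approximated from a top-down height map of the pile instead of a full Octomap with ray casting; the function name and grid layout below are our own, not the paper's implementation:

```python
import numpy as np

def visual_uncertainty_map(height_map, grid_shape, voxel_size):
    """Approximate the 3D uncertainty matrix C(x, y, z): voxels below the
    observed pile surface are uncertain (1); the surface itself and the
    free space above it are certain (0)."""
    C = np.zeros(grid_shape)
    nx, ny, _ = grid_shape
    for x in range(nx):
        for y in range(ny):
            # number of occluded voxels hidden under the observed surface
            surface_z = int(round(height_map[x, y] / voxel_size))
            C[x, y, :surface_z] = 1.0
    return C
```

In the full system, the Octomap additionally distinguishes unexplored voxels outside the camera's coverage, which are treated as uncertain as well.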
As an example, consider the sample scenario shown in Fig. 1. This scenario consists of two piles with three RFID-tagged items, where the target item is a toy (the stuffed red turtle shown in the top center) hidden under a pile. The visual uncertainty map is depicted as a heatmap in Fig. 3a. Here, we can see that the regions under the surface of the piles have a high probability (red) of containing the target object.

RF localization
So far, we have explained how FuseBot constructs a 3D uncertainty map based on the camera's depth information. Next, we explain how it accurately localizes RFIDs to gain more information about the environment. For simplicity, we first describe the localization of a single tag, then describe how we support multiple tags. Our localization system follows three steps:

Step (1) Measuring RFID response First, recall that FuseBot has a wrist-mounted antenna which it uses to perform RF perception. The antenna is used to read and localize RFID tags in the pile. When the antenna transmits radio frequency signals, passive RFID tags harvest energy from this signal to power up and respond with their own identifier. FuseBot then uses these responses to estimate the wireless channel, which contains information about the tag's location. Formally, if an RFID transmits a signal x(t) and the received signal is y(t), one can estimate the wireless channel $\hat{h}(f_i)$ as:

$$\hat{h}(f_i) = \frac{y(t)}{x(t)}$$

The above describes the channel estimation at a single frequency $f_i$. FuseBot repeats this process at multiple frequencies to obtain $\{\hat{h}(f_i)\}_i$.

Step (2) Leveraging robot mobility for localization Since channel measurements from a single location are not enough to localize an RFID tag in 3D space, FuseBot leverages robotic mobility to collect measurements from different vantage points and combines them to localize the tag. Since FuseBot already requires a scan of the environment to build the visual uncertainty map in Sect. 4.1, we leverage this motion and continuously collect RFID channel measurements as the robot moves, allowing us to collect a set of measurements:

$$\{\hat{h}(f_i, p^a_k)\}_{i,k}$$

where $p^a_k$ is the location of the antenna at the time of the kth measurement. Figure 4 schematically shows the robot moving and collecting RF measurements in order to localize an RFID tag that is hidden under a pile. The red dotted lines illustrate the RF signals that are transmitted from the wrist-mounted antenna to the RFID tag and then received back by the wrist-mounted antenna. Remember that, unlike visible light, RF signals can traverse occlusions, and, as a result, the RF channel can be estimated even when the RFID tag is under the pile.
Step (3) Combining measurements Finally, given these measurements, the robot can combine them using a technique called Synthetic Aperture Radar (SAR) (Curlander & McDonough, 1991). This localization method combines measurements across space and frequency (i.e., $\{\hat{h}(f_i, p^a_k)\}_{i,k}$) to estimate the probability of the tag being at each point in 3D space. This can be done using the following equation (Curlander & McDonough, 1991):

$$P(p) = \left|\sum_{i,k} \hat{h}(f_i, p^a_k)\, e^{j 2\pi f_i d(p,\, p^a_k)/c}\right| \qquad (1)$$

where P(p) is the estimated probability at point p, $p^a_k$ is the antenna's position at the time of the kth measurement, $d(p, p^a_k)$ is the round-trip distance from point p to point $p^a_k$, and c is the speed of light.
The final tag location is then estimated to be the location in space with the highest probability:

$$p_{RFID} = \arg\max_{p} P(p) \qquad (2)$$

where $p_{RFID}$ is the estimated location of the tag.
To extend this to any number of RFIDs, we modify step 2 as follows. Instead of continuously reading one RFID, we estimate the channels of all RFID tags in the environment sequentially as the robot is moving. This allows us to collect a set of measurements for each RFID. We then recompute Eqs. 1 and 2 for each RFID in the environment.
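The three localization steps above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function names and array layouts are our own, and the channel model assumes ideal, noise-free phase measurements.

```python
import numpy as np

C_LIGHT = 3e8  # speed of light (m/s)

def sar_probability(channels, freqs, antenna_positions, candidate_points):
    """Coherently combine channel estimates h(f_i, p_k) across frequencies
    and antenna positions, as in Eq. 1. channels: (K, F) complex array,
    freqs: (F,), antenna_positions: (K, 3), candidate_points: (N, 3)."""
    # round-trip distance from each candidate point to each antenna position
    d = 2 * np.linalg.norm(
        candidate_points[:, None, :] - antenna_positions[None, :, :], axis=2)  # (N, K)
    # compensate the round-trip phase at each frequency, then sum coherently
    phase = np.exp(1j * 2 * np.pi * freqs[None, None, :] * d[:, :, None] / C_LIGHT)
    return np.abs((channels[None, :, :] * phase).sum(axis=(1, 2)))  # (N,)

def localize_tag(channels, freqs, antenna_positions, candidate_points):
    """Eq. 2: the tag estimate is the candidate point with maximal probability."""
    P = sar_probability(channels, freqs, antenna_positions, candidate_points)
    return candidate_points[np.argmax(P)]
```

At the true tag location, the compensating phase cancels the measured channel phase, so all terms add constructively and the sum peaks there.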
Finally, it is worth noting that wireless noise may lead to localization errors. FuseBot's design incorporates a confidence metric (described below) to identify and mitigate such errors. Specifically, if the confidence metric is low, the system can choose either to ignore the corresponding tag altogether or to take more RF measurements that enable it to increase its localization confidence.
To understand whether we have confidently localized an RFID, we leverage information from the probability computed in Eq. 1. For simplicity of exposition, we demonstrate this idea in Fig. 5, which shows a two-dimensional heatmap of the probability, where yellow indicates a higher likelihood of the RFID being located at that location and blue indicates a lower likelihood. We consider two cases. In Fig. 5a, the heatmap shows a small area of high probability surrounding the tag's location (denoted by a green x), which indicates a high level of confidence in the RFID location. On the other hand, Fig. 5b shows a case with a large area of yellow, so FuseBot has low confidence in the location of the RFID.
To quantify this phenomenon, FuseBot computes the bounding box around the area of the heatmap that is within 0.75 dB (∼84%) of the peak value (shown by δx and δy in Fig. 5). When these dimensions fall below a threshold, FuseBot declares the RFID confidently localized. Formally, FuseBot's criterion for declaring a successful RFID localization is:

$$\delta_x < \tau_x \quad \text{and} \quad \delta_y < \tau_y \quad \text{and} \quad \delta_z < \tau_z$$

where $\delta_x$, $\delta_y$, and $\delta_z$ are the x, y, and z dimensions of the bounding box around the region where $P(p) > 0.84 \max_p P(p)$, and $\tau_x$, $\tau_y$, and $\tau_z$ are the thresholds in the x, y, and z dimensions, respectively.
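A minimal sketch of this confidence check, assuming the probability from Eq. 1 has been evaluated on a voxel grid (the function name and grid-based bounding-box computation are our own):

```python
import numpy as np

def confidently_localized(P_grid, thresholds, voxel_size, peak_fraction=0.84):
    """Return True if the bounding box of the region within ~0.75 dB (84%)
    of the peak of P is smaller than the per-axis thresholds (tau_x, tau_y,
    tau_z), i.e., the probability mass is tightly concentrated."""
    mask = P_grid > peak_fraction * P_grid.max()
    idx = np.argwhere(mask)  # voxel indices above the threshold
    # physical extent of the bounding box in each dimension
    extent = (idx.max(axis=0) - idx.min(axis=0) + 1) * voxel_size
    return bool(np.all(extent < np.asarray(thresholds)))
```

A sharply peaked grid passes the check, while a diffuse one fails it, matching the two cases illustrated in Fig. 5.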

RF certainty map
Next, we explain how FuseBot leverages the estimated RFID locations from the above section to construct a certainty map based on RF measurements.
FuseBot uses the RFID tag locations to identify regions in the pile that the target item is less likely to occupy, since they are occupied by the RFID-tagged items (rather than the non-tagged target item). A key challenge here is that the system can only recover the RFID tag's location as a single point in 3D space. Since an RFID is attached to the surface of the tagged item, there remains nontrivial uncertainty about the orientation and exact position of the item in the pile (as it may occupy a non-trivial region in the near vicinity of the localized tag).

RF Kernel FuseBot encodes the uncertainty about the RFID-tagged object's location by constructing a 3D RF kernel that leverages the known dimensions of the tagged object. The RF kernel is modeled as a 3D Gaussian, centered at the RFID tag, and masked with a sphere whose radius is equal to the longest dimension of the tagged item. The spherical mask represents an upper bound on the furthest distance from the tag that the object can occupy. Formally, we represent the RF kernel through the following equation:

$$m(p, p_{RFID}) = \begin{cases} -\exp\left(-\dfrac{\|p - p_{RFID}\|_2^2}{2\sigma^2}\right) & \text{if } \|p - p_{RFID}\|_2 \le d_l \\ 0 & \text{otherwise} \end{cases}$$

where p is the point where we are evaluating the kernel, $p_{RFID}$ is the location of the RFID, $\sigma = d_s/2$ is the variance of the Gaussian, $d_s$ and $d_l$ are the shortest and longest dimensions of the RFID-tagged object's bounding box, respectively, and $\|\cdot\|_2$ represents the L2 norm. Here, it is worth noting that the negative sign represents the negative likelihood for the target item to occupy the corresponding region.
In the presence of multiple RFID-tagged items, the RF certainty map is a linear combination of all RF kernels:

$$R(p) = \sum_{i=1}^{N} m(p, p_i)$$

where N is the number of RFID-tagged items in the environment, $p_i$ is the ith RFID location, and $m(p, p_i)$ is the ith RF kernel.
The RF certainty distribution for the example scenario (described in Fig. 1) is shown in Fig. 3b. Since there are three RFID-tagged items in the pile, the figure shows three spherical regions that represent the Gaussians centered at each of the localized RFIDs.

RF-Visual Uncertainty Map Given both the visual uncertainty map and the RF certainty map, FuseBot constructs an RF-Visual uncertainty map by adding the two maps voxel-wise (i.e., C + R). In the above example with two piles and three RFID-tagged items, Fig. 3c shows the resulting RF-Visual uncertainty map. Notice how, by applying the RF kernels as a negative mask to the voxel grid values, FuseBot folds the certainty gained from RF into the uncertainty from the visual information.
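The RF kernel and the voxel-wise combination C + R can be sketched as follows. The variance σ = d_s/2 follows the value the paper quotes later for the spherical kernel; the function names and data layout are illustrative:

```python
import numpy as np

def rf_kernel(points, p_rfid, d_s, d_l):
    """Negative Gaussian centered at the tag, masked by a sphere of radius
    d_l (the tagged object's longest dimension); sigma = d_s / 2."""
    dist = np.linalg.norm(points - p_rfid, axis=-1)
    sigma = d_s / 2.0
    k = -np.exp(-dist**2 / (2 * sigma**2))
    k[dist > d_l] = 0.0  # the object cannot extend beyond d_l from its tag
    return k

def rf_certainty(points, tags):
    """R(p): linear combination of the RF kernels of all localized tags.
    tags: iterable of (p_rfid, d_s, d_l) tuples."""
    return sum(rf_kernel(points, p, d_s, d_l) for (p, d_s, d_l) in tags)

def rf_visual_uncertainty(C, R):
    """Voxel-wise sum of the visual uncertainty and RF certainty maps."""
    return C + R
```

Because the kernel values are negative, adding R to C lowers the uncertainty (and hence the target likelihood) in regions already occupied by tagged items.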

RF-visual occupancy distribution map
So far, we have described how FuseBot constructs a 3D probability distribution of possible locations of the target item by fusing RF and visual information. Next, we describe how FuseBot also leverages the target item's size and shape to further improve the occupancy distribution map. Intuitively, the target's size constrains the potential regions it can occupy in the occluded region since, for example, larger targets cannot fit into narrow regions of the pile.
To fold the target size into the distribution, FuseBot employs a similar approach to the RF kernel described in Sect. 4.3. Specifically, it creates a target occupancy kernel that summarizes all the possible orientations of a target object using the following target Gaussian kernel:

$$k(p) = \begin{cases} \exp\left(-\dfrac{\|p\|_2^2}{2\sigma^2}\right) & \text{if } \|p\|_2 \le d_l/2 \\ 0 & \text{otherwise} \end{cases}$$

where p is the point where we are evaluating the kernel (relative to the kernel's center), $\sigma = d_s/2$, $d_s$ and $d_l$ are the shortest and longest dimensions of the target object's bounding box, respectively, and $\|\cdot\|_2$ represents the L2 norm. [4] To combine the geometric data from this target Gaussian kernel with the previously computed RF-Visual uncertainty map, FuseBot performs a 3D convolution of the RF-Visual uncertainty map and the target's Gaussian kernel. Intuitively, after convolution, the regions that can fit the item of interest in more possible orientations will have voxels with higher weights than other regions of the unknown environment. Hence, the resulting 3D occupancy distribution encodes the visual uncertainty, the RFID-tagged items, and the shape and size of the target item.
Figure 3d shows the resulting RF-Visual occupancy distribution from this convolution operation (for the scenario described earlier in Fig. 1). Notice that in this distribution, regions near the RFID tags, as well as those near the edge of the pile, have lower probabilities (blue/white) than other regions in the pile.
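The target-kernel convolution described above can be sketched with SciPy. The kernel construction details, including σ = d_s/2 for the target kernel, are our assumptions:

```python
import numpy as np
from scipy.ndimage import convolve

def target_kernel(d_s, d_l, voxel_size):
    """Positive Gaussian summarizing the target's possible extent, masked
    at radius d_l / 2 (the kernel is defined from the object's center)."""
    r = int(np.ceil((d_l / 2) / voxel_size))
    ax = np.arange(-r, r + 1) * voxel_size
    X, Y, Z = np.meshgrid(ax, ax, ax, indexing="ij")
    dist = np.sqrt(X**2 + Y**2 + Z**2)
    k = np.exp(-dist**2 / (2 * (d_s / 2) ** 2))
    k[dist > d_l / 2] = 0.0
    return k

def occupancy_distribution(rf_visual_map, d_s, d_l, voxel_size):
    """3D convolution of the RF-Visual uncertainty map with the target
    kernel; regions that fit the target in more orientations score higher."""
    return convolve(rf_visual_map, target_kernel(d_s, d_l, voxel_size),
                    mode="constant", cval=0.0)
```

After the convolution, voxels deep inside large uncertain regions receive more weight than voxels at the fringe of the pile, reproducing the edge effect visible in Fig. 3d.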

Generalizing to other scenarios
Our discussion so far has focused on the case of a fully-occluded non-tagged target item. The method can be generalized to other scenarios in a number of ways:

Tagged target object
In scenarios where the target object is tagged with an RFID and is not in line of sight, FuseBot uses the computed RF kernel to build the occupancy distribution of the RFID-tagged target object. In this case, the RF kernel is positive and the visual uncertainty map is ignored. Since FuseBot then knows where the target object is, it declutters the environment efficiently to extract the target object.

Unoccluded target object
In cases where the target object is unoccluded (or partially occluded), FuseBot can leverage prior approaches for identification and grasping to retrieve the target item from the pile (Chen et al., 2020; Danielczuk et al., 2019; Krizhevsky et al., 2012; Liu & Deng, 2015).

[Footnote 4: One interesting difference between the RF kernel and the target kernel is that the RF kernel is larger since the RFID tag is on the surface of the object, while the target item kernel is defined from the object's center (d_l for the RF kernel vs. d_l/2 for the target kernel).]

Deformable RFID tagged objects
In principle, FuseBot's probabilistic approach described so far allows it to operate with deformable objects. However, to further improve the efficiency for such objects, we designed a more advanced model. Recall that, from the RFID database, FuseBot knows whether an RFID-tagged object is deformable or rigid. When a deformable RFID-tagged object is present under a pile, it is likely to compress, changing the object's dimensions. This compression causes the object to deviate from the model of the existing RF kernel. FuseBot leverages this observation to update the RF kernel for such deformable objects. Specifically, instead of using the spherical RF kernel described in Sect. 4.3, which is more representative of rigid objects whose dimensions are fixed, we introduce a Deformable RF Kernel.
We demonstrate this concept in Fig. 6. Figure 6a shows the RFID-tagged object before it was deformed. Figure 6b shows the same RFID-tagged object under a pile, deformed due to the weight of the rest of the pile. Figure 6c shows the original spherical RF kernel with variance σ = d_s/2 (as described in Sect. 4.3), with blue indicating more negative and red indicating more positive probability. This RF kernel is overlaid with the compressed deformable object that the kernel is attempting to model. In this case, the model poorly aligns with the object. Instead, Fig. 6d shows the new deformable RF kernel. The variances of the Gaussian are updated to create an elliptical kernel, better matching the expected shape of the object.
Formally, we first define a deformation factor for the RFID-tagged object, α(ρ, z) ∈ [0, 1], which estimates how deformed the object is. Here, α(ρ, z) = 0 represents a fully deformed object and α(ρ, z) = 1 represents a non-deformed object, where ρ ∈ {0, 1} is 1 if the object is rigid and 0 if the object is deformable, z is the height of the RFID location from the table surface, and z_max is the maximum height of the pile directly above the RFID tag location. Rigid objects always have α = 1, while deformable objects have α closer to 0 the more pile material sits above them. Then, we define the deformable RF kernel as:

$$m_d(p, p_{RFID}) = -\exp\left(-\tfrac{1}{2}\,(p - p_{RFID})^\top \Sigma^{-1} (p - p_{RFID})\right)$$

where

$$\Sigma = \begin{pmatrix} \sigma_x^2 & 0 & 0 \\ 0 & \sigma_y^2 & 0 \\ 0 & 0 & \sigma_z^2 \end{pmatrix}$$

is the covariance matrix, and σ_x, σ_y, and σ_z are the variances in the x, y, and z dimensions, respectively, set according to the deformation factor α and the tagged object's dimensions.
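Under the stated boundary conditions (α = 1 for rigid or uncompressed objects, α = 0 for fully deformed ones), one possible sketch uses a linear interpolation. Both the interpolation and the variance scaling below are our assumptions, not the paper's exact definitions:

```python
import numpy as np

def deformation_factor(rho, z, z_max):
    """Hypothetical deformation factor alpha(rho, z): 1 for rigid objects
    (rho = 1); for deformable objects (rho = 0), assumed to fall linearly
    from 1 (tag near the top of the pile) to 0 (tag at the table under the
    full pile height)."""
    if rho == 1:
        return 1.0
    return float(np.clip(z / z_max, 0.0, 1.0))

def deformable_rf_kernel(points, p_rfid, d_s, d_l, alpha):
    """Elliptical negative Gaussian: a compressed object (small alpha) is
    assumed to shrink vertically and spread horizontally."""
    alpha = max(alpha, 0.25)          # keep the ellipsoid non-degenerate
    sigma_xy = (d_s / 2.0) / alpha    # wider in x, y as the object flattens
    sigma_z = (d_s / 2.0) * alpha     # shorter in z as the object flattens
    diff = points - p_rfid
    q = ((diff[..., 0] ** 2 + diff[..., 1] ** 2) / (2 * sigma_xy**2)
         + diff[..., 2] ** 2 / (2 * sigma_z**2))
    k = -np.exp(-q)
    k[np.linalg.norm(diff, axis=-1) > d_l] = 0.0  # spherical upper bound
    return k
```

With α < 1, the kernel's negative mass decays faster along z than along x and y, matching the elliptical kernel of Fig. 6d.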

RF-visual extraction policy
In the previous section, we explained how FuseBot builds a 3D RF-Visual occupancy distribution for a target item's location. Given this distribution, one might think that the robot could immediately move towards the voxel with the highest probability to extract the target object. However, since the target object is fully occluded, the robot cannot directly access it. Instead, it must first remove anything covering the target object. In this section, we describe FuseBot's RF-Visual extraction policy, which decides which object to remove in order to most efficiently extract the target object.

The goal of designing the extraction policy is to minimize the overall number of actions required to retrieve the target object. If the robot were certain of the target item's location, it could simply remove anything covering the object, then extract it. However, while FuseBot leverages RF-Visual perception to minimize uncertainty, the occupancy distribution may still have multiple areas of high probability, leaving ambiguity in the target item's location. One could think of moving towards the region with the highest probability and searching for the target object there until the robot either finds the object or eliminates the search area. However, this may result in an inefficient search, especially in complex scenarios where there are multiple large piles. Thus, to enable efficient retrieval, FuseBot needs an extraction policy that leverages not only the probability distribution of the target item's location but also the expected information gain of a given action and the likelihood of a successful grasp.
At the core of enabling an efficient retrieval policy is identifying the next best object to grasp. To this end, FuseBot transforms its voxel-based representation of the environment into an object-based representation, which assigns an expected gain to grasping each of the visible objects. To do this, FuseBot performs instance segmentation, which gives the mask and surface area of each visible object in the scene, as shown in Fig. 7a. Next, in Fig. 7c, it vertically projects all the voxels below a given mask onto the mask and integrates over the mask area. In principle, this provides the total utility of extracting the corresponding item (including both the probability distribution and the information gain).
Note, however, that simply projecting all the probability below an object onto its surface assumes that removing that object would reveal all the voxels below it. In practice, this is not true because the object has only a limited thickness. While FuseBot does not know the thickness of each item, we can safely assume that voxels near the top of the pile are more likely to be revealed when an object is removed. To bias the search toward this information gain, FuseBot applies a weighting function that increases the weights of voxels closer to the surface of the pile. The sum of these weighted probabilities, i.e., the score of each mask, now accounts for both the information gain and the probable target locations for each visible object. The score is formalized as:

s_i = Σ_{(x,y) ∈ m_i} Σ_{z=0}^{z_{m_i}} γ^{z_{m_i} − z} p_{x,y,z}

where s_i is the score of mask i, m_i is the set of all (x, y) points contained within the ith mask, z_{m_i} is the maximum z under the ith mask, p_{x,y,z} is the probability from the occupancy distribution for point (x, y, z), and γ is the discount factor for weighting the probability.

Incorporating Grasp Quality. While these scores incentivize both exploiting the probability distribution and maximizing information gain, they do not account for the likelihood of failed grasping attempts. To address this, FuseBot computes the probability of a successful grasp for each point in the environment using a grasp planning network. FuseBot then selects the best possible grasp within each object mask.
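A minimal sketch of this per-mask scoring (the discount value and the dense-array data layout are illustrative choices, not the paper's):

```python
import numpy as np

def mask_score(prob, mask_xy, z_top, gamma=0.8):
    """Score of one object mask: sum of occupancy probabilities under
    the mask, with each voxel discounted by gamma for every layer it
    sits below the pile top, so voxels near the surface weigh more.
    prob: 3D array indexed [x, y, z] of occupancy probabilities.
    mask_xy: iterable of (x, y) indices inside the mask.
    z_top: highest occupied z index under this mask."""
    score = 0.0
    for (x, y) in mask_xy:
        for z in range(z_top + 1):
            score += (gamma ** (z_top - z)) * prob[x, y, z]
    return score
```

With gamma = 1 this reduces to the plain projection described above; smaller gamma biases the score toward voxels likely to be revealed by one removal.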
The grasp quality of each mask is formalized as:

g_i = max_{(x,y) ∈ m_i} g(x, y)

where g_i is the best grasp probability for the ith mask, g(x, y) is the grasp probability for point (x, y) given by the grasping network, and m_i is the set of all (x, y) points contained within the ith mask.
FuseBot now uses the grasp qualities and mask scores to find the optimal extraction target by optimizing:

i* = argmax_i  s_i · ⌈g_i − τ⌉

where i is the mask index, τ is the threshold for acceptable grasp quality, g_i and s_i are the grasp quality and the score for the ith mask, and ⌈·⌉ is the ceiling function. FuseBot first evaluates objects with a grasp quality greater than τ, selecting the object with the best weighted probability score. If no high-probability grasps are available, it then selects the object with the best score regardless of grasp quality. The overall algorithm is summarized in Alg. 1.
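This selection rule can be sketched in a few lines; note how the ceiling of g_i − τ acts as a 0/1 gate, since both quantities lie in [0, 1] (the threshold value below is a placeholder):

```python
import math

def select_mask(scores, grasp_qualities, tau=0.5):
    """Pick the index of the mask to grasp. ceil(g - tau) is 1 when
    g > tau and 0 otherwise, so the gated score keeps only masks with
    an acceptable grasp; if none qualify, fall back to the best raw
    score regardless of grasp quality."""
    gated = [s * math.ceil(g - tau) for s, g in zip(scores, grasp_qualities)]
    if max(gated) > 0:
        return gated.index(max(gated))
    return scores.index(max(scores))
```

For example, a mask with a mediocre score but a reliable grasp is preferred over a higher-scoring mask whose best grasp falls below τ.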
A few additional points are worth noting:
• Since the workspace may be larger than the field of view of the robot's camera, FuseBot begins by clustering the occupancy distribution and selecting the area with the highest average probability. The robot moves over this area before computing the object masks and grasp qualities and executing the RF-Visual extraction policy. This ensures that FuseBot can extend to any size workspace within the robot arm's reach.

Implementation
Physical Setup. We implemented FuseBot on a Universal Robots UR5e robot (Universal Robots, 2021) with a Robotiq 2F-85 gripper (Robotiq, 2019). We mounted an Intel RealSense D415 depth camera (Intel RealSense, 2019) and two WA5VJB log-periodic PCB antennas (850-6500 MHz) (Kent Electronics, 2021) on the gripper. The antennas are connected to two Nuand BladeRF 2.0 Micro software radios (Nuand, 2021) through a Mini-Circuits ZAPD-21-S+ splitter (0.5-2.0 GHz). To obtain RFID locations, we implemented an RFID localization module using the wrist-mounted antennas and BladeRFs, following a method similar to past work (Ma et al., 2017; Boroushaki et al., 2021b). We used standard off-the-shelf UHF RFID tags (the Smartrac DogBone RFID inlay (Inlay, 2021)), which cost around 3-5 cents each.
Control Software. The system was developed and tested on Ubuntu 20.04 and ROS Noetic. We used MoveIt [31] as the inverse-kinematics solver to control the robot through the UR Robot Driver package (Universal Robots ROS Driver, 2020). The visual map of the environment is created using OctoMap (Hornung et al., 2013). We used Synthetic Depth (SD) Mask R-CNN (Danielczuk et al., 2019) to perform instance segmentation of the scene. To predict grasp quality from depth images, we used GG-CNN (Morrison et al., 2018a, b). The baseline, X-Ray (Danielczuk et al., 2020), was implemented based on the published code (Danielczuk et al., 2021).

Real-world evaluation scenarios
We evaluated FuseBot in a variety of real-world scenarios of varying complexity, some of which are shown in Fig. 8. The scenarios had between 1 and 3 distinct piles of items, 0-10 RFID-tagged objects, and a variety of target object and RFID-tagged object sizes. Each experiment had one target item and 10-40 other distractor objects. Experiments included varying distances between the target item and the nearest RFID-tagged item, including setups with an RFID-tagged item touching the target item, RFID-tagged items in the same pile as the target item, or all RFID-tagged items in different piles than the target item. We also evaluated FuseBot in scenarios where the target object was itself tagged with an RFID. Similar to prior work (Danielczuk et al., 2020) that uses color-based object identification for simplicity, the target item is a red item and FuseBot uses HSV color segmentation to identify when the target item is in line-of-sight. We note that this step can be replaced by any target template matching network, such as the one used in Danielczuk et al. (2019), to identify target objects of any type.
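A minimal NumPy sketch of such red-color HSV segmentation (the thresholds are illustrative, not the paper's values; hue follows OpenCV's H ∈ [0, 180) convention, where red wraps around 0, so two hue bands are combined):

```python
import numpy as np

def red_target_mask(hsv):
    """Boolean mask of red pixels in an HSV image of shape (H, W, 3),
    with hue in [0, 180) and saturation/value in [0, 255]."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    saturated = (s >= 100) & (v >= 100)  # ignore washed-out pixels
    red_hue = (h <= 10) | (h >= 170)     # red spans the hue wrap-around
    return red_hue & saturated
```

A nonzero mask over a segmented object would indicate that the (red) target item has become visible to the wrist-mounted camera.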
We use everyday objects, both deformable and solid, in our evaluation, including office supplies, toys, and household items like gloves, beanies, tissue packs, travel shampoo, stuffed animals, and thread skeins.

Baselines
We compared FuseBot's performance with X-Ray (Danielczuk et al., 2020). X-Ray works by estimating 2D occupancy distributions and selecting the object with the highest total probability within its mask to pick up. X-Ray relies entirely on visual information and has no mechanism for RF perception.

Number of actions
We measured the number of grasping actions needed to extract the target item from the environment. Actions include grasping a non-target object, grasping the target object, or failing to grasp anything.

Success rate
We also evaluated the success rate of our system and the baseline. An experimental trial was considered a failure if the robot performed 15 actions without retrieving the target item, or if the robot performed 5 consecutive grasping attempts that failed to grasp any item.

Search and retrieval time
We measured the time during which the robot was moving in each successful mechanical search and retrieval task. For FuseBot, this time included the scanning step required to localize the RFIDs.
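The termination criteria above can be expressed as a small sketch (the action labels are illustrative names, not from the paper):

```python
def trial_outcome(actions, max_actions=15, max_consecutive_failures=5):
    """Classify a trial from its action log. Each action is one of
    'target' (grasped the target), 'distractor' (grasped a non-target
    object), or 'fail' (grasped nothing). Returns (outcome, n_actions).
    A trial fails if the target is not retrieved within max_actions
    actions, or after max_consecutive_failures consecutive failed
    grasps."""
    consecutive = 0
    for n, a in enumerate(actions, start=1):
        if a == 'fail':
            consecutive += 1
            if consecutive == max_consecutive_failures:
                return ('failure', n)
        else:
            consecutive = 0
            if a == 'target':
                return ('success', n)
        if n == max_actions:
            return ('failure', n)
    return ('failure', len(actions))
```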

Baseline comparisons
We evaluated FuseBot and X-Ray in 181 real-world experimental trials. The experiments covered multiple scenarios of varying complexity with 1-3 piles, 0-10 RFID-tagged items, and different target object sizes. We tested X-Ray and FuseBot in the exact same scenarios, but repeated FuseBot multiple times in each scenario with different numbers and locations of RFID-tagged items. We measured the number of actions it took to find and retrieve the target item, the success rate of each system, and the search and retrieval time for each system. Recall from Sect. 7(c) that an experimental trial is considered successful if the robot can find and retrieve the target item within 15 actions.

Overall number of actions
Table 1 shows the 10th, 50th, and 90th percentiles of the number of actions required to find and extract the target object. It includes results from FuseBot with RF-tagged target objects, FuseBot with non-tagged target objects, and X-Ray. We make the following remarks:
• FuseBot needs only 3 actions at the median to retrieve a non-tagged target item, a 40% improvement over X-Ray's median of 5 actions. This shows that FuseBot retrieves non-tagged target items more efficiently than the state-of-the-art vision-based baseline across a variety of scenarios.
• The 90th percentile of FuseBot with non-tagged items is 6 actions, while X-Ray's 90th percentile is 11 actions. This shows that FuseBot performs more reliably, with a 45% improvement over the state of the art at the 90th percentile.
• When searching for a tagged target item, FuseBot requires only 2 actions at the median and 5 actions at the 90th percentile. Note that here it performs better than when extracting a non-tagged item. This is expected because localizing the tagged target item reduces the uncertainty about its location and makes mechanical search more efficient. This result shows that FuseBot's performance matches that of past state-of-the-art systems designed to extract RFID-tagged items (Boroushaki et al., 2021b); moreover, unlike these prior systems, FuseBot's benefits also extend to non-tagged items.

End-to-end success rate
Table 1 also reports the end-to-end success rate. The results show that FuseBot is able to retrieve the target item 95% of the time for both non-tagged and tagged target objects, while X-Ray is only able to do so in 84% of scenarios. This demonstrates that FuseBot improves not only the efficiency, but also the success rate of mechanical search.

Search and retrieval time
Table 2 shows the search and retrieval time for both FuseBot and X-Ray. It is worth noting that the robot was programmed to move at the same speed across all experimental trials. We make the following remarks:
• FuseBot requires only 62 s at the median, while X-Ray's median is 142 s, showing more than a 2x improvement over the baseline's performance.
• The 90th percentile of FuseBot is 132 s, while X-Ray's is 237 s, showing FuseBot's improvement in reliability over X-Ray.
• This improvement in search and retrieval time shows that FuseBot is more efficient than the baseline despite requiring an additional scanning step.

Scenario complexity
We evaluated FuseBot for non-tagged target objects and X-Ray across three scenarios of different complexities.
• In the first level of complexity, the systems were evaluated on a setup with 2 distinct piles of objects and a total of 20 distractor objects.
• In the second level of complexity, the systems were evaluated on a setup with 3 distinct piles of objects and a total of 25 distractor objects.
• In the third level of complexity, the systems were evaluated on a setup with 3 distinct piles of objects and a total of 42 distractor objects.
Figure 9a plots the number of actions required to find and retrieve the target object for both FuseBot (green) and X-Ray (blue) across three scenarios of different complexities. The error bars indicate the 10th and 90th percentiles. We make the following remarks:
• Across all levels of complexity, FuseBot outperforms the baseline in terms of both its median and 90th-percentile efficiency. This shows that the benefits of RF perception extend to complex scenarios.
• In more complicated scenarios with a larger number of distractor objects, both FuseBot and X-Ray require more actions to retrieve the target item. Interestingly, FuseBot's efficiency gains over the baseline increase with scenario complexity.

Microbenchmarks
In addition to baseline comparisons, we performed microbenchmarks to quantify how different factors impact the performance of FuseBot.

Number of RFID tagged items
Recall from Sect. 4.3 that FuseBot creates an RF kernel for each identified and localized RFID-tagged item and uses the kernels to build the occupancy distribution. The occupancy distribution gives FuseBot better insight into the location of the target item. We quantified how the system performs with different numbers of RFID-tagged items through 54 experiments in the same scenario with varying numbers of RFIDs.
In this scenario, we have 3 different piles with a total of 25 objects.
Figure 9b plots the number of actions required to retrieve the target item versus the number of localized RFIDs in the environment for FuseBot (green) and X-Ray (blue). The error bars denote the 10th and 90th percentiles. Since X-Ray does not utilize RFIDs, its results are not separated by number of RFIDs. We make the following remarks:
• As the number of localized RFIDs in the environment increases, FuseBot's median number of actions decreases, dropping from 4 with no RFIDs to 2 with 6-9 RFIDs. This improvement in efficiency is expected, because additional RFID-tagged items increase the number of RF kernels, which in turn narrows down the candidate locations for the non-tagged target item. More generally, this result shows that leveraging RF perception improves the efficiency of mechanical search, and that the improvement grows with the number of RFID-tagged items.
• Interestingly, even with 0 RFIDs, FuseBot outperforms X-Ray. Specifically, it requires a median of only 4 actions, while X-Ray requires 7 for the same scenario. This is due to two main reasons. First, while FuseBot leverages a 3D distribution, X-Ray only uses a 2D probability distribution, which does not account for the height of different objects. Second, unlike FuseBot, X-Ray does not account for grasp quality when selecting an object to remove from the pile. This makes it susceptible to choosing objects that are more difficult (hence less efficient) to grasp.

Distance from nearest RFID to target item
Our next microbenchmark investigates whether the presence of an RFID-tagged item near the target item impacts performance. Specifically, one concern with applying the negative mask is that it biases the extraction policy away from the RFID-tagged item. To investigate this, we ran 51 real-world experiments across three scenarios:
• Touching In this category, there is at least one RFID-tagged item in direct contact with the target item.
• Opposite Side of Pile In this category, all RFIDs are either on the opposite side of the target item's pile or in different piles than the target item.
• Different Piles In this category, all RFIDs are in different piles than the target object.
Figure 9c plots, in green, the median number of actions required to find the target item in each of the three categories of scenarios described above. The error bars denote the 10th and 90th percentiles. For comparison, the blue bar shows the performance of X-Ray in the same scenario. Since X-Ray does not leverage RFIDs, its performance is not separated into different categories.
We make the following remarks:
• Different Piles, Opposite Side of Pile, and Touching require only 2, 3, and 3 actions at the median, respectively, whereas X-Ray requires 7 actions to retrieve the target item. This shows that FuseBot outperforms the baseline across all categories of scenarios, even when an RFID-tagged item is touching the target object.
• In Touching, the median number of actions is similar to Different Piles and Opposite Side of Pile, but the 90th percentile is worse. This is expected because the negative RF mask biases the search away from the target object. However, it is important to note that the 90th percentile is still only 5 actions.

Impact of extraction policy
Next, we evaluate the benefits of FuseBot's RF-Visual extraction policy. To do so, we compare it to the performance of a naive extraction policy. Unlike FuseBot's policy, this naive policy is unaware of the individual objects in the pile, and therefore has no way to estimate the expected information gain of removing an item. The naive policy operates in two steps: first, it selects the voxel with the highest probability in the RF-Visual occupancy distribution (from RF-Visual Mapping); then, it performs the best grasp that is within 5 cm of the voxel's projection onto the surface of the pile.
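A sketch of this naive baseline, under an assumed flat-array data layout (candidate voxels with their probabilities, and grasp candidates with planar positions and qualities):

```python
import numpy as np

def naive_policy(voxel_prob, voxel_xy, grasp_xy, grasp_quality, radius=0.05):
    """Naive baseline: pick the highest-probability voxel, then return
    the index of the best grasp whose (x, y) position lies within
    `radius` meters (5 cm) of that voxel's projection onto the pile
    surface; None if no grasp is close enough."""
    best_voxel = voxel_xy[int(np.argmax(voxel_prob))]
    d = np.linalg.norm(grasp_xy - best_voxel, axis=1)
    nearby = np.flatnonzero(d <= radius)
    if nearby.size == 0:
        return None
    return int(nearby[np.argmax(grasp_quality[nearby])])
```

Unlike FuseBot's policy, this selection ignores object extents entirely, so it cannot trade off information gain across candidate objects.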
Table 3 shows the 10th, 50th, and 90th percentiles of the number of actions required to successfully extract the target item under both extraction policies, for the same set of scenarios with a fully-occluded untagged target item. The results show that the RF-Visual extraction policy allows FuseBot to successfully complete the task with a median of 2.5 actions. In contrast, when using the naive extraction policy, it requires a median of 4 actions. Furthermore, the 90th percentile of FuseBot's extraction policy is only 4 actions, while the naive policy requires 6.9 actions. This performance improvement is due to the fact that FuseBot's RF-Visual extraction policy optimizes for information gain, allowing it to search the environment more efficiently than the simpler extraction policy.

Impact of deformable RF kernel
Recall from Sect. 4.5 that FuseBot can leverage deformable RF kernels to more accurately model deformable RFID-tagged objects. The aim of this benchmark is to evaluate the performance improvement of this model. We evaluated FuseBot with both spherical and deformable RF kernels. We ran 20 trials across multiple scenarios where at least one RFID-tagged item was deformable and FuseBot was tasked with retrieving a non-tagged target item that was fully occluded under the piles. To ensure a fair comparison, we did not include failed grasp attempts in the total number of actions for this microbenchmark, as they were caused by grasping network errors rather than by the RF kernels.
Table 4 compares the number of actions needed to retrieve the target item when using deformable RF kernels versus spherical RF kernels. We make the following remarks:
• FuseBot with deformable RF kernels retrieved the target object with a median of 3.0 actions and a 90th percentile of 4.0 actions. In contrast, FuseBot with spherical kernels required a median of 4.0 actions and a 90th percentile of 6.2 actions to finish the same tasks. This demonstrates that accounting for object deformability in RF kernels further improves the system's efficiency.
• Importantly, FuseBot with spherical kernels was still able to successfully retrieve the target object in all trials. This shows that, despite decreased efficiency, FuseBot's probabilistic approach still allows for successful task completion even with inaccurate kernel models.

RFID localization accuracy
In our final microbenchmark, we evaluated the accuracy of FuseBot's RFID localization over 37 experiments. To evaluate the impact of occlusions on RFID localization accuracy, we computed the error in two cases: one where the tag was in line-of-sight (LOS) of the antennas and one where the tag was in non-line-of-sight (NLOS) (e.g., covered by clothes, stuffed animals, etc.). We used the OptiTrack motion capture system (Optitrack, 2017) to obtain accurate ground truth locations. Since the RF signal can emanate from any position on the RFID tag, we measure the error as the L2 norm between the estimated RFID location and the nearest point on the RFID tag.
Table 5 shows the RFID localization accuracy in LOS, NLOS, and overall. We make the following remarks:
• FuseBot is able to accurately localize RFIDs, achieving a median error of 3.6 cm and a 90th percentile of 6 cm. We note that this level of error is typically smaller than the dimensions of the object to which the RFID tag is attached, allowing FuseBot to accurately model the environment. We also note that FuseBot's probabilistic approach is specifically designed to account for these small errors.
• The localization accuracy in LOS and NLOS scenarios is very similar, with the median error increasing by less than half a centimeter and the 90th percentile increasing by 1 cm in NLOS scenarios. This is expected since RF signals can travel through most occlusions, and it matches results reported in state-of-the-art RFID localization work (Boroushaki et al., 2021b).
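The error metric above can be sketched as follows, approximating the tag by dense samples along its long axis (RFID inlays are thin strips; this simplification and the sampling density are our assumptions):

```python
import numpy as np

def tag_localization_error(estimate, tag_endpoints, samples=200):
    """L2 error between an estimated RFID location and the nearest
    point on the tag, approximated by sampling points along the
    segment between the tag's two endpoints."""
    p0, p1 = (np.asarray(p, dtype=float) for p in tag_endpoints)
    t = np.linspace(0.0, 1.0, samples)[:, None]
    points = p0 + t * (p1 - p0)  # sampled points along the tag
    d = np.linalg.norm(points - np.asarray(estimate), axis=1)
    return float(d.min())
```

Measuring to the nearest tag point (rather than, say, the tag center) avoids penalizing the system for ambiguity in where on the tag the signal emanates from.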

Discussion and conclusion
This paper presented FuseBot, the first RF-Visual mechanical search system that leverages RF perception to efficiently retrieve both RF-tagged and non-tagged items in the environment. The paper presents novel primitives for RF-Visual mapping and extraction and implements them in a real-time prototype evaluated in practical and challenging real-world scenarios. Our evaluation demonstrated that the mere existence of RFID-tagged items in the environment can deliver important efficiency gains for the mechanical search problem. Our evaluation of FuseBot in end-to-end retrieval tasks also revealed a number of interesting insights. While FuseBot's design focused on retrieving untagged target items, our results showed that its efficiency in extracting RFID-tagged target objects matches that of state-of-the-art RF-Visual mechanical search systems that can only extract RFID-tagged objects. Our evaluation also showed that FuseBot is successful and efficient in performing mechanical search across piles with deformable objects.
In conclusion, with the rapid and widespread adoption of RFID tags across various industries, this paper uncovers how RF perception can play a role in making robotic tasks more efficient and reliable in domains such as warehousing, manufacturing, retail, and others.

Fadel Adib is an Associate Professor in the MIT Media Lab and the Department of Electrical Engineering and Computer Science. He is the founding director of the Signal Kinetics group, which invents wireless and sensor technologies for networking, health monitoring, robotics, and ocean IoT. He is also the founder & CEO of Cartesian Systems, a spinoff from his lab that focuses on mapping the physical world at unprecedented scale. Adib was named by Technology Review as one of the world's top 35 innovators under 35 and by Forbes as 30 under 30. His research on wireless sensing (X-Ray Vision) was recognized as one of the 50 ways MIT has transformed Computer Science, and his work on robotic perception (Finder of Lost Things) was named as one of the 103 Ways MIT is Making a Better World. Adib's commercialized technologies have been used to monitor thousands of patients with Alzheimer's, Parkinson's, and COVID-19, and he has had the honor to present his work to multiple heads of state, including President Obama at the White House. Adib is also the recipient of various awards including the NSF CAREER Award (2019), the ONR Young Investigator Award (2019), the ONR Early Career Grant (2020), the Google Faculty Research Award (2017), the Sloan Research Fellowship (2021), and the ACM SIGMOBILE Rockstar Award (2022), and his papers have won awards for best papers, demos, and highlights at premier academic venues including ACM SIGCOMM, ACM MobiCom, ACM CHI, IEEE RFID, Nature Electronics, and Nature Communications. Adib received his Bachelor's from the American University of Beirut (2011) and his Ph.D. from MIT (2016), where his thesis won the Sprowls Award for Best Doctoral Dissertation at MIT and the ACM SIGMOBILE Doctoral Dissertation Award.

Fig. 2
Fig. 2 RF-visual mapping and RF-visual extraction. a As FuseBot moves, it observes the environment using the wrist-mounted camera and RF module. b Using the RF measurements, FuseBot localizes the RFID-tagged items in the environment and computes RF kernels. c Using the wrist-mounted camera, FuseBot observes the environment. d FuseBot fuses the vision observations and the RF kernels to create a 3D

Fig. 3
Fig. 3 RF-visual mapping. FuseBot a constructs an initial map of unknown regions using visual RGB-D information and b uses RFID tag locations to construct RF kernels. c It then combines the RF and

Fig. 4
Fig. 4 RF localization. FuseBot sends and receives RF signals (red arrows) to and from the battery-free RFID tag (in yellow) at different vantage points in order to localize the RFID tags in the environment

Fig. 5
Fig. 5 Confident RFID localization. FuseBot uses the heatmap of the probability (high probability in yellow, low probability in blue) to determine its confidence in an RFID location. a A highly confident localization, with a small area of yellow surrounding the RFID tag (green x). b A low-confidence localization, with a large area of yellow (Color figure online)

Fig. 6
Fig. 6 Spherical and deformed RF kernel. a A non-deformed RFID-tagged object. b An RFID-tagged object that is deformed (compressed in the vertical direction and expanded in the horizontal direction) under a pile of objects. c The heatmap of the spherical RF kernel overlayed on

Fig. 7
Fig. 7 RF-visual extraction. a FuseBot performs depth-based object segmentation to separate different objects in the environment. b FuseBot uses the 3D occupancy distribution of the target item. c FuseBot projects the occupancy distribution on each segmented mask. d Fuse-

Fig. 8
Fig. 8 Example evaluation scenarios. This shows some of the evaluation scenarios for a 1 pile, b 2 piles, and c 3 piles. The target item is fully occluded in all the scenarios

Fig. 9
Fig. 9 Impact of different parameters on performance. a This figure plots the number of actions required by both FuseBot and X-Ray across three different scenarios of increasing complexity. b The figure plots the number of actions versus the number of localized RFIDs across fully occluded real-world experiments. c This figure plots the median

Laura Dodds is a Ph.D. student at MIT, where she works on wireless sensing. Her current research aims to develop novel RF sensing modalities that unlock new capabilities spanning applications in robotics, augmented reality, and supply chain. She was named an MIT Irwin Mark Jacobs and Joan Klein Jacobs Presidential Fellow in 2022, and her M.Eng. thesis received the 2023 Charles & Jennifer Johnson MEng in AID Thesis Award. She received her M.Eng. and B.S. from MIT in 2022 and 2021, respectively.

Nazish Naeem is a Ph.D. student at MIT. Her research focuses on developing novel wireless sensing technologies for applications ranging from robotic manipulation to subsea ocean IoT. She received her M.S. from MIT in 2023 and her B.S. from LUMS in 2021.

Table 4
Impact of deformable RF kernel on efficiency.

Table 5 RFID tag localization error
The table shows the 10th, 50th, and 90th percentiles of the L2 norm of the localization error of RFIDs in line-of-sight, non-line-of-sight, and all scenarios