Vision and Tactile Robotic System to Grasp Litter in Outdoor Environments

The accumulation of litter is increasing in many places and is consequently becoming a problem that must be dealt with. In this paper, we present a manipulator robotic system to collect litter in outdoor environments. This system has three functionalities. Firstly, it uses colour images to detect and recognise litter comprising different materials. Secondly, depth data are combined with pixels of waste objects to compute a 3D location and segment three-dimensional point clouds of the litter items in the scene. The grasp in 3 Degrees of Freedom (DoFs) is then estimated for a robot arm with a gripper for the segmented cloud of each instance of waste. Finally, two tactile-based algorithms are implemented and then employed in order to provide the gripper with a sense of touch. This work uses two low-cost visual-based tactile sensors at the fingertips. One of them addresses the detection of contact (which is obtained from tactile images) between the gripper and solid waste, while another has been designed to detect slippage in order to prevent the objects grasped from falling. Our proposal was successfully tested by carrying out extensive experimentation with different objects varying in size, texture, geometry and materials in different outdoor environments (a tiled pavement, a surface of stone/soil, and grass). Our system achieved an average score of 94% for the detection and Collection Success Rate (CSR) as regards its overall performance, and of 80% for the collection of items of litter at the first attempt.


Introduction
Several environmental problems currently harm our planet, one of which is the accumulation of waste such as plastic bottles, metal cans, drink cartons or glass, which is clearly visible in the streets and parks of cities. The mean estimated degradation time ranges from 5 years for cardboard to 4,000 years for glass. In order to help avoid the contamination of soil and the environment, this waste should be selectively picked up in an automated manner for its subsequent recycling. We propose to solve the problem of collecting this kind of waste outdoors by providing a robotic system that incorporates several visual-tactile perception systems.
Several solutions that use robots for cleaning purposes already exist in the literature. For instance, [1] solves the aforementioned problem in a simulated indoor environment, and [2] presents a re-configurable cleaning robot that works in a real environment, but these solutions do not provide the object recognition or manipulation skills that enable refuse to be picked up. In this line, [3] shows an outdoor solution in which a robot detects and picks up refuse bags, but without tactile perception or variability in the scenarios. At present, robot learning techniques make it possible to address interactions with the environment during navigation and manipulation tasks, as shown in [4]. Although its results are promising, this kind of approach requires a lot of training data and is usually limited to controlled settings, which are usually indoors.
The technique presented herein has been tested in real-world scenarios by performing not only detection, as occurs in [5], but also the recognition of a larger number of waste elements than in [6]. This is done by extracting the best grasping points from the segmented point cloud, signifying that there are fewer points to process, thus reducing the computation time when compared to that of [7]. Unlike the aforementioned works, our approach also includes object handling skills thanks to new tactile perception algorithms, which make it possible to accomplish stable grasping when picking up litter.
Our main contributions are:
• We propose a tactile-based grasping estimation method for the picking of waste objects. We specifically use low-cost visual-based sensors for tactile manipulation in order to carry out the task of litter collection. As these tactile sensors do not have a mathematical equation with which to map tactile images (intensity) onto force in N, and they also do not contain visual markers with which to estimate the movement variation, our data-based methods are crucial as regards performing this task correctly.
• We generate two datasets, one containing tactile images for the contact and slip detection tasks, and another containing colour images of household waste in a wide variety of outdoor environments, for the object detection and localisation tasks.
• We present comparative studies of object recognition and contact-slip detection during the grasping task, first using our datasets and later grasping litter in outdoor environments such as our university campus.
A further contribution is the design, implementation and communication of the different perception modules on a real robot system applied to the task of litter collection in outdoor environments.
Our robotic system is divided into two main modules: the vision module and the tactile module. The vision module is able to detect and recognise litter from images and to calculate grasping points from the point cloud of the object. The tactile module performs grasping detection and control on the basis of tactile feedback. Our solution has been integrated and tested using a UR5e commercial robotic arm with a 2F-140 ROBOTIQ gripper. The arm was installed on our mobile robotic platform with autonomous navigation, called BLUE [8], which can be seen in Fig. 1. BLUE has several sensors, to which we have added an IMU and RGBD cameras (Intel® RealSense™ D435i) for detection and recognition, along with two DIGIT sensors [9] attached to the fingertips of the gripper for the tactile operations.
This paper is organised as follows: Section 2 provides an explanation of related works regarding each part of the pipeline, while Section 3 describes the proposed methods of which our visual-tactile perception system is composed. A description of how each of the methods is trained and validated separately is provided in Section 4, along with the preliminary results obtained after carrying out tests with previously unseen items of waste. The results obtained by the whole visual-tactile perception system in real environments are then presented, and finally, the paper concludes with a discussion of the results obtained and the performance of the proposal.

Related works

Visual perception
The rapid development of deep learning has led to an avalanche of object detection methods [10]. For example, [5] used an AlexNet Convolutional Neural Network (CNN) to perform image classification tasks in both indoor and outdoor environments. However, this work presents solely a detection system for household waste rather than a system with which to pick waste up.
Other works attempt to solve more complex problems, such as object detection and segmentation. In [11], the author uses two Neural Networks (NNs) to eliminate the ground and classify six kinds of waste items. We, however, use a single NN to locate and classify the objects, thus speeding up the process. Going one step further, [12] trained a YOLACT NN to classify items of domestic waste in indoor environments. They tested the model with only three categories ("plastic", "metal" and "cardboard"). We, in contrast, add a new class of glass objects, thus making the detection more difficult owing to their transparency. Moreover, our system works in outdoor environments in which lighting conditions cannot be controlled.
The waste collection task makes it necessary to address not only the detection and location of objects but also their grasping [13], as already implemented in the robotic field. For instance, [7] proposed a mathematical method with which to calculate the best pair of grasping points from a 3D-scene point cloud. This method extracts the object and obtains the grasping pose using the curvature of the object. In our work, we use a segmented point cloud of the 3D object for the grasping task. Reducing the number of candidate points makes the process faster. Another approach, based on NNs and shown in [14], generates grasping proposals from voxelised 3D-object point clouds. However, it has limitations when processing complete outdoor point clouds rather than segmented object point clouds, and when working in real time. Our proposal is an improvement in this respect, since it is faster in outdoor scenarios.

Tactile perception
Several types of tactile sensors (capacitive, resistive, barometric, optical-based, etc.) have been developed in recent years in order to assist in robotic manipulation tasks such as contact or slip detection, which are essential for a safe grasp when picking up objects. In previous works, [15] calculated contact by comparing the intensity of the colour of two tactile images using an optical tactile sensor called Gelsight [16]. However, this method requires the readjustment of its parameters and is less robust to uncertainty owing to the use of traditional computer vision techniques. In [17], the authors trained a Recurrent Neural Network (RNN) to detect contact events in image sequences from another optical tactile sensor called FingerVision [18]. Nonetheless, this sensor contains markers that are complex to generate and require high-cost machinery. The DIGIT sensors that we use do not contain markers, and our CNN-based contact detection method is more robust than traditional computer vision techniques.
When employed in the literature related to this field, the term slip usually refers to the normal component of the grasping force going outside of the grasping cone. Detecting and reacting to this phenomenon is, therefore, fundamental if an object is to be grasped correctly. In [19], a Support Vector Machine (SVM) algorithm is trained in order to classify whether images obtained from a TacTip sensor correspond to stable or slipping events. However, the authors do not grasp the object with a gripper or hand, but rather push the object against the wall in order to stop the slip movement. This application is, therefore, limited. The slip or stable classification task has also been studied in [20], in which visual (eye-to-hand camera) and tactile (Gelsight sensor) information is combined. However, it is necessary to process a lot of data, and real-time execution is not guaranteed. Conversely, our slip detector works extremely fast and allows the implementation of our controller. In a major advance, [21] researched the slip detection problem using a multi-fingered hand. Nonetheless, detecting slip becomes more complex as the number of sensors increases. Our approach of using a two-fingered robotic gripper is, therefore, better suited to the litter collection task.

Our approach: Visual-tactile perception for robotic manipulation
In this work, we propose a visual-tactile perception system for robotic manipulation, whose architecture is shown in Fig. 2. Only the six-DoF manipulator arm of the mobile manipulator robotic system described in Section 1 is employed for the task of litter collection, assuming that the navigation task towards the litter objects has already been carried out, thus obtaining an optimal localisation of the mobile platform. Detailed descriptions of the three main modules of our system are provided in the following sections.

Object detection and localisation
In this section, the first module is explained, which consists of detecting and locating the item of litter using a CNN. The first task carried out by our system is detecting and classifying litter in outdoor environments. In order to accomplish it, our BLUE robot performs the exploration task using several navigation algorithms. While the exploration is taking place, BLUE is able to annotate objects as possible instances of refuse and compute their spatial location in the world [22], thus making it possible to plan trajectories towards them. The visual sub-system, which is shown in the green scheme in Fig. 2, captures images while the robot is navigating and then processes them using a CNN to obtain the position and category of the objects. After analysing several CNNs available in the state of the art [23], [24], we chose Mask R-CNN [25], YOLACT [26] and YOLACT++ [27]. These CNNs appear to work well in a wide variety of fields that require segmentation tasks, such as understanding natural scenes or intelligent driving. In the former, segmentation improves object detection since it avoids some cases of occlusion by providing a more detailed analysis of the image, while in the latter it is used to determine the localisation of major categories of objects such as street lights or people running.
Fig. 2: Scheme of our tactile-visual system for robotic grasping. It is made up of two main parts: waste detection and recognition (green part) and tactile perception in order to manipulate the item of waste (blue part).
On the one hand, Mask R-CNN is considered to be a two-stage detector, since it has two parts. The first generates Regions of Interest (RoIs), and the second classifies and segments those RoIs. These detectors have drawbacks, such as low performance and a dependency on feature localisation. Its architecture is similar to that of the Faster-RCNN [28], to which it adds a new layer in order to predict a segmented mask. In this case, there are three output layers: the class label, the Bounding Box (BB), and the aforementioned segmented mask. The changes with respect to Faster-RCNN comprise an improved RoI Pooling layer (RoI Align) and the addition of a segmented mask output layer (see Fig. 3).
Fig. 3: Mask R-CNN architecture with an example of the input and output from the litter detection task.
On the other hand, YOLACT is considered to be a single-stage detector that performs the instance segmentation in one step. This makes it really fast and allows it to achieve real-time inference. The detection task is divided into two simpler parallel sub-tasks. The first consists of the prototype generation branch, which predicts a set of k prototype masks without loss (this loss exists in the traditional methods). The second involves the mask coefficient branch, which predicts a vector of mask coefficients, one for each prototype. These result in k mask coefficients (one per prototype), c class confidences, and 4 BB regressors, producing a total of 4 + c + k coefficients per anchor. Finally, the two sub-tasks come together in the mask assembly process. This is done by applying a linear combination of the outputs of both sub-tasks followed by a sigmoid non-linearity (see Fig. 4). YOLACT++ was created later by making some minor changes to YOLACT. These changes range from a fast mask re-scoring network to deformable convolution in layers, including an optimised prediction head.
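The mask assembly step described above (a linear combination of prototypes and coefficients followed by a sigmoid) can be illustrated with a minimal NumPy sketch. The array shapes, the function names `assemble_masks` and `sigmoid`, and the random inputs are assumptions made for illustration only; this is not the YOLACT implementation itself.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def assemble_masks(prototypes, coefficients):
    """Combine k prototype masks into one mask per detected instance.

    prototypes:   array of shape (H, W, k), output of the prototype branch
    coefficients: array of shape (n, k), one coefficient vector per instance
    returns:      array of shape (H, W, n) with values in (0, 1)
    """
    # Linear combination of the prototypes, then the sigmoid non-linearity.
    return sigmoid(prototypes @ coefficients.T)

# Hypothetical example: k = 3 prototypes of 4x5 pixels, n = 2 instances.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(4, 5, 3))
coefficients = rng.normal(size=(2, 3))
masks = assemble_masks(prototypes, coefficients)
print(masks.shape)
```

In practice the resulting soft masks would still be cropped to the predicted BB and thresholded, which this sketch leaves out.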
One of the configurable parameters in both CNNs is the backbone. Our visual sub-system allows a choice between the ResNet50-FPN and the ResNet101-FPN. Mask R-CNN includes the DarkNet53-FPN as an extra backbone. The number that accompanies each backbone determines the number of layers. In fact, all of these CNNs include a Feature Pyramid Network (FPN) as part of their architecture. ResNet [29] stands for Residual Neural Network. In 2015, it was still believed that adding additional layers to a NN would make it work better. This worked in theory but not in practice, since there was the problem of the vanishing gradient. This new architecture solved that problem by incorporating residual blocks with skip connections.
The FPN consists of reducing the size of an image step by step. Image features are then extracted from both the original and the scaled images in each step. FPN later combines all the features that have been extracted, mixing both low-resolution semantically strong and high-resolution semantically weak features. This combination can be achieved by using top-down and lateral connections, thus leading to better results in outdoor image analysis.
Another backbone, which is written in C and CUDA, is DarkNet [30]. This rose to fame because it improved the performance of ResNet101-FPN and carried out the detection process 1.5x faster.
In this work, we shall analyse the behaviour and performance of these architectures as backbones in detection tasks for robotic manipulation in real outdoor environments. We first present the validation results, which were obtained after running our system offline. That is, we used pre-recorded videos of previous navigation missions that had already been carried out (see Section 4.2). Finally, we show additional results in new scenarios. These have not been seen by our NN before and were captured in real-time navigation mode (see Section 4.5).

Grasping computing and trajectory planning
This section describes how our system estimates the grasping points from the 3D point cloud to collect the item of litter.
Once the object has been recognised and is in the robot's reachable workspace, our manipulator robot has to pick it up. It is, therefore, necessary to estimate the grasp. In our case, we obtain grasping points by using a method called GeoGrasp. It is based on geometry and needs a raw scene point cloud as input. In this work, however, the input has been changed, as inspired by [31]. A filtered point cloud is, therefore, employed.
An image is initially captured using a RealSense™ D435i depth camera. The depth image has a resolution of 640x480 pixels, which matches that of the RGB image. The RGB image is then processed by Mask R-CNN, YOLACT or YOLACT++. The detection task results in a cluster of pixels containing the object, a BB, and a category. This result allows the creation of a new point cloud that includes only the RGBD points considered by the NN as object points. Then, GeoGrasp calculates the grasping points in the new point cloud and locates these points in the 3D space referenced to the base of the robot. Two values are required for this task: the transformation from the proposed grasping points to the camera and the transformation from the camera to the robot base.
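The creation of the object-only point cloud can be sketched as follows. This is a minimal illustration that assumes a standard pinhole camera model; the function name `masked_point_cloud` and the intrinsic values are hypothetical, and the sketch omits any calibration or axis re-orientation applied in the real system.

```python
import numpy as np

def masked_point_cloud(depth_mm, mask, fx, fy, cx, cy):
    """Back-project the masked depth pixels into 3D camera-frame points (mm).

    depth_mm: (H, W) depth image in millimetres
    mask:     (H, W) boolean segmentation mask from the detector
    fx, fy:   focal lengths in pixels; cx, cy: camera centre in pixels
    """
    # Keep only object pixels that have a valid (non-zero) depth reading.
    v, u = np.nonzero(mask & (depth_mm > 0))
    d = depth_mm[v, u].astype(float)
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return np.stack([x, y, d], axis=1)

# Hypothetical 640x480 frame with a flat surface 500 mm from the camera.
depth = np.full((480, 640), 500.0)
mask = np.zeros((480, 640), dtype=bool)
mask[240, 320] = True   # pixel at the assumed camera centre
mask[240, 380] = True   # pixel 60 columns to the right
points = masked_point_cloud(depth, mask, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(points)
```

Only the points returned by such a filter would be passed to the grasp estimator, which is what keeps the number of candidate points small.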
The first transformation is calculated using the proposed grasping points and the camera intrinsic parameters as in Eq. (1), where $[x_G, y_G]$ denote the coordinates of the proposed grasping point in a 2D image for the x and y axes respectively, $(f_x, f_y)$ and $(c_x, c_y)$ respectively denote the focal length and the camera centre (obtained from the intrinsic parameters of the camera) in pixels, $R_t$ denotes consecutive rotations of -90 degrees about the x and y axes respectively, $M_c$ denotes the calibration matrix (obtained by using ArUco markers [32]) and $d$ denotes the depth of the point in mm (obtained from the D channel of the RGBD image captured).
The second transformation consists of transforming $[x_C, y_C, z_C]$ into the coordinates of the base of the robot by following Eq. (2), where $[x_C, y_C, z_C]$ denote the 3D coordinates relative to the camera, $[x_B, y_B, z_B]$ denote the new coordinates relative to the base in metres, $q \in \mathbb{R}^6$ denotes a vector of joint coordinates, $p(q) \in \mathbb{R}^3$ denotes the position vector from the robot's base to the end-effector coordinate frame (forward kinematics), $R(q) \in SO(3)$ is the end-effector orientation and ${}^{E}T_{C}$ is the fixed transformation between the camera and the robot end-effector. These coordinate systems can be seen on the right-hand side of Fig. 1.
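A camera-to-base change of frame of this kind can be sketched with homogeneous transforms. The composition used below (base-to-end-effector transform built from R(q) and p(q), followed by the fixed end-effector-to-camera transform) is an assumption about how the terms are arranged, and the helper names `homogeneous` and `camera_to_base` and all numeric values are hypothetical.

```python
import numpy as np

def homogeneous(R, p):
    """Build a 4x4 homogeneous transform from a rotation R and a translation p."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def camera_to_base(point_cam_mm, R_q, p_q, T_ee_cam):
    """Map a camera-frame point given in mm into the robot base frame in metres.

    R_q, p_q: end-effector orientation and position from forward kinematics (m)
    T_ee_cam: fixed 4x4 transform from the camera frame to the end-effector frame
    """
    T_base_ee = homogeneous(R_q, p_q)
    # Convert mm to m and append the homogeneous coordinate.
    point_cam_m = np.append(np.asarray(point_cam_mm, dtype=float) / 1000.0, 1.0)
    return (T_base_ee @ T_ee_cam @ point_cam_m)[:3]

# Hypothetical pose: end-effector at (0.4, 0, 0.5) m with identity orientation,
# camera mounted 0.1 m along the end-effector z axis.
T_ee_cam = homogeneous(np.eye(3), [0.0, 0.0, 0.1])
p_base = camera_to_base([50.0, 0.0, 500.0], np.eye(3), [0.4, 0.0, 0.5], T_ee_cam)
print(p_base)
```

With real data, R(q) and p(q) would come from the forward kinematics of the arm at the detection pose, and the fixed transform from a hand-eye calibration.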
One drawback of using only the segmented object points is that we have to trust the NN detection, since our calculation is based on its result. The worse the NN detection is, the worse the grasping points will be positioned. The full process is shown in Fig. 5.
Fig. 5: Scheme of our object detection and grasping points calculation process. Our NNs obtain the segmented mask from the RGB image, which is used to extract the object point cloud in order to calculate the grasping points with our new version of the GeoGrasp algorithm.
Once the objects have been detected and localised, and their grasping estimated, the robotic arm mounted on the BLUE robot has to reach them. This is done using ROS MoveIt! [33]. MoveIt! provides the UR5e arm with a motion planning framework in order to compute and test the trajectories before operating the real physical robot. A wide variety of motion planners are provided, but we use RRT* [34], [35] (an asymptotically optimal version of Rapidly-exploring Random Trees).
It is worth mentioning that some of the trajectories computed will always be the same. This has been borne in mind, and some of them have, therefore, been prerecorded. These are from the navigation pose (see Fig. 1-bottom-left) to the home pose (see Fig. 1-left), the home pose to the detection pose (see Fig. 1-right) and the home pose to each of the container poses, and vice versa. If necessary, we simply need to replay them. This allows the robot to save computing time and avoid undesired or dangerous movements. The aforementioned trajectories are shown in Fig. 1-left. The remaining movements are planned in situ (blue paths, see Fig. 1-right). The green working area has dimensions of 600x500 mm. All objects in this area are easily graspable and easy for the camera to see.
Four containers are attached to the base of BLUE. This is the same as the number of categories used to classify objects in Section 3.1. Each class of the dataset will, therefore, be stored in a different container.
Finally, all positions have pre-positions. These are used to avoid pushing, moving, or colliding with the objects. There is a downward movement of ≈ 100 mm from the pre-position to the final position. This value was obtained empirically from the experiments and was the one that produced the best results.

Tactile manipulation
This module consists of detecting physical contact and slippage between the items of litter and the gripper during the manipulation task. This task involves picking up the item of litter from the ground and placing it in the desired bin, given the previously calculated trajectory and grasping points. In order to ensure grasping safety, the contact is detected before the robot lifts the object. Once the contact has been guaranteed, the robot will be able to adjust the grasping opening if a slip is detected. These operations cannot be carried out without tactile feedback in real time. We, therefore, implement a closed-loop controller for each task, as shown in the blue scheme in Fig. 2.
It is known in the literature that force sensors are suitable for this kind of task because they allow the implementation of force-feedback controllers. Nonetheless, in this paper, the objective is to demonstrate that grasping and slipping controllers can be implemented in order to successfully solve the task by using optical tactile sensors that do not provide force values, namely the DIGIT sensors. These tactile sensors are more economical and smaller, and provide more information about the features of the object, such as texture or shape.
The DIGIT sensors provide tactile images and are used in order to implement the tactile feedback and closed-loop controllers. These sensors, which were originally designed in [9], have a physical structure that is made up of a reflective elastomer, an acrylic window, a 3D printed housing, an LED Printed Circuit Board (PCB) and a camera PCB. The sensor operates by recording the change in colours as a result of the deformation of the elastomer during the contact state. DIGIT sensors capture up to 30 Frames per Second (FPS) of RGB images with a resolution of 240x320 pixels. We mounted one DIGIT sensor (see Fig. 6) on each fingertip of the 2F-140 ROBOTIQ gripper. An additional 3D printed piece was designed and built so as to attach each sensor to each fingertip. As the main contribution of this paper, we use touch images obtained from DIGIT sensors, associated with contact properties, to perform tactile control. We design controllers to regulate the contact of a gripper that interacts with objects without mapping the force applied to the sensing surface. The force values cannot be reconstructed because this low-cost image-based sensor does not provide a pixel-to-force mapping as other optical sensors [36] or capacitive/resistive sensors [37] do. The novelty consists in constructing a version of a tactile controller for unknown environments without using the tactile Jacobian and the features of the forces, as is done in other works such as [38].

Contact detection
Contact detection is formulated as a binary classification task, as described in Algorithm 1. Given an input image $\varphi = [R, G, B]$ acquired from the DIGIT sensor, a ground-truth label $y \in \{0, 1\}$ is assigned, where 0 denotes a no-contact image and 1 denotes a contact image (see Fig. 7). Bearing this in mind, we use a CNN to solve this task. This method was chosen owing to its ability to learn and extract features from images, ranging from edges and simple textures to more complex textures, patterns and parts of objects. In our previous work [39], we carried out extensive and rigorous experimentation with CNN architectures in order to discover which would be the most suitable for the purpose of contact prediction. We trained three different architectures, VGG16 [40], InceptionV3 [41] and MobileNetV2 [42], following a transfer learning strategy. Our dataset contained ≈ 16,000 images from three DIGIT units and nine objects with different shapes and textures, which were manually annotated by comparing each contact image with a no-contact reference image. This work showed that InceptionV3 was the most appropriate architecture in terms of accuracy, robustness, and inference time. However, in the present work, it is necessary to reduce the size of its architecture in order to speed up the inference process in our embedded system for the robot. This reduction led to the decision to train one NN per sensor unit ($\theta_{contact}^{sensor\,unit}$, where the sensor unit can be A or B), thus improving performance and evaluation values. Specifically, we used the InceptionV3 backbone up to the "mixed5" layer as a feature extractor and modified the final layers to adapt them to our task. The final layers are made up of a GlobalAveragePooling2D layer and two blocks of Batch Normalisation, Dense, and Dropout layers. Finally, the output layer was added with a single neuron and a sigmoid activation function with a threshold $T_{contact}^{sensor}$ (Fig. 8). In order to execute the desired manipulation tasks, a closed-loop controller is implemented, as shown in Algorithm 1.
First, it checks whether the pose of the robot is that of grasping or releasing the item of litter. It then executes the contact prediction model to obtain the contact state for each sensor. When carrying out the subsequent grasping task, the robot closes the gripper by one step each time that the contact prediction is equal to 0, signifying that the item of litter has not yet been grasped. During the releasing task, however, the robot opens the gripper by one step each time that the contact prediction is equal to 1, signifying that the item of litter has not yet been released. The execution of Algorithm 1 ends when the contact prediction is equal to 1 (grasping task) or 0 (release task).
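The closed-loop behaviour described for Algorithm 1 can be sketched as a simple loop. This is a minimal illustration rather than the authors' controller: `run_contact_loop`, the `max_steps` safety limit and the `FakeGripper` simulation are hypothetical, and a single contact predictor stands in for the per-sensor predictions (how the two sensors are combined is left to the caller).

```python
def run_contact_loop(task, predict_contact, step_gripper, max_steps=100):
    """Step the gripper until the contact prediction matches the task goal.

    task:            "grasp" (stop when contact == 1) or "release" (stop when contact == 0)
    predict_contact: callable returning 0 (no contact) or 1 (contact)
    step_gripper:    callable taking "close" or "open" and moving one step
    Returns the number of gripper steps taken.
    """
    if task not in ("grasp", "release"):
        raise ValueError("task must be 'grasp' or 'release'")
    target = 1 if task == "grasp" else 0
    direction = "close" if task == "grasp" else "open"
    steps = 0
    while predict_contact() != target:
        if steps >= max_steps:  # assumed safety limit, not part of the paper
            raise RuntimeError("target contact state not reached")
        step_gripper(direction)
        steps += 1
    return steps


class FakeGripper:
    """Hypothetical stand-in for the real gripper and contact predictor."""

    def __init__(self, opening_mm=100, object_width_mm=60, step_mm=5):
        self.opening_mm = opening_mm
        self.object_width_mm = object_width_mm
        self.step_mm = step_mm

    def step(self, direction):
        delta = -self.step_mm if direction == "close" else self.step_mm
        self.opening_mm += delta

    def in_contact(self):
        return int(self.opening_mm <= self.object_width_mm)


gripper = FakeGripper()
print(run_contact_loop("grasp", gripper.in_contact, gripper.step))
print(run_contact_loop("release", gripper.in_contact, gripper.step))
```

On the real robot, `predict_contact` would wrap the thresholded CNN output computed on the latest tactile images.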

Slip detection
Various approaches with which to solve the slip detection task by classifying image sequences have been proposed in the literature (see Section 2.2). In contrast, this paper presents a grasping method based on slip detection as described in Algorithms 2 and 3. Our proposal formulates the stage of slip detection as an image classification problem between two classes: slip and stable. The slip class refers to object movement during robot manipulation, while the stable class implies no movement.
The slip detection (Algorithm 2) takes as input a grayscale image sequence $\Phi = [\delta_t, \delta_{t+1}, \delta_{t+2}, \delta_{t+3}]$, whose length was chosen empirically in order to attain the best results in terms of accuracy and inference time. A subtraction operation over the images of the sequence is applied to obtain a subtracted image $\psi$ that captures the changes in the deformation of the elastomer. The subtracted image $\psi$ is very noisy because the pixel values are not identical in consecutive images. $\psi$ is, therefore, filtered by applying a morphological opening operation $\Psi = \psi \circ \kappa = (\psi \ominus \kappa) \oplus \kappa$, where $\circ$, $\ominus$, $\oplus$, and $\kappa$ denote opening, erosion, dilation and a structuring element, respectively. $\Psi$ is a binary image (black and white) that represents two possible states or labels $y \in \{0, 1\}$. Label $y = 0$ is assigned to the stable class and label $y = 1$ is assigned to the slip class. Figure 9 shows examples of this pre-processing, in which slip images produce white patterns (pixel value of 255) and stable images are almost black (pixel value of 0). The subtraction $\psi$ and filtering $\Psi$ operations correspond to the filterImage function in Algorithm 2.
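The subtraction and opening steps can be sketched in NumPy as follows. Several details are assumptions made for illustration: the subtraction is taken between the last and first frames of the sequence, the difference is binarised with a hypothetical `diff_threshold`, and the structuring element is a 3x3 square. The name `filter_image` merely mirrors the filterImage step and is not the authors' code.

```python
import numpy as np

def _neighbourhoods(img):
    """Stack the nine 3x3-neighbourhood shifts of a 2D boolean image."""
    padded = np.pad(img, 1, constant_values=False)
    h, w = img.shape
    return np.stack([padded[i:i + h, j:j + w] for i in range(3) for j in range(3)])

def erode(img):
    """Binary erosion with a 3x3 square structuring element."""
    return _neighbourhoods(img).all(axis=0)

def dilate(img):
    """Binary dilation with a 3x3 square structuring element."""
    return _neighbourhoods(img).any(axis=0)

def filter_image(sequence, diff_threshold=25):
    """Turn a sequence of grayscale frames into one binary image (0 or 255)."""
    first = sequence[0].astype(np.int16)   # widen to avoid uint8 wrap-around
    last = sequence[-1].astype(np.int16)
    psi = np.abs(last - first) > diff_threshold   # binarised subtracted image
    opened = dilate(erode(psi))                   # opening: erosion, then dilation
    return opened.astype(np.uint8) * 255

# Hypothetical frames: one isolated noisy pixel versus a 5x5 moving patch.
base = np.full((12, 12), 100, dtype=np.uint8)
noisy = base.copy()
noisy[1, 1] = 200
moved = base.copy()
moved[4:9, 4:9] = 200
stable_img = filter_image([base, base, base, noisy])
slip_img = filter_image([base, base, base, moved])
print(stable_img.max(), int((slip_img == 255).sum()))
```

The opening removes the isolated noisy pixel while keeping the larger moving region, which is the effect the filtering step is meant to achieve.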
Once $\Psi$ has been obtained, it can be classified as appertaining to the slip or stable class by using two different approaches and thresholds, which are compared and justified in Section 4.4. The first consists of calculating the brightness of the $\Psi$ image, in which a threshold value of $T_{slip}^{brightness}$ is established as the final classifier. The second method is a CNN (CNNPrediction in Algorithm 2) whose architecture is described in Fig. 10, followed by a threshold value $T_{slip}^{cnn}$ as the final classifier. Detailed descriptions of both methods are provided in Algorithms 2 and 3. The robot later closes the gripper by one step

Setup, performance metrics, and data
This section provides descriptions of first the hardware devices used to train and test the NN, second the performance metrics used to express our results, and finally, general information regarding our datasets.
The first device is an NVIDIA A100 Tensor Core GPU with 40 GB of memory. This device was used to train the visual and tactile perception modules. The other is an NVIDIA Jetson AGX Xavier board. This device was used to test both modules and for real-time execution. With regard to the evaluation metrics used, we have tested our visual system with an AP metric [43], [44] and our tactile system with an accuracy metric [45]. These metrics are well-established in the literature and express our results in a complete and reliable manner.
The AP metric, which is described in Eq. (3), is calculated for different IoU thresholds, described in Eq. (4),
where r, ρ, r, gt, and pd denote the levels of recall, the precision, the recall values, the ground-truth bounding boxes, and the predicted bounding boxes, respectively. All these values are calculated to measure the performance of the NN models of our pipelines.
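As an illustration of the IoU between a ground-truth box and a predicted box, the following sketch uses the standard intersection-over-union definition for axis-aligned bounding boxes. It assumes that this is the definition intended here; the function names and the (x_min, y_min, x_max, y_max) box format are hypothetical choices.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(gt, pd):
    """Intersection over Union of a ground-truth box and a predicted box."""
    inter_w = max(0.0, min(gt[2], pd[2]) - max(gt[0], pd[0]))
    inter_h = max(0.0, min(gt[3], pd[3]) - max(gt[1], pd[1]))
    inter = inter_w * inter_h
    union = box_area(gt) + box_area(pd) - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A prediction is then counted as correct at a given threshold when its IoU with a ground-truth box of the same class reaches that threshold.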
The accuracy metric described in Eq. (5) can be used for tactile testing because the tactile datasets are well-balanced.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$
where $TP$, $TN$, $FP$, and $FN \in \mathbb{N}_{\geq 0}$ denote True Positives, True Negatives, False Positives, and False Negatives, respectively. A $TP$ detection means that the system detects contact or slip and this is correct, while a $TN$ detection means that the system detects no contact or no slip and this is also correct. $FP$ and $FN$ detections occur when the system detects that there is contact or slip but this is not correct, or when the system does not detect a contact or slip state but it exists, respectively.
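A small worked example may help to fix the meaning of the four counts. The sketch below assumes the usual definition of accuracy as the fraction of correct predictions; the function names and the label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = contact/slip, 0 = otherwise)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions among all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples to evaluate")
    return (tp + tn) / total

# Hypothetical ground truth and predictions for six tactile samples.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(*counts))
```

Accuracy is only informative here because the classes are balanced; with a skewed dataset, a classifier that always predicts the majority class would still score highly.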
We created three datasets, one for each specific task that required a training phase: vision-based waste detection (D1), tactile-based contact detection (D2), and tactile-based slip detection (D3). For the visual module, we used 52 different household objects for the four classes (plastic, cardboard, glass, and metal), while for the tactile tasks we used only eight and six objects, for contact and slip detection, respectively. The visual module requires more objects for the training phase in order to learn a large variety of shapes, colours, etc., in different outdoor scenarios. Moreover, with regard to the tactile manipulation modules, different objects may produce similar tactile images because they share similar shapes and geometries, signifying that the tactile datasets do not require such a high number of objects.
Table 1 shows that the D2 dataset contains more samples because it is easier to collect and label tactile images than household waste images in different environments. The D3 samples are also tactile images but, as explained earlier, a sequence of four grayscale images is transformed into a single binary image in order to detect slip, and the final number of images is, therefore, smaller than in the other datasets.

Data collection and training for visual perception
This section describes the training dataset, the training phase, and the results with both the validation and test sets. As there are not many household waste datasets containing objects that are dirty or partially destroyed, we had to create one ourselves (dataset D1). Images taken in outdoor environments were used for the training task. These environments are all at the Technological Scientific Park, in the area around our university, and include asphalt, pebbles, and green backgrounds. The objects in them appear at different distances from the camera, may be partially occluded or shadowed by other elements in the scene, and are captured under different lighting conditions (see Fig. 11). Each class represents a broad range of elements made of the corresponding material (see Table 2): there are differently posed and sized water bottles, sports drinking bottles, Tupperware containers, cans, glass beer bottles, and juice cartons, among others. This way of naming the classes is common to other datasets and will facilitate the addition of new images from other sources if necessary.
The images were extracted from video files recorded at a resolution of 640x480 pixels using a RealSense™ D435i depth camera. As there is only one object per image, the number of images per class and per instance coincide. After processing the videos, 6,943 images were obtained to compose the dataset D1. These images were labeled with LabelMe [46], an image annotation tool. All the classes are well balanced (see Table 2). For our experimentation, we split the dataset D1 into training, validation, and test sets following a 70/20/10 proportion. The division was made randomly to guarantee that the results do not depend on how the data are distributed and picked. Indeed, the objects in the test set were not used during the training or validation phases.
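A random 70/20/10 split of this kind can be sketched as follows. The function name, the fixed seed, and the rule of giving any rounding remainder to the test set are assumptions made for this illustration; the paper does not state how the split was implemented.

```python
import random


def split_dataset(items, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle the items and cut them into training, validation and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible

    n_train = int(fractions[0] * len(items))
    n_val = int(fractions[1] * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Image indices stand in for the 6,943 labelled images of D1.
train, val, test = split_dataset(range(6943))
```

Shuffling before cutting keeps consecutive video frames from ending up grouped in a single subset.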
The NNs were trained as follows. For Mask R-CNN and all its derivatives, the training was split into three sub-training tasks. The first consisted of training the network heads for 40 epochs. The second lasted 80 epochs and trained ResNet backbone stage four and upwards. The last involved training the full NN for 40 more epochs. In the first two stages, the learning rate did not change, while in the last one it was reduced by a factor of 10. We used 2 images per GPU and 1,000 steps per epoch, meaning that a random subset of 2,000 images was used for training in each epoch. The learning rate was initially set to 0.001, and the Stochastic Gradient Descent (SGD) optimizer was used.
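The three-stage schedule can be summarised in a short sketch. It assumes a single GPU and assumes that the learning-rate reduction in the last stage is a division by 10; the stage labels and function name are hypothetical and do not correspond to any particular Mask R-CNN implementation.

```python
BASE_LR = 0.001          # initial learning rate given in the text
IMAGES_PER_GPU = 2
STEPS_PER_EPOCH = 1000

# (trainable layers, number of epochs, learning-rate multiplier)
STAGES = [
    ("heads", 40, 1.0),
    ("resnet_stage_4_and_up", 80, 1.0),
    ("all", 40, 0.1),    # assumption: the reduction is a division by 10
]


def stage_at(epoch):
    """Return (trainable layers, learning rate) for a zero-based epoch index."""
    start = 0
    for layers, n_epochs, lr_scale in STAGES:
        if epoch < start + n_epochs:
            return layers, BASE_LR * lr_scale
        start += n_epochs
    raise ValueError("epoch beyond the end of the schedule")


total_epochs = sum(n_epochs for _, n_epochs, _ in STAGES)
images_per_epoch = IMAGES_PER_GPU * STEPS_PER_EPOCH  # single GPU assumed
```

Written this way, the schedule makes the total training length and the number of images seen per epoch easy to check against the description.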
Only one training task was applied to both the second and the third set. This consisted of training the complete NN for 160 epochs. The learning rate was reduced by 2 × 10^−7 during the first two and a half epochs, from 0.0011 to 0.0010. There were also 26 images per GPU and 216 steps per epoch, meaning that every image in the training dataset was used in each epoch. With regard to the optimizer, SGD was again used.
The following results were obtained after the training process had been carried out (see Tables 3 and 4). These results are expressed using AP50, AP75, and AP90 as metrics (AP values between 0 and 1). During the training process, YOLACT++ with ResNet50 obtained the best results with the validation dataset, achieving 99.8%, 98.5%, and 76.5% in AP50, AP75, and AP90, respectively. The first two metrics did not help much as regards choosing an algorithm, but the last allowed us to choose YOLACT as the best method. After using the test dataset, the final results were obtained, and they followed a similar trend. The best model was again YOLACT++ with ResNet101 in both AP50 and AP90, achieving scores of 99.9% and 74.7%, respectively, while in AP75 the best model was YOLACT with DarkNet53, achieving an average precision of 98%.
Since we have two different NNs with larger and smaller backbones, inference time is also a useful criterion for selecting the best NN model, as it helps to determine which combination is the fastest in the inference task. The results are shown in Table 4. The fastest model is the combination of YOLACT with DarkNet53, which achieves 17.2 ms. There is a large gap between the YOLACT and Mask R-CNN models, with YOLACT being more than 50 ms faster on average.
All of these results led to the decision that the NN chosen for the process would be YOLACT with DarkNet53 as a backbone. We chose this NN because we prioritise speed over accuracy (it is 57% faster than the model with the best AP90) and because its AP75 is sufficiently good for our process. The accuracy of the NN per class in our dataset is shown in Table 5. We also show some outdoor recognition examples in which our visual model is used with unknown samples from the four classes; these are provided in Fig. 12. The confusion matrix obtained when using the chosen NN (see Fig. 13) shows that the most confusion occurs between the plastic and glass classes.
The tactile data was collected by performing consecutive maneuvers and recording the images from the DIGIT sensors mounted on the gripper. A human operator changed the pose of the objects for each robotic grip.

Tactile data collection and training for contact detection
Eight objects were used to create the dataset D2 = [φ_1, φ_2, φ_3, ..., φ_n], where n is the total number of images. Approximately the same number of images was obtained for each object. Eight objects were sufficient to form the dataset because they differed in terms of size, shape, deformation, weight, texture, and material (see Fig. 14). We discarded objects that are completely deformable or very narrow, such as plastic bags or cardboard sheets, owing to the limited size of the fingertips and of the tactile sensing area.
We trained our CNN (see Section 3.3.1) separately for each sensor, units A and B, because the images extracted from the DIGIT sensors are not identical (see Fig. 7). We achieved better results with individual models than with a single model for all the sensors. Two datasets [D2_A, D2_B] were, therefore, designed (see Table 6), and each dataset was split into three subsets (training, validation, and test).
A transfer learning strategy was applied in order to carry out the training phase. The layers up to the "mixed5" layer were set as non-trainable, while the remaining layers, including the head, were set as trainable. A Root Mean Squared Propagation (RMSProp) optimizer was used with a learning rate of 3 × 10^−6, a batch size of 24, and a binary cross entropy loss. Once both models had been trained, the evaluation process was executed on the test dataset. The results are expressed in terms of the accuracy metric previously described in Eq. (5). Table 7 shows the accuracy values for both sensors and three different thresholds (T_contact^{sensor unit}). Although these values are very similar, T_contact^{sensor unit} = 0.5 was chosen in order to prevent wrong detections resulting from the hysteresis of the DIGIT sensors. This hysteresis is produced when the elastomer is recovering its initial shape after the contact.
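The threshold comparison can be illustrated with a minimal sketch that scores a set of contact probabilities against their labels at several thresholds. The probabilities and labels below are invented for the example, and the rule that a probability equal to the threshold counts as contact is an assumption.

```python
def accuracy_at_threshold(probabilities, labels, threshold):
    """Accuracy when a probability at or above the threshold is read as contact."""
    correct = sum(
        (p >= threshold) == bool(label)
        for p, label in zip(probabilities, labels)
    )
    return correct / len(labels)


# Hypothetical classifier outputs and ground-truth labels (1 = contact).
probabilities = [0.10, 0.45, 0.55, 0.90]
labels = [0, 0, 1, 1]

results = {
    t: accuracy_at_threshold(probabilities, labels, t)
    for t in (0.4, 0.5, 0.6)
}
```

A lower threshold reads borderline outputs as contact, which is the kind of wrong detection that sensor hysteresis can produce just after the fingers release an object.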

Tactile data collection and training for slip detection
This section describes the tactile dataset D3 for the slip detection task, the training or tuning phase, and the results with the test set.
As with D2, n denotes the total number of images in D3. As slip detection is a different task from contact detection, it was necessary to create this new dataset in order to capture the tactile images generated when the objects slip. The slip class images were generated by applying three external instabilities to the object, while the stable class images were regular images with no disturbances. As shown in Fig. 15, one rotational and two translational movements were applied to each object. The goal was to detect not only slippage, but also any other possible perturbations during the robot manipulation. These perturbations could result in the object falling to the ground, thus causing a collection failure.
Fig. 15: Translational (red and green) and rotational (blue) perturbations produced by a human operator. These movements include all perturbations that the object could undergo during the manipulation task
Six objects (b, c, e, f, g, and h in Fig. 14) were selected from the set of eight used for contact detection. They were sufficiently varied to form the training dataset, which was divided into two subsets: training (70%) and validation (30%) (see Table 8). Objects a and d were discarded from D3 in order to avoid repeating cylindrical shapes. The number of images obtained for the six objects was roughly balanced, and the images were different from those in the contact detection dataset.
Three novel objects were also added to D3 for testing (see Fig. 16). In this way, we can evaluate the generalisation capabilities of the proposed algorithms with previously unseen objects. No training process was required in order to calculate the brightness of an image. The CNN did, however, require this training process. The low number of parameters allowed us to train the full network from scratch. An Adam optimizer [47] was used with a learning rate of 1 × 10^−4, a batch size of 32, and a binary cross entropy loss.
After the training process had been completed, the performance of the model was calculated with the test set. This set was different from the training and validation sets because it was not made up of shuffled images with their respective labels. The test set contained three continuous videos, one per object, to which we applied these three external instabilities, five times each. This signifies that there are 15 instabilities per object, and 45 in total per sensor.
The aim of this experiment was for the models to detect every case of instability as a slip class without detecting false positives. The results are expressed in terms of instability-detection accuracy and inference time in our embedded system. The accuracy values are between 0 and 1 and the total time in each timestamp is calculated by adding the inference time of each sensor.
Tables 9 and 10 show that detecting slippage by calculating the brightness of Ψ images is more accurate and faster than using a CNN to classify this type of image. The reason is that the filterImage function (from Algorithm 2) converts the RGB tactile images into binary images, from which the CNN has more difficulty extracting features. The T_slip^{brightness} threshold values of five and ten achieve the same accuracy value, but with T_slip^{brightness} = 5 the model frequently detects false positives. We consequently chose the brightness method with T_slip^{brightness} = 10 as the slip detector.
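The brightness test itself is simple enough to sketch in a few lines. The version below assumes that brightness is the mean pixel intensity of the binary image on a 0 to 255 scale and that slip is reported when this value exceeds the threshold; the paper does not spell out either detail here, so both are assumptions, as are the function names.

```python
def mean_brightness(image):
    """Mean pixel intensity of an image given as a list of rows of values in 0..255."""
    pixels = [value for row in image for value in row]
    return sum(pixels) / len(pixels)


def detect_slip(binary_image, threshold=10.0):
    """Report slip when the image is brighter than the threshold (assumed rule)."""
    return mean_brightness(binary_image) > threshold


# A stable grasp yields an almost black image; motion lights up some pixels.
stable = [[0] * 4 for _ in range(4)]
moving = [[0] * 4 for _ in range(4)]
moving[1][2] = 255  # one white pixel in a 4x4 image: mean brightness 15.9375
```

Since the method only sums and compares pixel values, it requires no training and little computation per frame.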

Detection and manipulation results in outdoor environments
In this section, we describe the experiments that we carried out in order to test our system in three different outdoor environments, using objects different from those in the previous sections. We then present the promising and reliable results obtained, which show that our system is able to perform the litter collection task. However, prior to running certain tasks on the real robot, it is advisable to simulate them in similar conditions. We specifically simulated the trajectory planning so as to visualise and check the robot arm and gripper movements that are necessary in order to accomplish the robotic grasping task. We did this by building a model of our mobile manipulation robot using ROS and RVIZ. Our simulation comprised the BLUE robot, the UR5e robot arm with the gripper and the RGBD camera mounted on the end effector, along with the DIGIT sensors mounted on the fingertips of the gripper, as shown in Fig. 17.
Fig. 17: Robotic system that allows tasks to be performed in simulated environments
To complete the outdoor experiment, we decided to use four new objects, one from each of the classes: cardboard, plastic, metal, and glass. These objects had not been used to form any of the previous visual or tactile datasets and were employed with the objective of testing the generalisation capabilities of our system with unknown objects for the waste collection task. Testing our system in different environments brought this experiment closer to reality. We, therefore, chose three new environments that had not been included in our previous datasets in order to test the generalisation of our proposal in unforeseen situations (see Fig. 18). The following images show the results obtained for the YOLACT detection and grasping points (Section 3.2). Once the coordinates of these points have been referenced to the camera, it is necessary to obtain their global position, which is expressed with respect to the robot base. Some of these coordinates are shown in Fig. 19.
Before carrying out the main outdoor experiment, two sub-experiments were carried out in order to establish the variables of the tactile manipulation module. The goal of the first sub-experiment was to demonstrate that sending a closing command through ROS to the gripper when slippage was detected would make the grasping more stable. In the second, we established a variable to denote the number of contact detections predicted by the model that are required in order to consider that the object had been grasped. Finally, in the main outdoor experiment, we evaluated our system by running our pipeline five times per object in each environment, for a total of 60 object pick-ups.
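Re-expressing a grasping point from the camera frame in the robot base frame amounts to applying a homogeneous transform. The sketch below uses a made-up 4x4 matrix (a 90 degree rotation about z plus a translation) purely for illustration; on the real robot this transform would come from the arm kinematics and the camera calibration, and the function name is hypothetical.

```python
def transform_point(transform, point):
    """Apply a 4x4 homogeneous transform (nested lists) to a 3D point."""
    x, y, z = point
    return tuple(
        transform[i][0] * x + transform[i][1] * y + transform[i][2] * z + transform[i][3]
        for i in range(3)
    )


# Hypothetical base-from-camera transform: rotate 90 degrees about z, then translate.
BASE_FROM_CAMERA = [
    [0.0, -1.0, 0.0, 0.5],
    [1.0,  0.0, 0.0, 0.0],
    [0.0,  0.0, 1.0, 0.3],
    [0.0,  0.0, 0.0, 1.0],
]

grasp_point_camera = (0.1, 0.2, 0.4)  # metres, in the camera frame
grasp_point_base = transform_point(BASE_FROM_CAMERA, grasp_point_camera)
```

With these example values the point maps to approximately (0.3, 0.1, 0.7) in the base frame.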
In order to demonstrate that grasping required compensation, we filled one object with a small quantity of water. The idea was to perform the grasping task, lift the object, and evaluate the slip detections with and without grasping compensation. Figures 20 and 21 show that, when grasping compensation is applied, only one slip event is detected and the object does not fall. We did not use the compensation algorithm provided by the gripper software, but instead created our own.
In order to lift the object, our system first needs to know whether or not the object is being grasped. An object is considered to be grasped when a certain number of contact detections have been completed. We, therefore, ran the entire perception and tactile system with each of the four objects, using different numbers of contact detections. The most suitable number for each object was decided by producing the same plots as those shown in Fig. 20. Figure 22 shows that a stable grasp can be achieved with a threshold of 3 contact detections for the cardboard, plastic, and metal objects, and 4 contact detections for the glass object, because glass objects usually weigh more.
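The grasp-confirmation rule can be sketched as a counter over the stream of per-frame contact decisions. The per-class thresholds are those reported above; whether the detections must be consecutive is not stated, so this sketch assumes a cumulative count, and the names used are hypothetical.

```python
# Contact detections required before an object is considered grasped.
REQUIRED_CONTACTS = {"cardboard": 3, "plastic": 3, "metal": 3, "glass": 4}


def frame_when_grasped(contact_flags, object_class):
    """Index of the frame at which the grasp is confirmed, or None if it never is.

    contact_flags holds one boolean per tactile frame (True = contact detected).
    Assumption: detections are counted cumulatively, not consecutively.
    """
    required = REQUIRED_CONTACTS[object_class]
    count = 0
    for index, contact in enumerate(contact_flags):
        if contact:
            count += 1
            if count >= required:
                return index
    return None


flags = [False, True, True, False, True, True]
```

For the same stream of detections, the glass object is confirmed one contact later than the other classes, which delays lifting until the grip on the heavier object is firmer.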
Once the parameters had been established, the final experiment could be carried out. The results are expressed as the accuracy value that denotes the percentage of successful grasps, which we call the Collection Success Rate (CSR).
Fig. 22: Slip detection graphic for each object. These plots show that slip is produced only once after compensating the grasping opening
A grasp is considered successful when it meets all of the following conditions: the object is segmented correctly, the grasping points are located on the object, the contact is detected, and the robot is able to lift and move the object to the corresponding box without dropping it. The success rate of collecting an object at the first attempt is 80%. However, a failed first attempt can be followed by a successful second one, since grasps are independent probabilistic events. That is, the pipeline is launched again if needed, and the object may have changed its position when falling to the floor on the previous attempt.
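The success criterion and the CSR can be written down compactly. The sketch below treats an attempt as successful only if all four conditions listed above hold; the class, field, and function names are hypothetical, and the sample attempts are invented for illustration rather than taken from the experiments.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    segmented_correctly: bool
    grasp_points_on_object: bool
    contact_detected: bool
    delivered_without_drop: bool

    def successful(self):
        """An attempt succeeds only if every condition is met."""
        return (
            self.segmented_correctly
            and self.grasp_points_on_object
            and self.contact_detected
            and self.delivered_without_drop
        )


def collection_success_rate(attempts):
    """Fraction of successful attempts, between 0 and 1."""
    return sum(a.successful() for a in attempts) / len(attempts)


# Invented example: four clean pick-ups and one failed contact detection.
attempts = [Attempt(True, True, True, True)] * 4 + [Attempt(True, True, False, False)]
csr = collection_success_rate(attempts)
```

Recording the individual conditions for each attempt also makes it possible to break failures down by cause.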
As can be seen in Table 11, the overall CSR varied from 75% in the worst scenario to 85% in the best one. Our system works better on flat surfaces (tiled pavements) or on surfaces that the gripper can pass through in order to grasp the object (grass). Nonetheless, the system also attains promising results on irregular surfaces on which the gripper could collide with the ground. Good results are also obtained when the object and the environment have the same colour (metal object and grass environment). With regard to the results in terms of object class (see Table 12), our system attained a CSR of 60% in the worst case and 93% in the best one when picking up the objects at the first attempt. The drop in the CSR that occurred with glass objects was owing to their transparency, which makes object segmentation and grasping point calculation more difficult. The grasping pose is, therefore, not precise, and the gripper fails when picking up the object.
Figure 23 shows each step of an example of one litter collection attempt with a cardboard object in a tiled pavement environment. In this example, our system detects and locates the item of litter as belonging to the cardboard class, calculates the grasping pose correctly, and performs the manipulation tasks in order to collect the litter.
The most common errors are produced as a result of the wrong locations of the grasping points, failed contact detections, low quality of the point clouds produced by the RGBD camera, or wrong object detection and segmentation.
Our system failed at the first attempt on 18% of occasions. Figure 24-a shows the distribution of error types that led to the collection failure. Figure 24-b shows the CSR of each module independently.

Conclusions
In this work, we present a robotic system that is able to collect household objects in outdoor and natural environments. This system was created by implementing three main modules: litter detection, grasping point calculation, and tactile manipulation. These three modules together allow our system to obtain promising and reliable results with unknown objects and in outdoor scenarios.
We have implemented two detection and control methods: one for contact detection, which is based on a CNN, in order to know when the object is being grasped, and another for slip detection, which is based on morphological operations, in order to readjust the gripper opening. Moreover, we have used well-known neural networks for the recognition and segmentation of objects. The result is then combined with 3D techniques to extract the surface features of those objects. We have also carried out extensive and rigorous experimentation in order to adjust our visual model to different objects in different scenarios. We first carried out offline experimentation (Sections 4.2, 4.3 and 4.4) with real recorded data, after which we tested our system in online mode with real-time data (Section 4.5). When all the modules work together, our system is able to perform the complex task of picking up household objects in outdoor scenarios by combining the information from each module. This assumes that the largest object that the system can handle measures 118 mm, for which we can tolerate a rotation error of up to 13° with respect to the x_C axis of the reference frame in Fig. 1 (right) and a translation error of 3% of the object size (approx. 3.5 mm) with respect to the y_C and z_C axes of the same frame.
Overall, our system performs correctly when collecting unknown objects in new environments, obtaining high accuracy when making more than one attempt. Nonetheless, our system has some limitations that we present as future lines of work. For example, transparent objects cause several problems for the detection and grasping point calculation modules, owing to the poor quality of the point cloud obtained by the RGBD camera. This issue could be solved by a complementary object reconstruction process. Another limitation is the detection of dynamic objects, namely, objects whose position or orientation changes while the detection is taking place, for example, due to the wind. This could be solved by performing a tracking operation in order to update the object pose in real time. At present, we address this issue by executing the waste detection in a loop in order to update the object's pose.
Supplementary information. The authors report more examples of robotic grasps for other litter objects in Video1 and Video2.

Fig. 1: Home, navigation and detection poses of BLUE robot. (Left) UR5e pose when BLUE is in home pose. (Bottom-left) Navigation pose when BLUE is moving around. (Right) Detection pose when BLUE is near an item of litter

Fig. 4: YOLACT architecture with an example of the input and output from the litter detection task

Fig. 7: Contact (a,c) and no contact (b,d) sensor images from DIGIT units A (a,b) and B (c,d). The images from both sensors are not identical because both sensors were manufactured by hand

Fig. 8: Architecture of our θ_contact^{sensor unit}. It is based on the well-known InceptionV3 architecture, although the final layers have been modified

Fig. 9: Example of pre-processing steps employed to obtain the final image before the classifying process.

Fig. 10: Architecture of CNNPrediction, which is based on a simple CNN architecture for image classification

Fig. 11: Variability of the environments in our dataset. They show different visual and physical features

Fig. 12: Examples of outdoor litter detection. These examples contain one object from each class (cardboard, plastic, metal, and glass) in different environments

Fig. 14: Objects used to create the dataset D2, which was used to train, validate and test the contact detection method and also to train the slip detection CNN

Fig. 16: Objects in our test set. These objects differ from those in the training set in terms of size, textures, deformability, etc

Fig. 18: Objects (one from each class) and environments (3 different environments) used to test our robot in real outdoor experiments

Fig. 20: Number of slip detections while the robot arm is lifting the object. This number is smaller when applying grasping compensation

Fig. 21: (a, b, c) Object falling without grasping compensation. Slip state is produced and detected (unstable). (d, e, f) Object does not fall with grasping compensation. Slip state is detected and compensated by closing the gripper

Fig. 23: Example of picking up a cardboard object in a tiled pavement environment

Algorithm 2: Slip-Detection algorithm
function slipDetection(image sensor sequence, method)

Table 1: Number of total samples of each dataset for all classes, sets, and sensors

Table 2: Distribution of dataset D1

Table 3: Results of validation dataset using AP

Table 4: Results of test dataset using AP and inference time. The inference time is calculated as the mean over every image detected in the test set

Table 5: Results of test dataset using AP as metrics. These results are calculated per object class and using YOLACT NN with DarkNet53 as a backbone

Table 6: Distribution of images per sensor unit [D2_A, D2_B] considering

Table 7: Accuracy (Acc) values for sensors A and B, and T_contact^{sensor unit} = 0.4, 0.5, 0.6. Accuracy values are between 0 and 1

Table 9: Results obtained using CNN method for sensors A and B, and T cnn

Table 10: Results obtained using brightness method for sensors A and B, and

Table 11: CSR values obtained splitting the results in terms of the environment (CSR-Env). These values are between 0 and 1

Table 12: CSR values obtained after splitting the results in terms of the object class (CSR-Obj). These values are between 0 and 1