Visual Servoing of Unknown Objects for Family Service Robots

To meet the growing demand for family service robots that perform housework, this paper proposes a robot visual servoing scheme based on randomized trees that completes visual servoing tasks on unknown objects in natural scenes. Here, "unknown" means that no prior information on the object model, such as a template or database of the object, is available. First, the user randomly selects the object to be manipulated before the visual servoing task is executed; the raw image of the object is then captured and used to train a randomized tree classifier online. Second, the current image features are computed with the trained classifier. Finally, the visual controller is designed from the image feature error, defined as the difference between the desired and current image features. Five visual positioning experiments on unknown objects, including 2D rigid objects and 3D non-rigid objects, are conducted on a MOTOMAN-SV3X six degree-of-freedom (DOF) manipulator robot. Experimental results show that the proposed scheme can effectively position an unknown object in complex natural scenes with occlusion and illumination changes. Furthermore, the developed robot visual servoing scheme achieves excellent positioning accuracy, with a positioning error within 0.05 mm.


Introduction
Accurate positioning of objects is the crux of real-world applications of family service robots. A basic function of a service robot is to operate, grasp, and move a specific object freely selected by the user. The first step toward this task is to make the robot position an object with high accuracy and strong robustness. At present, most visual servoing approaches have been widely and successfully applied where the object to be manipulated is known beforehand [1][2][3][4] or the scene is known beforehand [5]. Unlike industrial robots, family service robots usually work in highly unstructured environments [6,7] and need more intelligence than industrial robots to perform given tasks [8]. In such environments the object may be "unknown", whereas these methods need to acquire prior knowledge of the object's properties in advance. As a result, they are unable to handle unknown objects.
In view of the complexity of establishing non-rigid body models, some researchers have recently proposed model-free methods [18][19][20][21][22][23]. The main idea of these methods is to represent the target through artificial markers, which simplifies image feature extraction. However, such methods are labor-intensive and inefficient. With the development of deep learning, some researchers have proposed deep learning algorithms [19,20] to identify the category of an unknown object and obtain its best grasping position. However, these methods need prior knowledge of the category of each unknown object acquired by offline learning, and they do not consider robot visual servoing control of unknown objects in complex natural scenes (such as those with occlusion and illumination changes). As a result, these methods are difficult to apply to natural scenes.
In this work, we propose a robot visual servoing scheme based on randomized trees that avoids the above-mentioned shortcomings. The main objective and feature of the proposed approach is to complete visual servoing tasks on unknown objects in various complex natural scenes, toward more practical solutions and applications. As opposed to prior work that uses prior knowledge of objects to establish their models [13][14][15][16][17], the proposed method does not need any prior information on object models, such as a template or database of the object. The user can randomly select an object to be manipulated before the visual servoing task is executed. Furthermore, inspired by and improving upon deep learning approaches [19,20], the proposed method performs well in complex natural scenes (such as those with occlusion and illumination changes), which are common in manipulation domains.
The main contributions and novelty of this work can be outlined as follows:
- A visual servoing method based on randomized trees for unknown objects in natural scenes is proposed. It does not require any prior knowledge of the object template or of its natural scene, even though many objects may lie on the robot operating platform during positioning. The "unknown object" is randomly specified according to the needs of the user; before the servoing task starts, the required data for the specified object are acquired online, so no prior knowledge of the geometric model of the object is needed.
- The proposed scheme has been tested and evaluated on a real MOTOMAN-SV3X six degree-of-freedom (6-DOF) manipulator robot. Experimental results show that the developed scheme can effectively position an unknown rigid or non-rigid object in many challenging natural scenes with occlusion and illumination changes, and with high positioning accuracy.
The remainder of this paper is organized as follows. After discussing related work in Section 2, a brief description on the system structure is given in Section 3. The specific implementation of the three main parts of the system, including the construction of the randomized tree classifier, the computation of the current image features, and the design of the visual controller, is described in detail in Section 4. Detailed experimental results and analyses are provided in Section 5. Finally, summary and future works are presented in Section 6.

Related Work
Research on the visual servoing of objects (including rigid bodies and non-rigid bodies) can be roughly classified into model-based methods and model-free methods. In model-based methods, the core is to build a model of the object and estimate its parameters [11, 13-17, 24, 25]. Gratal et al. [11] propose virtual visual servoing based on a saliency map to achieve visual servoing of unknown objects in natural scenes. Using the approaches in [25,26], a robot can pick up or place a random object on a desktop, but it cannot operate on a specified object and can only remove all objects on the desktop one by one, which may cause an ornament to be cleaned up by mistake. Jadav et al. [15] used a comprehensive dynamic equation to express the motion of the system in order to manipulate a deformable object, employed multiple manipulators (or a claw with multiple fingers) to change the object's shape, and proposed an optimization scheme for the motion of each manipulator. The effectiveness of that method was verified through two sets of simulation experiments. However, it relies heavily on model establishment and parameter estimation for the non-rigid body, so it is not suitable for unknown objects.
In model-free methods, the key is the representation of targets. Classical methods represent objects through artificial marking [18][19][20][21][22][23]. Using the approach in [18], a robot can complete the task of picking up and placing the unknown object by marking the grasping position of the specified object in advance. But the approach is very difficult to apply to realistic unknown natural environments in the family services. Newer methods address the challenge by using deep learning algorithms. A deep learning algorithm [19,20] is used to recognize the category information of unknown objects and obtain the optimal grasping position of the objects. The approach does not need to obtain the geometric model information of the object prior to the visual servoing task execution.
However, the approach needs to acquire the prior knowledge of the category information of each unknown object by offline learning, and does not consider the problem of the robot visual servoing control for the unknown object in complex natural scenes (e.g. occlusion and illumination changes).
In this work, to solve the problems of both model-based and model-free methods, we propose a robot visual servoing scheme for unknown objects in various complex natural scenes. Experimental results show that the proposed scheme can effectively position an unknown object in complex natural scenes with strong robustness to occlusion and illumination variations and with a small positioning error, within 0.05 mm. This shows that the proposed visual servoing scheme can further improve the flexibility of visual servoing operations.

Problem Description and Overview of System Structure
Robot visual servoing aims to control the relative pose of the robot and the object using visual information, which allows the robot to work in dynamic and uncontrolled environments [26]. There are two kinds of robot visual servoing structures: position-based visual servoing (PBVS) and image-based visual servoing (IBVS). This paper adopts the image-based visual servoing structure with an eye-in-hand configuration, in which the camera is mounted on the robot end-effector so that it moves along with the robot. The relative pose of the robot and the object is represented as the difference between the current and desired image features, so the robot visual servoing task can be defined as minimizing this difference. In other words, under this eye-in-hand configuration and visual servoing structure, both robot visual positioning and visual tracking can be viewed as a positioning problem in image feature space. Therefore, how to detect the object in complex scenes, how to calculate the feature of the object in the current image, and how to design the visual controller are the three key issues for achieving a robot visual servoing mission in natural scenes.
To solve the above problems, this paper proposes a robot visual servoing scheme based on a randomized tree classifier for natural scenes. The overall structure of the proposed scheme is shown in Fig. 1. The system mainly comprises three parts: building the randomized tree classifier, computing the current image feature, and designing the visual controller. The basic idea of the algorithm is as follows. First, the user randomly selects an object to be manipulated before the visual servoing task is executed; the raw image of the object is then captured by the camera and used to generate a number of sample data sets for building the randomized tree classifier. Second, the current image features, represented as the 2D pixel coordinates of the object's image centroid, are computed with the previously built classifier. Finally, the visual control input is calculated from the image feature error and applied to the robot to achieve visual positioning of the unknown object. In Fig. 1, f_d represents the desired image features and f denotes the current image features (shown by the red circle).
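The control loop just outlined can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: `capture_frame`, `detect_feature`, and `send_velocity` are hypothetical placeholders for the camera interface, the classifier-based feature detector, and the robot command channel, and the plain proportional law here stands in for the controller designed in Section 4.

```python
import numpy as np

def servoing_loop(capture_frame, detect_feature, send_velocity,
                  f_d, K_p=0.04, tol=2.0):
    """Drive the robot until the detected image feature reaches f_d (pixels)."""
    while True:
        frame = capture_frame()            # current camera image
        f = detect_feature(frame)          # 2D centroid (u, v), or None if lost
        if f is None:
            send_velocity(np.zeros(2))     # object not detected: hold still
            continue
        e = f_d - f                        # image feature error e = f_d - f
        if np.linalg.norm(e) < tol:        # close enough: stop and finish
            send_velocity(np.zeros(2))
            return
        send_velocity(K_p * e)             # proportional visual control input
```

In the real system the command would be mapped through the image Jacobian into end-effector velocities rather than applied directly in image space.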
The basic principles of the three main parts of the servoing system are described individually in detail in the next section.

Proposed Solution
In this section, detailed implementation of all functions highlighted in Fig. 1 will be described according to the order listed above, which includes the construction of the randomized tree classifier, the computation of the current image features, and the design of the visual controller.

Construction of the Randomized Tree Classifier
The main function of this module is to construct a randomized tree classifier for recognizing and detecting unknown objects randomly selected by the user. The whole process is shown in Fig. 2, which can be further divided into the following four steps.
Step 1. Selecting the unknown object to be manipulated.
Firstly, the object selected randomly by the user will be put on the training station whose background is clean without any clutter, and the robot is also moved to the training station. Then, the desired image can be obtained through segmenting the current image captured by the camera mounted on the robot's end-effector. If the object is a 3D non-rigid body, its rough 3D model can be obtained through the ImageModeler software, and then the 3D object image will be obtained using the above method.
Step 2. Preprocessing. The main purpose of the preprocessing is to get the gray image of the desired image.
Step 3. Extracting the object features and generating the view sets. Stable affine-invariant features are extracted from the preprocessed image. Patches centered at each feature point are then obtained. These patches form the view sets of the image feature points of the object, or object view sets for short.
Step 4. Establishing the randomized tree classifier. A randomized tree classifier can be built using these object view sets.
The detailed implementation of feature extraction, object view sets generation and randomized tree classifier establishment are described as follows.

Feature Extraction
The characteristics of the image features used in the control loop, especially for an image-based visual servoing structure, directly affect the stability and robustness of a robot visual servoing system, and they are one of the important factors for robotic systems to be successfully applied in complex environments. Therefore, this paper adopts a two-level feature extraction method to obtain stable affine-invariant features. First, the LOG-FAST feature extraction method is used to rapidly extract features from the gray image of the raw object: the LOG operator performs Gaussian filtering and image sharpening to suppress image noise, and the FAST-9 operator then extracts features in different scale spaces.
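As an illustration of this first level, the sketch below implements a Laplacian-of-Gaussian filter and a simple local-maxima detector in pure NumPy. This is a hedged stand-in, not the paper's exact implementation: the FAST-9 corner test is approximated here by taking local maxima of the LoG response, and the kernel size and sigma are illustrative choices.

```python
import numpy as np

def log_filter(img, sigma=1.6, ksize=9):
    """Convolve with a Laplacian-of-Gaussian kernel (smoothing + sharpening)."""
    ax = np.arange(ksize) - ksize // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    log = (r2 - 2 * sigma**2) / sigma**4 * np.exp(-r2 / (2 * sigma**2))
    log -= log.mean()                         # zero-DC kernel: flat regions -> 0
    pad = ksize // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(ksize):                   # direct correlation (kernel is symmetric)
        for dx in range(ksize):
            out += log[dy, dx] * padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def local_maxima(resp, thresh, radius=3):
    """Keep points whose |response| is a local maximum above thresh."""
    pts = []
    h, w = resp.shape
    a = np.abs(resp)
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            win = a[y - radius:y + radius + 1, x - radius:x + radius + 1]
            if a[y, x] >= thresh and a[y, x] == win.max():
                pts.append((x, y))
    return pts
```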
To further obtain affine-invariant stable features, a selection step is needed after the preliminary extraction. The main steps are as follows: Step 1. Generating M new images by applying affine transformations to the grayscale image. An affine transformation is the combination of a non-singular linear transformation matrix A = R_θ R_φ^{-1} S R_φ and a translation vector t = [t_x, t_y]^T, where R_θ and R_φ are the rotation matrices corresponding to the angles θ and φ, both within the range [-π, π]; S = diag{λ_1, λ_2} is the image scaling matrix, with λ_1, λ_2 within the range [0.2, 1.5]; and t_x, t_y are within the range [0, 2]. The M new images are generated with randomly selected parameters θ, φ, λ_1, λ_2, t_x, t_y, and white noise is added to each generated image.
Step 2. Determining affine-invariant stable features. First, the LOG-FAST feature extraction method is used to extract features on the M new images. Then, the inverse transformation is applied to map the extracted features back to the original image. Finally, the frequency with which each recovered feature successfully matches a feature of the original image is computed. The features with the top N frequencies are regarded as the "stable" features, illustrated with red circles in Fig. 3.
These features form a feature set K = {k_1, k_2, …, k_N}, where k_1 to k_N are the finally determined stable features. Each feature is tagged to denote a class, and the classes are denoted C = {c_1, c_2, …, c_N}. These stable features are then used to construct the view sets employed for building the randomized tree classifier.
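Steps 1 and 2 above can be sketched as follows, assuming uniform sampling of the stated parameter ranges; `match_counts` is a hypothetical array holding, for each original feature, how often it was successfully re-matched across the M warped images.

```python
import numpy as np

def random_affine(rng):
    """Sample A = R_theta * R_phi^-1 * S * R_phi and t within the paper's ranges."""
    theta, phi = rng.uniform(-np.pi, np.pi, size=2)
    lam = rng.uniform(0.2, 1.5, size=2)        # scaling lambda_1, lambda_2
    t = rng.uniform(0.0, 2.0, size=2)          # translation t_x, t_y
    def rot(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[c, -s], [s, c]])
    A = rot(theta) @ np.linalg.inv(rot(phi)) @ np.diag(lam) @ rot(phi)
    return A, t

def warp_points(pts, A, t):
    """Apply the affine map to 2D points (one point per row)."""
    return pts @ A.T + t

def select_stable_features(match_counts, N):
    """Step 2: keep the N features matched most often across the M warped images."""
    order = np.argsort(np.asarray(match_counts))[::-1]
    return sorted(order[:N].tolist())
```

Warping the images themselves (rather than points) would use the same A and t with an image-warping routine such as OpenCV's warpAffine.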

Establishment of View Sets
When a new image frame is captured during robot visual servoing, the most critical problem is to determine whether the current image contains the object and, if so, where the object is located in the image. The first step toward solving this problem is to detect the features of the object, so the view set of each feature must be established after the stable features are obtained. Feature patches of size 32×32 are extracted from the M affine-transformed images. The view sets consist of all extracted feature patches, so their total size is M×N. All patches with the same feature index form a small collection; thus, for the N stable features, N small sets are constructed: V_n = {v_n1, v_n2, ⋯, v_nm}, 1 ≤ n ≤ N, where each set V_n includes m elements and is the view set of one feature. Figure 4 shows the view set of a certain feature, where the different elements show the same feature at different locations under different perspectives.
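A minimal sketch of the view-set construction, assuming the warped images and the warped feature coordinates are already available; the function names are illustrative, not the paper's.

```python
import numpy as np

def extract_patch(img, center, size=32):
    """Cut a size x size patch centred on a feature point (None near borders)."""
    x, y = center
    h = size // 2
    if x - h < 0 or y - h < 0 or x + h > img.shape[1] or y + h > img.shape[0]:
        return None
    return img[y - h:y + h, x - h:x + h]

def build_view_sets(warped_images, warped_features):
    """View set V_n: the patches of feature n across all M warped images."""
    N = len(warped_features[0])
    view_sets = [[] for _ in range(N)]
    for img, feats in zip(warped_images, warped_features):
        for n, pt in enumerate(feats):
            patch = extract_patch(img, pt)
            if patch is not None:          # skip features warped off the image
                view_sets[n].append(patch)
    return view_sets
```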

Establishment of Randomized Tree Classifier
A binary decision tree is adopted in this system: it has one root node, each internal node is split into two child nodes, and the splitting continues recursively until a node has no branch; such a bottom node is called a leaf node. The view sets are put into the root node, and the patches of each view set traverse from the root to a leaf. At every node, the patch is tested to decide which child node it is passed to. Since this system works on gray images, the information gain of the gray-value feature is the largest, so the gray value is selected as the classification criterion in this paper. The test at each node can be written as: the patch goes to the left child if I(p, m_1) − I(p, m_2) ≤ τ, and to the right child otherwise, where τ is a default threshold and I(p, m_1), I(p, m_2) are the gray values of two pixels selected at random in the patch p entering the tree. When all the patches have passed through the randomized tree, the number of each class's patches reaching a leaf node and the total number of patches reaching that leaf are counted; their ratio is the posterior probability that a feature reaching this leaf belongs to that class. If the number of the ith feature class's patches reaching the leaf node is m_i and the total number of patches reaching the leaf is M, the posterior probability that the feature belongs to the ith class can be represented as P_η(Y(p) = c_i) = m_i / M, where η denotes the leaf node. The leaf node stores the posterior probability P_{η(l, p)}(Y(p) = c) of each class c. In general, because the sample set is large, a single randomized tree classifier can hardly guarantee high accuracy. Statistically, multiple randomized tree classifiers can compensate for this shortcoming. Therefore, several trees are built, which also speeds up object detection.
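A minimal randomized tree along these lines can be sketched as below; the depth and the random pixel-pair tests are illustrative choices rather than the paper's exact settings.

```python
import numpy as np

class RandomizedTree:
    """Minimal randomized tree over gray patches (illustrative depth/tests)."""
    def __init__(self, depth, patch_size, n_classes, rng, tau=0):
        self.depth, self.n_classes, self.tau = depth, n_classes, tau
        # one random pixel-pair test (m1, m2) per internal node, in heap order
        self.tests = rng.integers(0, patch_size, size=(2**depth - 1, 2, 2))
        self.counts = np.zeros((2**depth, n_classes))   # per-leaf class counts

    def _leaf(self, patch):
        node = 0
        for _ in range(self.depth):
            (y1, x1), (y2, x2) = self.tests[node]
            # node test: compare gray values of two random pixels against tau
            go_right = int(patch[y1, x1]) - int(patch[y2, x2]) > self.tau
            node = 2 * node + 1 + int(go_right)
        return node - (2**self.depth - 1)

    def train(self, patches, labels):
        for p, c in zip(patches, labels):
            self.counts[self._leaf(p), c] += 1

    def posterior(self, patch):
        cnt = self.counts[self._leaf(patch)]
        total = cnt.sum()
        # posterior P(Y(p) = c_i) = m_i / M at the reached leaf
        if total > 0:
            return cnt / total
        return np.full(self.n_classes, 1.0 / self.n_classes)
```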
The next step is to use the built randomized tree classifier to detect objects and to compute the current image features on the current image captured by the camera in the natural scenes in real time.

Computing the Current Image Feature
The main function of this module is to detect the object in the current image and to compute the features of the current image. The whole process can be shown in Fig. 5.
After all trees are traversed, the mean posterior probability of each class is computed as p_c(p) = (1/L) Σ_{l=1}^{L} P_{η(l, p)}(Y(p) = c), where L is the number of trees, P_{η(l, p)} denotes the posterior probability stored at the leaf node reached by the patch p in the lth tree T_l, c is the feature class, p_c(p) is the mean posterior probability of class c, and D_c is a default threshold (60% in the later experiments). When the mean posterior probability is larger than D_c, the feature of the patch is classified as class c, and the center of the patch is matched to the feature of class c in the desired image. Otherwise the patch belongs to the background or is a misclassification and is discarded. In this way the match between the current features and the cth class features is established. When the number of matches reaches a certain threshold, the current image is determined to contain the object. To calculate the current image feature, the RANSAC algorithm is used to estimate the homography matrix H, and H is then used to calculate the centroid coordinate of the object, which is the current image feature. In Fig. 5, the centroid of the object image is represented by a red circle in the picture named "current feature".
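The classification and centroid computation can be sketched as follows. The trees are represented abstractly here as callables returning per-class posteriors, and in practice the homography H would be estimated with RANSAC (e.g. OpenCV's findHomography); both conventions are assumptions for illustration.

```python
import numpy as np

def classify_patch(tree_posteriors, patch, D_c=0.6):
    """Average leaf posteriors over all trees; accept the best class above D_c.

    tree_posteriors: callables mapping a patch to a per-class posterior array.
    Returns the winning class index, or None for background/misclassification.
    """
    p = np.mean([t(patch) for t in tree_posteriors], axis=0)
    c = int(np.argmax(p))
    return c if p[c] > D_c else None

def object_centroid(H, desired_centroid):
    """Map the desired-image centroid through the homography H (3x3)."""
    x = H @ np.append(np.asarray(desired_centroid, dtype=float), 1.0)
    return x[:2] / x[2]        # back from homogeneous coordinates
```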

Design of the Visual Controller
After obtaining the current image features, the visual controller can be designed from the image error to drive the robot to position an unknown object in natural scenes. The image error e is defined as e = f_d − f (4), where f_d and f are the desired and current image features, respectively. For the image-based visual servoing structure used in this paper, the image Jacobian matrix J_im, as shown in Eq. (5), describes the relationship between the change of the features in image space, ḟ, and the spatial velocity ṙ of the robot end-effector in the robot workspace: ḟ = J_im ṙ (5).
Here ṙ is the spatial velocity of the robot end-effector, including linear and angular velocity; in other words, ṙ is the computed visual control input u applied to the robot end-effector, and the change of the image features ḟ determines the evolution of the image feature error e. A simple visual control law u based on the inverse of the image Jacobian matrix can then be designed as shown in Eq. (6).
u = [υ^T ω^T]^T = K_p J_im^+ e (6), where [υ^T ω^T]^T is the camera's (or robot's) velocity, J_im^+ is the pseudo-inverse of the image Jacobian matrix, and K_p is the proportional control gain. The Lyapunov function is defined as V = (1/2) e^T e > 0. Its derivative can be written as V̇ = e^T ė = e^T (ḟ_d − ḟ) (7). For the eye-in-hand configuration used in this paper, the camera is mounted on the robot end-effector and moves along with the robot, so in this camera-robot configuration both robot visual positioning and visual tracking are treated as positioning problems in image feature space; therefore ḟ_d = 0. Then V̇ = −e^T ḟ = −e^T J_im u = −K_p e^T J_im J_im^+ e < 0 (8). For the robot visual control input in any ith direction u_i, Eq. (8) shows that V̇_i < 0. Therefore, the closed-loop system is asymptotically stable according to Lyapunov stability theory, and the robot can be controlled to move to the desired image features.
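The control law of Eq. (6) reduces to one pseudo-inverse multiplication; the sketch below assumes the image Jacobian J_im is already known.

```python
import numpy as np

def ibvs_control(J_im, f_d, f, K_p):
    """Visual control input u = K_p * J_im^+ * e, with e = f_d - f."""
    e = np.asarray(f_d, dtype=float) - np.asarray(f, dtype=float)
    return K_p * np.linalg.pinv(J_im) @ e    # pseudo-inverse handles non-square J
```

With this choice the error decays exponentially whenever J_im J_im^+ is close to the identity, which is the asymptotic stability argument above.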

Experimental Results
Five robot visual positioning experiments, mainly covering visual servoing of 2D rigid objects and non-rigid objects in several natural scenes, are conducted to validate the performance of the proposed robot visual servoing scheme for unknown objects. Figure 6 shows the experimental platform, which contains a CCD camera, a MOTOMAN-SV3X 6-DOF manipulator robot, and a personal computer running the Windows operating system with the OpenCV 2.4.1 software. The CCD camera is mounted on the robot hand claw and moves with it, while its intrinsic and extrinsic parameters are unknown. The image plane size is 1024×768 pixels, so the desired image feature f_d is the center of the image plane, (512, 384) in pixels. To evaluate the performance of the proposed scheme in the real world, five robot visual positioning experiments of two types were performed in complex natural scenes, involving unknown rigid objects and unknown non-rigid objects.
The visual control law shown in Eq. (6) is used, with the gain set to K_p = diag(0.04, 0.04).
The maximum joint speed of the MOTOMAN-SV3X 6-DOF robot used in the experiments can reach 300°/s. For safety, the running speed of the robot is set to a low value of 10°/s. Under the Windows operating system, the average image capturing and processing time per frame is about 0.156 s, which meets the real-time control requirements of the arm robot at low speed. The average processing times of the key components (taking the initial frame images captured in Experiment 5.1.1 as an example; the average processing time of the key components is basically unchanged for the different initial frames captured in each visual servoing task) are as follows:

Robot Visual Servoing of 2D Rigid Objects in Natural Scenes
The purpose of this experiment is to verify the performance of the proposed approach for unknown 2D rigid objects in complex nature scenes, including cases with no-occlusion, with occlusion and with illumination change. The detailed results are provided in Experiments 1-3.

Visual Positioning of the Unknown Object without Occlusion
Visual positioning results are shown in Fig. 7. Figure 7a illustrates the initial frame, where the center coordinate (512, 384) of the image plane, i.e., the desired image feature f_d, is visualized by a red cross. The object is marked by a green rectangular bounding box, and its centroid, visualized by a blue circle, represents the current image feature f(k) = (u, v)^T. The positioning results of the middle frame and end frame are illustrated in Fig. 7b and c, respectively.

Visual Positioning of the Unknown Object with Occlusion
Visual positioning results are shown in Fig. 8. The positioning results of middle frame and end frame are illustrated in Fig. 8b and c, respectively. During the visual positioning, the object is occluded by the black Mobile Hard Disk. The experimental result shows that the robotic system still can detect the object and finish the positioning in spite of the large occlusion.

Visual Positioning with Illumination Variations
The purpose of this experiment is to verify the availability of the proposed approach under overall and local illumination variations. Visual positioning results are shown in Fig. 9. The initial illumination is shown in Fig. 9a. During the visual positioning, the light in the robot workplace is first turned off (Fig. 9b), then a beaming light is added (Fig. 9c), and finally the beaming light is removed (Fig. 9d). The positioning results under these three illumination variations are illustrated in Fig. 9b-d, respectively. Figure 9 shows that the robot can successfully perform visual servoing of an unknown 2D object in a natural scene despite large illumination variations.

Robot Visual Servoing of Non-rigid Objects in Natural Scenes
In order to further verify the effectiveness of the proposed approach for non-rigid objects in natural scenes, another two experiments are conducted, as described below.

Visual Positioning of 2D Non-rigid Objects
The purpose of this experiment is to verify the effectiveness of the proposed approach for 2D non-rigid objects in natural scenes. A sponge is selected as the 2D non-rigid object. The experiment is designed to simulate practical service-robot applications such as folding clothes or operating on soft tissue in surgery. During the visual positioning, the basic shape of the sponge undergoes large changes: it is subjected to a mixture of up and down non-rigid deformations, and the visual positioning results are shown in Fig. 10. At the very beginning of the servoing, the sponge is squeezed upward, with the deformation shown in Fig. 10a. The sponge is then squeezed downward, with the positioning result shown in Fig. 10b. Afterwards, the sponge is folded up as shown in Fig. 10c, until the end of the visual positioning process shown in Fig. 10d. Figure 10 shows that the robot can still successfully perform visual positioning of the 2D non-rigid object.

Visual Positioning of 3D Non-rigid Objects
The purpose of this experiment is to verify the capability of the proposed approach for 3D non-rigid objects. A plush toy is selected as the 3D non-rigid object. As the plush toy has 3D characteristics, its rough 3D model is first built with the ImageModeler software from multiple-view images of the toy. During the visual positioning, the plush toy undergoes non-rigid deformation and its pose is also changed. Visual positioning results are shown in Fig. 11. The initial pose and shape of the plush toy, without any deformation, is shown in Fig. 11a. During the visual positioning, the toy is squeezed (Fig. 11b), its pose is then changed (Fig. 11c), and it continues to be squeezed until the end of positioning (Fig. 11d). Figure 11 shows that the robot can still successfully perform visual positioning of such a 3D non-rigid object.

Further Analysis and Discussion
Positioning trajectory curves in different directions and spaces provide more insight into the visual servoing behavior. During the visual positioning, the pose of the object can be changed freely; in this experiment it is changed late in the servoing, when the robot is near the desired position. Visual positioning results are shown in Fig. 12. The positioning trajectory of the object in the image plane (u, v) is shown in Fig. 12a, and the positioning trajectory of the robot in the workspace (x, y) is shown in Fig. 12f. Positioning error curves in the image plane and in the u and v directions are shown in Fig. 12b, c, and d, respectively. Figure 12 shows that the proposed visual servoing scheme has good convergence performance. Table 1 lists the mean, standard deviation (std), and max of the robot positioning error over 10 different poses of the 2D object without occlusion, where "max" is the ratio between the maximum absolute positioning error over the 10 trials and the motion range in that direction. The CCD camera used in the experiment is an MV-VS078FC-L with an image resolution of 1024×768 and a pixel size of 4.65 μm × 4.65 μm; that is, the physical size of each pixel in the x and y directions is about 0.00465 mm. Table 1 shows that the positioning errors in the x and y directions are about 0.004 mm and 0.048 mm, respectively, and that the largest absolute relative error and the standard deviation are also very small. Therefore, the developed positioning system has high accuracy, with a positioning error within e = √(e_x² + e_y²) = √(0.004² + 0.048²) ≈ 0.05 mm.
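The arithmetic behind the reported accuracy can be checked directly; the pixel size below is the camera's 4.65 μm value quoted in the text, and the function is a simple helper, not part of the paper's system.

```python
import math

PIXEL_SIZE_MM = 0.00465   # MV-VS078FC-L pixel size: 4.65 um = 0.00465 mm

def positioning_error_mm(err_px_x, err_px_y):
    """Combine per-axis pixel errors into a single metric error in mm."""
    return math.hypot(err_px_x * PIXEL_SIZE_MM, err_px_y * PIXEL_SIZE_MM)
```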
As shown in Table 1, together with the five positioning experiments, the results demonstrate that the proposed robot visual servoing approach based on the randomized tree classifier can achieve visual positioning of unknown objects in complex natural scenes with occlusion and illumination variations.

Conclusion and Future Works
This paper proposes a robot visual servoing scheme that positions the robot relative to unknown objects in natural scenes, with family services as the application scenario. Five visual positioning experiments on unknown rigid and non-rigid objects in various natural scenes are conducted on a MOTOMAN-SV3X six degree-of-freedom manipulator robot. Experimental results show that the proposed scheme can effectively position an unknown object in complex natural scenes with strong robustness to occlusion and illumination variations and with a small positioning error, within 0.05 mm. Furthermore, the system does not need any template or database of the object prior to the visual servoing task execution. Once the object is freely selected by the user, all the needed data are obtained online and the robot completes the positioning task automatically.
However, the current method cannot position multiple unknown objects at the same time; the user can specify only one object before performing the visual servoing task. If the current task contains multiple targets, our method has to be applied repeatedly, which becomes time-consuming. In other words, our method is suitable only when the number of targets is small, not when it is large.
As future work, the proposed robot visual servoing scheme will be further extended in two directions: 1) positioning multiple unknown objects by combining multiple-object recognition and detection approaches [27,28]; 2) autonomously grasping unknown objects by incorporating deep reinforcement learning, in view of its strong learning capability.