Encyclopedia of Systems and Control

Living Edition
| Editors: John Baillieul, Tariq Samad

2.5D Vision-Based Estimation

  • Jian Chen
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4471-5102-9_100148-1


2.5D vision-based techniques, also known as hybrid vision-based techniques, provide flexible ways to estimate the range or velocity of moving objects. Information from both the image space (2D) and the Cartesian space (3D) is simultaneously utilized to construct the system state, which overcomes the disadvantages of the traditional visual servoing schemes. These techniques have been widely adopted in motion from structure, structure from motion, and structure and motion problems.


2.5D vision · Hybrid visual servoing · Vision-based estimation · Motion from structure · Structure from motion · Structure and motion


As a class of powerful tools widely utilized in robotics, the 2.5D vision-based techniques arose from visual servoing at the end of the twentieth century (Malis and Chaumette 1999). Visual servoing aims at increasing the flexibility, accuracy, and robustness of a closed-loop robotic system using real-time visual feedback (Hutchinson et al. 1996; Chaumette and Hutchinson 2006; Janabi-Sharifi et al. 2011). Visual servo algorithms differ mainly in how the system state is constructed. There are two traditional schemes: image-based visual servoing (IBVS), which utilizes image space features to construct the closed-loop error system, and position-based visual servoing (PBVS), which employs Cartesian space features. However, both schemes have intrinsic drawbacks.

The 2.5D vision-based techniques were originally proposed to improve the performance of the traditional IBVS and PBVS schemes. Their defining characteristic is the simultaneous use of information from both the image space (2D) and the Cartesian space (3D) to construct the states; hence, the approach is also referred to as hybrid servoing. The 2.5D vision-based control can overcome many shortcomings of IBVS and PBVS because it provides flexible ways to manipulate the translational motion and the rotational motion individually, which greatly facilitates the control development. The classic regulation task of a robotic arm with six degrees of freedom (DoFs) (Malis and Chaumette 1999, 2000) provides an instructive example for understanding the 2.5D vision-based control.

Due to these advantages, the 2.5D vision-based techniques are also widely adopted in another essential problem: vision-based estimation, which aims at identifying unknown key information about an object using visual sensors and has found many applications. As in the control problem, the measurable variables are utilized as feedback to close the loop, and the 2.5D vision-based techniques shape the construction of the states. According to their design goals, estimation problems can be roughly divided into motion from structure (MfS) problems, structure from motion (SfM) problems, and structure and motion (SaM) problems.

2.5D Vision-Based Estimation

The dynamic model used in vision-based estimation problems can be formulated as
$$\displaystyle \begin{aligned} \begin{array}{rcl} \left\{ \begin{aligned} &\displaystyle \dot{s} = f_s(s, \mathbf{v}, \phi) \\ &\displaystyle y_m = g(s) \end{aligned} \right. {} \end{array} \end{aligned} $$
where s(t) is the system state, v(t) is the camera velocity, ϕ comprises some parameters, and fs(⋅) is a function which describes how s(t), v(t), and ϕ determine the evolution of the system state. In (1), ym(t) is the measurable output, and g(⋅) describes the relationship between ym(t) and s(t). In terms of 2.5D vision-based estimation, the system state is generally constructed in the following form:
$$\displaystyle \begin{aligned} s &= [ \underbrace{s_t^T,}_{\mathrm{Translation}} \ \underbrace{s_r^T}_{\mathrm{Rotation}} ]^T{}\\ &= [ \underbrace{p^T,}_{\text{Image }\ \text{space (2D)}} \ \ \underbrace{\alpha, \ \ \ \ s_r^T}_{\text{Cartesian }\ \text{space (3D)}} ]^T \end{aligned} $$
where \( s_t(t) \in \mathbb {R}^3 \) and \( s_r(t) \in \mathbb {R}^3 \) indicate the translational and rotational components, respectively. Specifically, st(t) consists of \( p(t) \in \mathbb {R}^2 \) (the image coordinates of the feature point) and \( \alpha (t) \in \mathbb {R} \) (a factor related to the depth of the feature point). It is clear that p(t) comes from the image space (2D), while α(t) and sr(t) come from the Cartesian space (3D).
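As a concrete illustration, the hybrid state above can be assembled from a feature point and a relative rotation. The sketch below (hypothetical numbers) picks one common choice: α as the log depth ratio and sr as the axis-angle vector uθ extracted from a rotation matrix.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def axis_angle(R):
    """Extract u*theta from a rotation matrix (valid for theta in (0, pi))."""
    theta = np.arccos((np.trace(R) - 1.0) / 2.0)
    u = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * u

def hybrid_state(P, Z_d, R):
    """2.5D state s = [p; alpha; u*theta]: image part (2D) + Cartesian part (3D)."""
    X, Y, Z = P
    p = np.array([X / Z, Y / Z])          # image space (2D): normalized coordinates
    alpha = np.log(Z_d / Z)               # Cartesian space: log depth ratio
    s_r = axis_angle(R)                   # Cartesian space: rotation as u*theta
    return np.concatenate([p, [alpha], s_r])

P = np.array([0.5, -0.2, 2.0])           # feature point in the camera frame (illustrative)
s = hybrid_state(P, Z_d=4.0, R=rot_z(0.3))
print(s)  # [0.25, -0.1, ln 2, 0, 0, 0.3]
```

Other choices of α from the entry (absolute depth Z, ln Z, or the ratio Zd/Z) slot into the same template.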

Motion from Structure

The MfS problem utilizes the known object geometry to estimate the relative motion between the camera and the object, i.e., s(t) is measurable (ym(t) = s(t)) and v(t) remains to be determined. Given that the dynamics fs(⋅) is known, the objective can be achieved by estimating \( \dot {s}(t) \) with an observer of the form
$$\displaystyle \begin{aligned} \dot{\hat{s}} = h(s). \end{aligned} $$
An appropriately designed observer h(⋅) can guarantee that \( \hat {s}(t) \rightarrow s(t) \) and \( \dot {\hat {s}}(t) \rightarrow \dot {s}(t) \). Then, the velocity v(t) can be calculated utilizing the known fs(⋅).

The two critical requirements mentioned above (fs(⋅) is known and s(t) is measurable) can be easily satisfied by using the 2.5D vision-based techniques with the aid of the homography. For example, Chitrakaran et al. (2005) utilize this strategy to asymptotically identify the six-DoF velocity of a moving object using a single fixed camera. A reference state sd is introduced, and an error signal is constructed as e(t) = s(t) − sd. Then, a computable interaction matrix is derived which relates the variation of the error \( \dot {e} \) to the velocity v. Next, a nonlinear continuous estimator is designed to identify \( \dot {e} \) asymptotically, which is further utilized to determine the velocity, provided that a single geometric length between two feature points and the rotation between the camera and the reference image are known.
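A minimal numerical sketch of the MfS idea, under strong simplifying assumptions (fully measurable state, unknown constant velocity, and trivially linear dynamics ṡ = v; all gains and numbers are illustrative, not from the cited works):

```python
import numpy as np

# Toy MfS setting: s is measurable and evolves as s_dot = v with an unknown
# constant velocity v. A second-order observer estimates s_dot, hence v.
dt, T = 1e-3, 5.0
v_true = np.array([0.3, -0.1, 0.2])      # unknown velocity (illustrative)
k1, k2 = 10.0, 25.0                      # observer gains (double pole at -5)

s = np.zeros(3)                          # measured state
s_hat = np.zeros(3)                      # observer state estimate
v_hat = np.zeros(3)                      # velocity estimate

for _ in range(int(T / dt)):
    e = s - s_hat                        # measurable estimation error
    s_hat = s_hat + dt * (v_hat + k1 * e)
    v_hat = v_hat + dt * (k2 * e)
    s = s + dt * v_true                  # "measurement" from the true dynamics

print(np.round(v_hat, 4))  # approaches v_true = [0.3, -0.1, 0.2]
```

With the full 2.5D state and a computable interaction matrix, the same structure recovers the six-DoF velocity rather than this scalar-per-axis toy.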

Structure from Motion

Contrary to the MfS problem, the SfM problem utilizes the known relative motion between the camera and the object to reconstruct the object geometry. In other words, v(t) is known, while the unknown structure information may exist in s(t) (when s(t) is partially unmeasurable) or ϕ (when s(t) is measurable). Let the coordinates of a feature point expressed in the camera frame be denoted as P = (X, Y, Z)T where Z indicates the depth.

When the structure information exists in s(t), the available information comes from both v(t) and ym(t). This happens when the factor α(t) takes the absolute depth (α = Z or \( \alpha = \ln (Z) \)). The observer is designed in the following way:
$$\displaystyle \begin{aligned} \dot{\hat{s}} = h(y_m,\mathbf{v}). \end{aligned} $$
With some necessary conditions satisfied, an appropriate observer can drive the estimation error \( \tilde {s}(t) \) to zero, which means \( \hat {s}(t) \rightarrow s(t) \). Sometimes, it is not practical to directly estimate the state. Under such circumstances, indirect methods can be adopted in which some auxiliary variables are designed first and then used to construct the state estimator. Denoting the auxiliary variable as ζ(t), this strategy can be formulated as
$$\displaystyle \begin{aligned} \left\{ \begin{aligned} \dot{\zeta} = & \psi (\zeta, y_m, \mathbf{v}) \\ \hat{s} = & h(\zeta, y_m) . \end{aligned} \right. \end{aligned} $$
Dani et al. (2011) adopt this strategy to recast the structure estimation of a moving object into an unknown input observer design problem. Consequently, a causal algorithm is provided for estimating the structure of a moving object using a moving camera with relaxed assumptions on the object motion.
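A toy sketch of a direct observer of this kind, assuming a static feature point, a purely translating camera with known velocity, and measurable optical flow. The regressor form follows from the standard normalized-projection dynamics; the gain and numbers are illustrative:

```python
import numpy as np

# Toy SfM setting: static point, translating camera with known v = (vx, vy, vz).
# For normalized coordinates (x, y) and inverse depth alpha = 1/Z:
#   x_dot = (x*vz - vx) * alpha,  y_dot = (y*vz - vy) * alpha,
#   alpha_dot = vz * alpha**2.
# A gradient observer fits alpha_hat to the measured flow.
dt, T = 1e-3, 20.0
vx, vy, vz = 0.2, 0.0, 0.1               # known camera velocity (illustrative)
X, Y, Z = 0.5, -0.2, 5.0                 # true point in the camera frame
alpha_hat, k = 0.5, 50.0                 # initial guess and observer gain

for _ in range(int(T / dt)):
    x, y, alpha = X / Z, Y / Z, 1.0 / Z
    phi = np.array([x * vz - vx, y * vz - vy])   # known regressor
    flow = phi * alpha                   # "measured" optical flow (x_dot, y_dot)
    err = flow - phi * alpha_hat         # prediction error on the flow
    alpha_hat += dt * (vz * alpha_hat**2 + k * phi @ err)
    X, Y, Z = X - dt * vx, Y - dt * vy, Z - dt * vz  # camera motion

print(round(alpha_hat, 4), round(1.0 / Z, 4))  # estimate tracks the true inverse depth
```

Persistent excitation (here, vx ≠ 0 keeps the regressor bounded away from zero) is the kind of necessary condition the entry alludes to.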
When the structure information exists in ϕ, s(t) is available (ym(t) = s(t)), and the relationship between s(t) and ϕ must be established to estimate ϕ. This situation occurs when α(t) takes the ratio of depths (\( \alpha = \frac {Z_d}{Z} \) or \( \alpha = \ln \left ( \frac {Z_d}{Z} \right ) \)) because this ratio can be directly calculated by means of homography techniques. Under this circumstance, both s(t) and v(t) can be utilized to construct the observer for ϕ:
$$\displaystyle \begin{aligned} \dot{\hat{\phi}} = h(s, \mathbf{v}). \end{aligned} $$
This method is adopted in Chen et al. (2011), where the state is decomposed as the product of two matrices and a structure-related vector. This relationship is further utilized to estimate the structure-related vector. In fact, Chen et al. (2011) addresses the SaM problem, which will be introduced in detail later.
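The depth ratio that makes this case tractable can be checked with synthetic data: for a point on the reference plane, the third component of the Euclidean homography applied to the normalized reference coordinates equals Z∕Zd, whose reciprocal gives α = Zd∕Z. A minimal sketch with illustrative numbers:

```python
import numpy as np

# Verify that the depth ratio Z/Z_d is the third entry of H @ m_d for a point
# on the reference plane (all numbers illustrative).
theta = 0.1                               # rotation about the y-axis
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, 0.05, -0.3])           # translation between the two frames
n_d = np.array([0.0, 0.0, 1.0])           # reference plane normal
d_d = 2.0                                 # plane distance in the reference frame

P_d = np.array([0.4, -0.2, 2.0])          # point on the plane (n_d @ P_d == d_d)
P = R @ P_d + t                           # same point in the current frame

H = R + np.outer(t, n_d) / d_d            # Euclidean homography
m_d = P_d / P_d[2]                        # normalized reference coordinates
ratio = (H @ m_d)[2]                      # depth ratio Z / Z_d from the homography

print(round(ratio, 6), round(P[2] / P_d[2], 6))  # the two values agree
```

In practice H is estimated from matched image points and decomposed; here it is built from known R, t, n, and d only to make the identity checkable.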

Structure and Motion

The SaM problem focuses on identifying both the Cartesian coordinates of the feature points and the relative motion information (Dani et al. 2012). In SaM problems, both s(t) and v(t) are (partially) unmeasurable and to be determined, which leads to two kinds of methods.

The intuitive strategy is to use more prior knowledge about either the object or the camera. One quantity is estimated first and then facilitates the estimation of the other, e.g., in the following form:
$$\displaystyle \begin{aligned} \left\{ \begin{aligned} \dot{\hat{\mathbf{v}}} = & h_1(y_m,\phi) \\ \dot{\hat{s}} = & h_2(\hat{\mathbf{v}},y_m) . \end{aligned} \right. \end{aligned} $$
This strategy is utilized by Chen et al. (2011) based on Chitrakaran et al. (2005). Similar to Chitrakaran et al. (2005), this work requires the rotation between the reference frame and the inertial frame, as well as the coordinates of a feature point on the object, to be known. The strategy is also adopted by Dani et al. (2012), where at least one linear velocity of the camera is required to be known. The angular velocity is estimated using a filter, and the estimated angular velocity and a measured linear velocity are then combined to estimate the scaled 3D coordinates of the feature points.
The alternative way to deal with SaM problems is to introduce multiple objects or cameras undergoing different motions, e.g., one static while the other is moving. Then, multiple states and their corresponding dynamics become available:
$$\displaystyle \begin{aligned} \left\{ \begin{aligned} \dot{s}_1 = & f_{s1}(s_1,{\mathbf{v}}_1,\phi) \\ \dot{s}_2 = & f_{s2}(s_2,{\mathbf{v}}_2,\phi) . \end{aligned} \right. \end{aligned} $$
In the equations above, if v1(t) and ϕ are unknown, all other available variables can be adopted in the observers to determine them. Although the two dynamics are different, they share common information which facilitates the observer development:
$$\displaystyle \begin{aligned} \left\{ \begin{aligned} \dot{\hat{\bar{\mathbf{v}}}}_1 = & h_1(s_1,s_2) \\ \dot{\hat{\phi}} = & h_2(s_1,s_2,{\mathbf{v}}_2) . \end{aligned} \right. {} \end{aligned} $$
In (9), \( \bar {\mathbf {v}}_1(t) \) is the normalized value of v1(t), known only up to a scale resulting from the unknown absolute range. As the range is contained in ϕ, \( \hat {\phi } \) will further benefit the estimation of v1(t):
$$\displaystyle \begin{aligned} \dot{\hat{\mathbf{v}}}_1 = h_3(\hat{\bar{\mathbf{v}}}_1,\hat{\phi}). \end{aligned} $$
This strategy is adopted in the work of Chwa et al. (2015), where a scene comprising both static and moving objects is introduced to eliminate the dependence on a priori knowledge of the object. Similarly, Chen et al. (2018) extend their previous work Chen et al. (2011) by introducing a static-moving camera configuration to identify the velocity and range of the feature points on a moving object simultaneously. Unlike Chen et al. (2011), Chen et al. (2018) no longer requires the coordinates of a feature point on the object to be known.

Future Directions for Research

The 2.5D vision-based techniques have attracted much interest since they were proposed. However, there are still many aspects that can be improved, such as broadening the range of applications and providing rigorous theoretical analysis.
  • The existing applications of 2.5D vision-based techniques have been largely confined to robotic arms with 6 DoFs. Much effort has been devoted to vision-based control and estimation on wheeled mobile robots (WMRs), but the existing works are mainly image-based (Mariottini et al. 2007) or position-based (Zhang et al. 2018). Applying the 2.5D vision-based techniques on WMRs still faces many challenges due to the nonholonomic constraints.

  • Many works mention that the 2.5D-based scheme increases the likelihood that the object will stay in the camera field of view (FoV) (Malis and Chaumette 1999; Chen et al. 2005), but few of them provide rigorous theoretical analysis. Chen et al. (2007) adopt an image space navigation function together with the 2.5D-based scheme to generate a Cartesian space trajectory which ensures all feature points remain visible. However, the tracking errors may still cause the feature points to leave the FoV. Parikh et al. (2017) provide a different perspective on this issue: they investigate state estimation without visual feedback, when the feature points are outside the FoV, by means of switched systems. However, this strategy raises the stabilization problem of the overall switched system.

  • In many works, the local minima have not been rigorously analyzed, and a finite number of simulations or experiments cannot guarantee global convergence. Zhang et al. (2019) give a good demonstration of investigating the state space and analyzing the existence of multiple equilibria, which helps determine whether global convergence holds and what the influence of the undesired equilibria is.



References

  1. Chaumette F, Hutchinson S (2006) Visual servo control part I: basic approaches. IEEE Robot Autom Mag 13(4):82–90
  2. Chen J, Dawson DM, Dixon WE, Behal A (2005) Adaptive homography-based visual servo tracking for a fixed camera configuration with a camera-in-hand extension. IEEE Trans Control Syst Technol 13(5):814–825
  3. Chen J, Dawson DM, Dixon WE, Chitrakaran VK (2007) Navigation function-based visual servo control. Automatica 43(7):1165–1177
  4. Chen J, Chitrakaran VK, Dawson DM (2011) Range identification of features on an object using a single camera. Automatica 47(1):201–206
  5. Chen J, Zhang K, Jia B, Gao Y (2018) Identification of a moving object's velocity and range with a static-moving camera system. IEEE Trans Autom Control 63(7):2168–2175
  6. Chitrakaran VK, Dawson DM, Dixon WE, Chen J (2005) Identification of a moving object's velocity with a fixed camera. Automatica 41(3):553–562
  7. Chwa D, Dani AP, Dixon WE (2015) Range and motion estimation of a monocular camera using static and moving objects. IEEE Trans Control Syst Technol 24(4):1174–1183
  8. Dani AP, Kan Z, Fischer NR, Dixon WE (2011) Structure estimation of a moving object using a moving camera: an unknown input observer approach. In: 50th IEEE conference on decision and control and European control conference, Orlando, pp 5005–5010
  9. Dani AP, Fischer NR, Dixon WE (2012) Single camera structure and motion. IEEE Trans Autom Control 57(1):241–246
  10. Hutchinson S, Hager GD, Corke PI (1996) A tutorial on visual servo control. IEEE Trans Robot Autom 12(5):651–670
  11. Janabi-Sharifi F, Deng L, Wilson WJ (2011) Comparison of basic visual servoing methods. IEEE/ASME Trans Mechatron 16(5):967–983
  12. Malis E, Chaumette F (1999) 2-1/2D visual servoing. IEEE Trans Robot Autom 15(2):238–250
  13. Malis E, Chaumette F (2000) 2 1/2 D visual servoing with respect to unknown objects through a new estimation scheme of camera displacement. Int J Comput Vis 37(1):79–97
  14. Mariottini GL, Oriolo G, Prattichizzo D (2007) Image-based visual servoing for nonholonomic mobile robots using epipolar geometry. IEEE Trans Robot 23(1):87–100
  15. Parikh A, Cheng TH, Chen HY, Dixon WE (2017) A switched systems framework for guaranteed convergence of image-based observers with intermittent measurements. IEEE Trans Robot 33(2):266–280
  16. Zhang K, Chen J, Li Y, Gao Y (2018) Unified visual servoing tracking and regulation of wheeled mobile robots with an uncalibrated camera. IEEE/ASME Trans Mechatron 23(4):1728–1739
  17. Zhang K, Chaumette F, Chen J (2019) Trifocal tensor-based 6-DOF visual servoing. Int J Robot Res 38(10–11):1208–1228

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2020

Authors and Affiliations

  1. State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University, Hangzhou, China

Section editors and affiliations

  • Warren Dixon
  1. Dept. of Mechanical and Aerospace Engineering, University of Florida, Room 334, MAE-B, Gainesville, USA