
Sparse pose manifolds


The efficient manipulation of randomly placed objects relies on the accurate estimation of their 6 DoF geometrical configuration. In this paper we tackle this issue by following the intuitive idea that different objects, viewed from the same perspective, should share identical poses and, moreover, that these poses should be efficiently projected onto a well-defined and highly distinguishable subspace. This hypothesis is formulated here through the introduction of pose manifolds relying on a bunch-based structure that incorporates unsupervised clustering of the abstracted visual cues and encapsulates the appearance and geometrical properties of the objects. The resulting pose manifolds represent the displacements between each of the extracted bunch points and the two foci of an ellipse fitted over the members of the bunch-based structure. We post-process the established pose manifolds via \(l_1\) norm minimization so as to build sparse and highly representative input vectors with large discrimination capabilities. Whereas other approaches to robot grasping build high-dimensional input vectors, thus increasing the complexity of the system, our method establishes highly distinguishable manifolds of low dimensionality. This paper represents the first integrated research endeavor in formulating sparse pose manifolds, and the experimental results provide evidence of low generalization error, thus supporting our theoretical claims.








The authors would like to thank Nikolaos Metaxas-Mariatos for his help in conducting the experimental validation of the proposed method.

Author information



Corresponding author

Correspondence to Rigas Kouskouridas.



Formulation of bunch-based architecture

Given an image of an object with a certain pose, we first extract the 2D locations of the \(\rho \) SIFT keypoints, denoted as \(\mathbf{v}_{\varvec{\zeta }}\in \mathbb {R}^2\). We then perform clustering over the locations of the extracted interest points (input vectors \(\mathbf{v}_{\varvec{\zeta }},\, \zeta =1,2,\dots ,\rho \)) in order to account for the topological attributes of the object. Supposing there are \(\gamma \) clusters denoted as \(\mathbf{b}_\mathbf{k},\,k=1,2,\dots ,\gamma \), we apply the basic Bayesian rule that a vector \(\mathbf{v}_{\varvec{\zeta }}\) belongs to a cluster \(\mathbf{b}_\mathbf{k}\) if \(P(\mathbf{b}_\mathbf{k}|\mathbf{v}_{\varvec{\zeta }})>P(\mathbf{b}_\mathbf{j}|\mathbf{v}_{\varvec{\zeta }})\) for all \(j=1,2,\dots ,\gamma ,\,j\ne k\). The expectation of the log-likelihood, conditioned on the current estimates \({\varvec{\varTheta }} (\tau )\) (\(\tau \) denotes the iteration step) and the training samples (E-step of the EM algorithm), is:

$$\begin{aligned} J({\varvec{\varTheta }};{\varvec{\varTheta }}(\tau ))&=E\Big [ \sum _{\zeta =1}^{\rho }\ln (p(\mathbf{v}_{\varvec{\zeta }}|\mathbf{b}_\mathbf{k};\varvec{\theta })P_k) \Big ] \\&=\sum _{\zeta =1}^{\rho }\sum _{k=1}^{\gamma }P(\mathbf{b}_\mathbf{k}|\mathbf{v}_{\varvec{\zeta }}; {\varvec{\varTheta }}(\tau ))\ln (p(\mathbf{v}_{\varvec{\zeta }}|\mathbf{b}_\mathbf{k};\varvec{\theta })P_k) \end{aligned}$$

with \(\mathbf{P}_{1\times \gamma }=[P_1,\dots ,P_\gamma ]^T\) denoting the a priori probabilities of the respective clusters, \(\widehat{\varvec{\theta }}_{2\times \gamma }=[\varvec{\theta }_1^T, \dots , \varvec{\theta }_\gamma ^T]^T\) collecting the parameter vectors \(\varvec{\theta }_\mathbf{k}\) of the \(k\)-th cluster, and \({\varvec{\varTheta }}_{3\times \gamma }=[\widehat{\varvec{\theta }}^T,\mathbf{P}^T]^T\). According to the M-step of the EM algorithm, the parameters of the \(\gamma \) clusters in the respective subspace are estimated through the maximization of the expectation:

$$\begin{aligned} {\varvec{\varTheta }}(\tau +1)=\arg \max _{\varvec{\varTheta }}J({\varvec{\varTheta }};{\varvec{\varTheta }}(\tau )) \end{aligned}$$

resulting in

$$\begin{aligned} \sum _{\zeta =1}^{\rho }\sum _{k=1}^{\gamma }P(\mathbf b _\mathbf k |\mathbf v _{\varvec{\zeta }};{\varvec{\varTheta }}(\tau ))\frac{\partial }{\partial \varvec{\theta }_\mathbf k }ln(p(\mathbf v _{\varvec{\zeta }}|\mathbf b _\mathbf k ;\varvec{\theta }_\mathbf k ))=0 \end{aligned}$$

while maximization with respect to the a priori probability P gives:

$$\begin{aligned} P_k=\frac{1}{\rho }\sum _{\zeta =1}^{\rho }P(\mathbf{b}_\mathbf{k}|\mathbf{v}_{\varvec{\zeta }};{\varvec{\varTheta }}(\tau ))\qquad \quad \text { with } k=1,\dots ,\gamma \end{aligned}$$

It is apparent that the optimization of Eq. 8 with respect to P is a constrained maximization problem subject to \(P_k\ge 0,\,k=1,\dots ,\gamma \) and \(\sum _{k=1}^{\gamma }P_k=1\). We recall the Lagrangian theory, which states:

Given a function \(f(x)\) to be optimized subject to several constraints \(g_i(x)=0\), build the corresponding Lagrangian function as \(\mathcal {L}(x,\lambda )=f(x)-\sum _i\lambda _i g_i(x)\).

Following on from the above statement, we denote the respective (to Eq. 6) Lagrangian function as:

$$\begin{aligned} \mathcal {J}(\mathbf P ,\lambda )=J({\varvec{\varTheta }};{\varvec{\varTheta }}(\tau ))-\lambda \Big (\sum _{k=1}^{\gamma }P_k-1\Big ) \end{aligned}$$

We obtain \(\lambda \) and \(P_k\) through:

$$\begin{aligned}&\frac{\partial \mathcal {J}(\mathbf{P},\lambda )}{\partial P_k}=0 \Rightarrow \\&\frac{\partial }{\partial P_k}\Big ( \sum _{\zeta =1}^{\rho }\sum _{j=1}^{\gamma }P(\mathbf{b}_\mathbf{j}|\mathbf{v}_{\varvec{\zeta }}; {\varvec{\varTheta }}(\tau ))\ln (p(\mathbf{v}_{\varvec{\zeta }}|\mathbf{b}_\mathbf{j};\varvec{\theta })P_j)\Big )-\\&-\frac{\partial }{\partial P_k}\lambda \Big (\sum _{j=1}^{\gamma }P_j-1\Big )=0 \Rightarrow \\&\frac{1}{P_k}\sum _{\zeta =1}^{\rho }P(\mathbf{b}_\mathbf{k}|\mathbf{v}_{\varvec{\zeta }};{\varvec{\varTheta }}(\tau ))-\lambda =0 \Rightarrow \\&P_k=\frac{1}{\lambda }\sum _{\zeta =1}^{\rho }P(\mathbf{b}_\mathbf{k}|\mathbf{v}_{\varvec{\zeta }};{\varvec{\varTheta }}(\tau )) \end{aligned}$$

Since \(\sum _{k=1}^{\gamma }P_k=1\) and \(\sum _{k=1}^{\gamma }P(\mathbf{b}_\mathbf{k}|\mathbf{v}_{\varvec{\zeta }};{\varvec{\varTheta }}(\tau ))=1\) for every \(\zeta \), summing the last expression over \(k\) gives \(\lambda =\rho \), resulting in the final a priori probability of the \(k\)-th cluster of Eq. 9:

$$\begin{aligned} P_k=\frac{1}{\rho }\sum _{\zeta =1}^{\rho }P(\mathbf{b}_\mathbf{k}|\mathbf{v}_{\varvec{\zeta }};{\varvec{\varTheta }}(\tau ))\qquad \qquad \text { with } k=1,\dots ,\gamma \end{aligned}$$
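The E- and M-steps derived in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes isotropic Gaussian clusters with a fixed, shared standard deviation `sigma`, so that each parameter vector \(\varvec{\theta }_\mathbf{k}\) reduces to a 2D mean. The function name `em_cluster_keypoints` and all of its parameters are hypothetical.

```python
import numpy as np


def em_cluster_keypoints(v, gamma, sigma=1.0, n_iter=50, init_means=None, seed=0):
    """EM over 2D keypoint locations with isotropic Gaussian clusters.

    v          : (rho, 2) array of keypoint locations v_zeta
    gamma      : number of clusters b_k
    sigma      : fixed, shared standard deviation (assumption of this sketch)
    init_means : optional (gamma, 2) initial means; random keypoints otherwise
    Returns (means, priors, resp), where resp[zeta, k] = P(b_k | v_zeta) as
    computed in the last E-step.
    """
    v = np.asarray(v, dtype=float)
    rho = v.shape[0]
    if init_means is None:
        rng = np.random.default_rng(seed)
        means = v[rng.choice(rho, size=gamma, replace=False)].copy()
    else:
        means = np.array(init_means, dtype=float)
    priors = np.full(gamma, 1.0 / gamma)
    eps = 1e-12  # guards against log(0) and division by zero

    for _ in range(n_iter):
        # E-step: posteriors P(b_k | v_zeta; Theta(tau)) via Bayes' rule.
        sq_dist = ((v[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_w = np.log(priors + eps)[None, :] - sq_dist / (2.0 * sigma ** 2)
        log_w -= log_w.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_w)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step for the priors: P_k = (1 / rho) * sum_zeta P(b_k | v_zeta).
        mass = resp.sum(axis=0)
        priors = mass / rho
        # M-step for theta_k: posterior-weighted mean of the keypoints.
        means = (resp.T @ v) / np.maximum(mass, eps)[:, None]

    return means, priors, resp
```

Under these assumptions, the Bayesian assignment rule corresponds to `labels = resp.argmax(axis=1)`, and the returned `priors` sum to one by construction, which is the constraint enforced by the Lagrange multiplier.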

Training with noise

The performance of the proposed regressor-based 3D pose estimation module is improved by adding noise to the input vectors fed to the RBF kernel during training. In the following we present a slightly modified version of the Tikhonov regularization theorem, adjusted to the needs of our case. When the inputs contain no noise and the size \(\mathbf{t}\) of the training dataset tends to infinity, the error function, expressed through the joint distribution \(p(\mathbf{y}_{\varvec{\lambda }},\mathbf{r})\) of the inputs \(\mathbf{r}\) and the desired values \(\mathbf{y}_{\varvec{\lambda }}\) for the network output \(\mathbf{g}_{\varvec{\lambda }}\), assumes the form:

$$\begin{aligned} E&=\lim _{\mathbf{t} \rightarrow \infty }\frac{1}{2\mathbf{t}}\sum _{k=1}^{\mathbf{t}}\sum _{\varvec{\lambda }}\{\mathbf{g}_{\varvec{\lambda }}(\mathbf{r}_\mathbf{k};\mathbf{w})-\mathbf{y}_{\varvec{\lambda },\mathbf{k}}\}^2\\&=\frac{1}{2}\sum _{\varvec{\lambda }}\int \int \{\mathbf{g}_{\varvec{\lambda }}(\mathbf{r};\mathbf{w})-\mathbf{y}_{\varvec{\lambda }}\}^2p(\mathbf{y}_{\varvec{\lambda }},\mathbf{r})d\mathbf{y}_{\varvec{\lambda }} d\mathbf{r}\\&=\frac{1}{2}\sum _{\varvec{\lambda }}\int \int \{\mathbf{g}_{\varvec{\lambda }}(\mathbf{r};\mathbf{w})-\mathbf{y}_{\varvec{\lambda }}\}^2p(\mathbf{y}_{\varvec{\lambda }}|\mathbf{r})p(\mathbf{r})d\mathbf{y}_{\varvec{\lambda }} d\mathbf{r} \end{aligned}$$

Let \(\varvec{\eta }\) be a random vector describing the noise added to the input data, with probability distribution \(p(\varvec{\eta })\). In most cases, the noise distribution is chosen to have zero mean (\(\int \varvec{\eta }_ip(\varvec{\eta })d\varvec{\eta }=0\)) and to be uncorrelated (\(\int \varvec{\eta }_i\varvec{\eta }_jp(\varvec{\eta })d\varvec{\eta }=\text {variance}\cdot \delta _{ij}\), with \(\delta _{ij}\) the Kronecker delta). When each input data point is corrupted by additive noise and repeated infinitely many times, the error function over the expanded data can be written as:

$$\begin{aligned} \widetilde{E}&=\frac{1}{2}\sum _{\varvec{\lambda }}\int \int \int \{\mathbf{g}_{\varvec{\lambda }}(\mathbf{r}+\varvec{\eta };\mathbf{w})-\mathbf{y}_{\varvec{\lambda }}\}^2 \\&\quad p(\mathbf{y}_{\varvec{\lambda }} \mid \mathbf{r})p(\mathbf{r})p(\varvec{\eta })d\mathbf{y}_{\varvec{\lambda }} d\mathbf{r}d\varvec{\eta } \end{aligned}$$
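In practice the expanded dataset is finite. A minimal sketch of such a noise augmentation is given below; it assumes Gaussian noise (the text only requires zero mean and uncorrelated components), and the function name `add_input_noise` and its parameters are hypothetical rather than part of the proposed method.

```python
import numpy as np


def add_input_noise(r, y, variance, copies, seed=0):
    """Replicate every training pair `copies` times and perturb the inputs.

    r        : (t, d) array of training input vectors
    y        : (t, m) array of desired outputs, left unchanged
    variance : noise variance per input dimension
    Returns (noisy_r, repeated_y) with t * copies rows each.
    """
    rng = np.random.default_rng(seed)
    repeated_r = np.repeat(r, copies, axis=0)
    repeated_y = np.repeat(y, copies, axis=0)
    # Zero-mean noise with independent components, so that
    # E[eta_i * eta_j] = variance * delta_ij (Gaussian is an assumption here).
    eta = rng.normal(0.0, np.sqrt(variance), size=repeated_r.shape)
    return repeated_r + eta, repeated_y
```

The targets are repeated without perturbation, mirroring the expression for \(\widetilde{E}\), in which only the input argument of the network function is shifted by \(\varvec{\eta }\).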

Expanding the network function as a Taylor series in powers of \(\varvec{\eta }\) produces:

$$\begin{aligned} \mathbf{g}_{\varvec{\lambda }}(\mathbf{r}+\varvec{\eta };\mathbf{w})&=\mathbf{g}_{\varvec{\lambda }}(\mathbf{r};\mathbf{w})+\sum _{i}\varvec{\eta }_i\frac{\partial \mathbf{g}_{\varvec{\lambda }}}{\partial \mathbf{r}_i}\biggm \vert _{\varvec{\eta }=0}+ \\&\quad +\frac{1}{2}\sum _{i}\sum _{j}\varvec{\eta }_i\varvec{\eta }_j\frac{\partial ^2\mathbf{g}_{\varvec{\lambda }}}{\partial \mathbf{r}_i \partial \mathbf{r}_j}\biggm \vert _{\varvec{\eta }=0}+\mathcal {O}(\varvec{\eta }^3) \end{aligned}$$

By substituting the Taylor series expansion into the error function and keeping terms up to second order in \(\varvec{\eta }\), we obtain the following regularized form, in which \(\varOmega \) is the term that governs the Tikhonov regularization:

$$\begin{aligned} \widetilde{E}=E+\text {variance} \times \varOmega \end{aligned}$$


$$\begin{aligned} \varOmega =\frac{1}{2}\sum _{\varvec{\lambda }}\sum _{i}\int \int&\left\{ \left( \frac{\partial \mathbf{g}_{\varvec{\lambda }}}{\partial \mathbf{r}_i}\right) ^2 +\frac{1}{2}\{\mathbf{g}_{\varvec{\lambda }}(\mathbf{r})-\mathbf{y}_{\varvec{\lambda }}\}\frac{\partial ^2\mathbf{g}_{\varvec{\lambda }}}{\partial \mathbf{r}_{i}^{2}} \right\} \\&p(\mathbf{y}_{\varvec{\lambda }}|\mathbf{r})p(\mathbf{r})d\mathbf{r}d\mathbf{y}_{\varvec{\lambda }} \end{aligned}$$
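The relation \(\widetilde{E}=E+\text {variance}\times \varOmega \) can be checked numerically in the simplest case of a linear map \(g(\mathbf{r})=\mathbf{w}^T\mathbf{r}\), for which the first derivatives are the weights, the second derivatives vanish, and therefore \(\varOmega =\frac{1}{2}\sum _i w_i^2\). The sketch below is an illustration under these assumptions only: the linear model stands in for the RBF regressor, the integrals are replaced by empirical averages, the noise is assumed Gaussian, and the function name is hypothetical.

```python
import numpy as np


def noisy_error_vs_regularized(w, r, y, variance, copies=20000, seed=0):
    """Compare the noisy error with E + variance * Omega for g(r) = w . r.

    w : (d,) weights, r : (t, d) inputs, y : (t,) desired outputs.
    Returns (E_tilde, E_plus_reg).
    """
    rng = np.random.default_rng(seed)
    # Noise-free sum-of-squares error E, averaged over the training samples.
    E = 0.5 * np.mean((r @ w - y) ** 2)
    # Monte Carlo estimate of the noisy error: `copies` noise draws per sample.
    eta = rng.normal(0.0, np.sqrt(variance), size=(copies,) + r.shape)
    E_tilde = 0.5 * np.mean(((r[None] + eta) @ w - y[None]) ** 2)
    # For a linear map dg/dr_i = w_i and d^2 g / dr_i^2 = 0,
    # so Omega reduces to half the squared norm of the weights.
    omega = 0.5 * np.sum(w ** 2)
    return E_tilde, E + variance * omega
```

For this linear case the two returned values agree up to Monte Carlo error, so training with input noise behaves like adding a weight-decay penalty scaled by the noise variance. For a nonlinear network function the agreement holds only to second order in the noise.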


About this article

Cite this article

Kouskouridas, R., Charalampous, K. & Gasteratos, A. Sparse pose manifolds. Auton Robot 37, 191–207 (2014).



  • Object manipulation
  • Sparse representation
  • Manifold modeling
  • Neural networks