Integrating planning perception and action for informed object search


This paper presents a method to reduce the time a robot with cognitive abilities spends looking for objects in unknown locations. It describes how machine learning techniques can be used to decide which places should be inspected first, based on images that the robot acquires passively. The proposal consists of two concurrent processes. The first uses these images to generate a description of the types of objects found in each object container seen by the robot. This is done passively, regardless of the task being performed. The containers can be tables, boxes, shelves or any other kind of container of known shape whose contents can be seen from a distance. The second process uses the previously computed estimate of the contents of the containers to decide which container is most likely to hold the object being sought. This second process is deliberative and takes place only when the robot needs to find an object, either because it is explicitly asked to locate one or because finding it is a step towards fulfilling the robot's mission. If a guess fails, the robot keeps making guesses until the object is found. Guesses are based on the semantic distance between the object to find and the description of the types of objects found in each container. The paper provides quantitative results comparing the efficiency of the proposed method with that of two baseline approaches.
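The ranking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three-dimensional vectors in `EMBEDDINGS` are made-up stand-ins for learned word embeddings, the use of mean cosine distance as the "semantic distance" is an assumption, and the names `rank_containers` and `search` are hypothetical.

```python
import math

# Hypothetical toy embeddings; a real system would load learned word vectors.
EMBEDDINGS = {
    "mug": (0.9, 0.1, 0.0),
    "cup": (0.8, 0.2, 0.1),
    "plate": (0.7, 0.3, 0.0),
    "stapler": (0.1, 0.9, 0.2),
    "notebook": (0.0, 0.8, 0.3),
    "hammer": (0.1, 0.2, 0.9),
    "screwdriver": (0.2, 0.1, 0.8),
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def container_distance(target, labels):
    """Mean distance between the target and the labels seen in a container.

    Averaging is an assumption made for this sketch.
    """
    distances = [cosine_distance(EMBEDDINGS[target], EMBEDDINGS[label])
                 for label in labels]
    return sum(distances) / len(distances)


def rank_containers(target, containers):
    """Order container names from most to least promising."""
    return sorted(containers,
                  key=lambda name: container_distance(target, containers[name]))


def search(target, containers, inspect):
    """Inspect containers in ranked order until inspect(name) succeeds.

    Returns the container where the object was found (or None) and the
    list of containers that were inspected along the way.
    """
    visited = []
    for name in rank_containers(target, containers):
        visited.append(name)
        if inspect(name):
            return name, visited
    return None, visited


# Labels passively collected for each container (illustrative data).
containers = {
    "desk": ["stapler", "notebook"],
    "shelf": ["hammer", "screwdriver"],
    "kitchen_table": ["cup", "plate"],
}
ranking = rank_containers("mug", containers)
found, visited = search("mug", containers, lambda name: name == "shelf")
```

With this toy data the kitchen table is inspected first because its contents are semantically closest to "mug"; when that guess fails, the search falls through to the next-ranked container.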


  1. For more in-depth reviews of visual attention models from psychological and neurobiological perspectives, refer to Rothenstein and Tsotsos (2008), Carrasco (2011), Borji and Itti (2013) and Tsotsos (2017).

  2. Let us assume that a robot located in a room \(r_1\) is supposed to approach a table \(t_1\), located in room \(r_2\), to fetch a bottle of water for a user. A possible plan could comprise moving to room \(r_2\), then approaching table \(t_1\) and finally detecting a bottle of water on it. Let us also assume that another bottle of water comes into the robot's field of view as it moves towards room \(r_2\). Only if the bottle-of-water detector is already active before the robot approaches table \(t_1\) can this bottle be detected and the plan optimized to use it instead.
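The scenario in the second note can be sketched as a toy plan executor in Python. Everything here is hypothetical and only illustrates the argument: the step names, the `sightings` dictionary (which objects enter the field of view during each step) and the `always_detect` flag are assumptions, not part of the described architecture.

```python
def run_plan(plan, target, sightings, always_detect):
    """Execute plan steps in order, stopping early if the target is detected.

    plan: list of (action, argument) tuples.
    sightings: dict mapping a step index to the set of object types that
        are in the robot's field of view while that step runs.
    always_detect: if True the detector runs during every step (passive
        perception); if False it runs only during explicit "detect" steps.
    """
    executed = []
    for index, step in enumerate(plan):
        executed.append(step)
        detector_on = always_detect or step[0] == "detect"
        if detector_on and target in sightings.get(index, set()):
            # Target located: fetch it and skip the remaining steps.
            executed.append(("fetch", target))
            return executed
    return executed


plan = [("move_to", "r2"), ("approach", "t1"), ("detect", "bottle")]
# A bottle is visible on the way to r2 (step 0) and on table t1 (step 2).
sightings = {0: {"bottle"}, 2: {"bottle"}}

passive = run_plan(plan, "bottle", sightings, always_detect=True)
on_demand = run_plan(plan, "bottle", sightings, always_detect=False)
```

With the detector always on, the run ends after the first step because the bottle seen en route is used; with the detector limited to the explicit detection step, the whole plan is executed before the bottle is found.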


  • Aloimonos Y (1993) Active perception. Lawrence Erlbaum, Hillsdale

  • Bissmarck F, Svensson M, Tolt G (2015) Efficient algorithms for next best view evaluation. In: IEEE/RSJ international conference on intelligent robots and systems

  • Borji A, Itti L (2013) State-of-the-art in visual attention modeling. IEEE Trans Pattern Anal Mach Intell 35(1):185–207

  • Canziani A, Culurciello E (2015) Visual attention with deep neural networks. In: Information sciences and systems (CISS), 2015 49th annual conference on, pp. 1–3, March 2015

  • Carrasco M (2011) Visual attention: the past 25 years. Vis Res 51(13):1484–1525

  • Connolly C (1985) The determination of next best views. In: Robotics and automation. Proceedings. 1985 IEEE international conference on, vol 2, pp 432–435. IEEE

  • Egeth HE (1966) Parallel versus serial processes in multidimensional stimulus discrimination. Atten Percept Psychophys 1(4):245–252

  • Foote T (2013) TF: the transform library. In: Technologies for practical robot applications (TePRA), 2013 IEEE international conference on, open-source software workshop, pp 1–6, April 2013

  • Forssén P-E, Meger D, Lai K, Helmer S, Little JJ, Lowe DG (2008) Informed visual search: combining attention and object recognition. In: Robotics and automation, 2008. ICRA 2008. IEEE international conference on, pp 935–942. IEEE

  • Gutierrez MA, Banchs RE, D’Haro LF (2015) Perceptive parallel processes coordinating geometry and texture. In: Proceedings of Workshop on Multimodal Semantics for Robotic Systems 2015, Hamburg, pp 30–35

  • Gutiérrez MA, Manso LJ, Pandya H, Núñez P (2017) A passive learning sensor architecture for multimodal image labeling: an application for social robots. Sensors 17(2):353

  • He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385

  • Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Intell 20(11):1254–1259

  • Lee S, Lim J, Suh IH (2015) Incremental learning from a single seed image for object detection. In: Intelligent robots and systems (IROS), 2015 IEEE/RSJ international conference on, pp 1905–1912. IEEE

  • Manso LJ et al (2010) RoboComp: a tool-based robotics framework. In: Simulation, modeling and programming for autonomous robots, pp 251–262. Springer

  • Manso LJ, Bustos P, Bachiller P, Núñez P (2015) A perception-aware architecture for autonomous robots. Int J Adv Robot Syst 12(174):13

  • Manso LJ, Calderita LV, Bustos P, Bandera A (2016) Use and advances in the active grammar-based modeling architecture. In: Proceedings of the workshop of physical agents, pp 1–25

  • Martinez Mozos O, Chollet F, Murakami K, Morooka K, Tsuji T, Kurazume R, Hasegawa T (2012) Tracing commodities in indoor environments for service robotics. In: IFAC Proceedings Volumes, vol 45, Elsevier, pp 71–76

  • Mikolov T, Dean J (2013) Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems

  • Milliez G, Warnier M, Clodic A, Alami R (2014) A framework for endowing an interactive robot with reasoning capabilities about perspective-taking and belief management. In: The 23rd IEEE international symposium on robot and human interactive communication, pp 1103–1109. IEEE

  • Mnih V, Heess N, Graves A, Kavukcuoglu K (2015) Recurrent models of visual attention. In: Advances in neural information processing systems, vol 27

  • Müller HJ, Krummenacher J (2006) Visual search and selective attention. Vis Cognit 14(4–8):389–410

  • Pillai S, Leonard J (2015) Monocular SLAM supported object recognition. arXiv preprint arXiv:1506.01732

  • Quigley M et al (2009) ROS: an open-source robot operating system. In: Proc. of ICRA workshop on open source software

  • Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788

  • Rothenstein AL, Tsotsos JK (2008) Attention links sensing to recognition. Image Vis Comput 26(1):114–126

  • Rusu RB, Bradski G, Thibaux R, Hsu J (2010) Fast 3D recognition and pose using the viewpoint feature histogram. In: Intelligent robots and systems (IROS), 2010 IEEE/RSJ international conference on, pp 2155–2162. IEEE

  • Sternberg S et al (1966) High-speed scanning in human memory. Science 153(3736):652–654

  • Treisman AM, Gelade G (1980) A feature-integration theory of attention. Cognit Psychol 12(1):97–136

  • Tsotsos JK (2017) Attention and cognition: principles to guide modeling. In: Computational and cognitive neuroscience of vision, pp 277–295. Springer

  • Van der Maaten L, Hinton G (2012) Visualizing non-metric similarities in multiple maps. Mach Learn 87(1):33–55

  • Wallenberg M, Forssén P-E (2010) Embodied object recognition using adaptive target observations. Cognit Comput 2(4):316–325

  • Walther D, Rutishauser U, Koch C, Perona P (2005) Selective visual attention enables learning and recognition of multiple objects in cluttered scenes. Comput Vis Image Underst 100(1):41–63

  • Wolfe JM, Gray W (2007) Guided search 4.0. Integrated models of cognitive systems, pp 99–119

  • Xu K, Ba J, Kiros R, Courville A, Salakhutdinov R, Zemel R, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044


This work has been partially supported by the MICINN Project TIN2015-65686-C5-5-R, by the Extremaduran Government Project GR15120, by the Red de Excelencia “Red de Agentes Físicos” TIN2015-71693-REDT and by MEC project PHBP14/00083. Funding was provided by Junta de Extremadura (Ayudas Consolidación Grupos Investigación Catalogados).

Author information

Corresponding author

Correspondence to Luis J. Manso.

Additional information

Handling editor: Antonio Bandera (University of Malaga); Reviewers: David Meger (McGill University), Antonio Palomino (Fundación Magtel).

This article is part of the Special Issue on ‘Cognitive Robotics’ guest-edited by Antonio Bandera, Jorge Dias, and Luis Manso.

About this article

Cite this article

Manso, L.J., Gutierrez, M.A., Bustos, P. et al. Integrating planning perception and action for informed object search. Cogn Process 19, 285–296 (2018).


  • Active perception
  • Informed search
  • Perception-aware planning