The Challenge of Anticipation pp 161-184

Part of the Lecture Notes in Computer Science book series (LNCS, volume 5225)

A Reinforcement-Learning Model of Top-Down Attention Based on a Potential-Action Map

  • Dimitri Ognibene
  • Christian Balkenius
  • Gianluca Baldassarre

Abstract

How can visual selective attention guide eye movements so as to collect information and identify targets potentially relevant for action? Many models have been proposed that use the statistical properties of images to create a dynamic bottom-up saliency map used to guide saccades to potentially relevant locations. Since the concept of the saliency map was introduced, it has been incorporated into a large number of models and theories (Rao and Ballard, 1995; Itti and Koch, 2001a; Rao et al., 2002; de Brecht and Saiki, 2006; Hoffman et al., 2006; Singh et al., 2006; Walther and Koch, 2006; Chen and Kaneko, 2007; Shi and Yang, 2007; Siagian and Itti, 2007). Saliency maps have been shown to be useful both as models of human attention and for technical applications (Balkenius et al., 2004).

These bottom-up mechanisms have been enhanced with top-down processes in models that learn to move the eye in search of the target on the basis of foveated objects. In many of these systems, top-down attention is guided by task-related information that is acquired through automatic learning procedures (Dayan et al., 2000b). For example, Schmidhuber and Huber (1991) built an artificial fovea controlled by an adaptive neural controller. Q-learning was used in the model of Goncalves et al. (1999) to control attention based on multimodal input and reinforcement signals. Another model that uses reinforcement learning to control visual attention is described by Minut and Mahadevan (2001). In this model, a first component learns by reinforcement learning to direct the gaze to relevant points in space, whereas a second component performs “within fixation” processing that analyses the foveated space and identifies targets. Reinforcement learning was also used by Shibata et al. (1995) to control the movement of a visual sensor over an image. The goal of the system was to find the optimal fixation point for object recognition. In this model, the same neural network was used both for object recognition and to produce the sensory motion output. Balkenius (2000) presented a model that uses instrumental conditioning as a basis for learned saccade movements. This model was later extended to support contextual cueing, where several visual stimuli together suggest the location of a target (Balkenius, 2003). However, this model could only keep one potential target location active at a time.

Here we propose a novel model that improves on this type of top-down mechanism by using an eye-centred potential-action map (PAM). The PAM keeps track of all the potential locations of targets based on the information contained in a sequence of fixations (cf. Chen and Kaneko, 2007). In this respect, the PAM works as a short-term memory for potential target locations. Each fixation suggests potential locations for targets or other relevant cues, and the evidence for each possible location is accumulated in the PAM. The potential target locations are based on both the identity of the currently fixated object and its spatial location (Deco and Rolls, 2005). A shift mechanism triggered by eye movements allows the potential target locations activated in the PAM to be always updated with respect to the location of the current fixation (similar mechanisms might be used by real brains, cf. Gnadt and Andersen, 1988; Dominey and Arbib, 1992; Pouget et al., 2000; Di Ferdinando et al., 2004; Shadmehr and Wise, 2005). Overall, the PAM makes up an efficient mechanism for accumulating evidence for potential target locations in an action-oriented, compact format readily usable for controlling eye movements. As we shall see, the results reported here indicate that, thanks to the PAM, the model suitably integrates bottom-up and top-down attention mechanisms and outperforms simpler models that only search for targets based on a single, currently foveated object.
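
The accumulate-and-shift idea behind the PAM can be illustrated with a minimal sketch. It is not the neural implementation used in the model: the grid representation, the function names (empty_map, shift_map, accumulate, best_saccade, demo), the leaky decay factor and the vote values are all assumptions introduced here for illustration only. The sketch shows an eye-centred map whose content is shifted opposite to each saccade, so that evidence gathered over successive fixations stays aligned with the same locations in the scene.

```python
from typing import List, Tuple

Grid = List[List[float]]


def empty_map(size: int) -> Grid:
    """Create a size x size eye-centred map of zeros (size assumed odd)."""
    return [[0.0] * size for _ in range(size)]


def shift_map(pam: Grid, dx: int, dy: int) -> Grid:
    """Re-centre the map after a saccade of (dx, dy) cells.

    A location at offset (x, y) from the old fixation point lies at offset
    (x - dx, y - dy) from the new one. Evidence shifted outside the map is
    dropped, and newly exposed cells start at zero.
    """
    size = len(pam)
    shifted = empty_map(size)
    for row in range(size):
        for col in range(size):
            new_row, new_col = row - dy, col - dx
            if 0 <= new_row < size and 0 <= new_col < size:
                shifted[new_row][new_col] = pam[row][col]
    return shifted


def accumulate(pam: Grid, votes: Grid, decay: float = 0.9) -> Grid:
    """Add the evidence suggested by the current fixation.

    The leaky decay is an illustrative assumption, not taken from the paper.
    """
    size = len(pam)
    return [[decay * pam[r][c] + votes[r][c] for c in range(size)]
            for r in range(size)]


def best_saccade(pam: Grid) -> Tuple[int, int]:
    """Return the (dx, dy) offset from the map centre of the most active cell."""
    size = len(pam)
    centre = size // 2
    cells = ((r, c) for r in range(size) for c in range(size))
    row, col = max(cells, key=lambda rc: pam[rc[0]][rc[1]])
    return col - centre, row - centre


def demo() -> Tuple[int, int]:
    """Two fixations on hypothetical cues that jointly favour one location."""
    size, centre = 5, 2
    # First cue: strong vote two cells to the right, weaker vote one cell down.
    votes = empty_map(size)
    votes[centre][centre + 2] = 0.6
    votes[centre + 1][centre] = 0.5
    pam = accumulate(empty_map(size), votes)
    # The eye moves one cell to the right; the map content shifts accordingly.
    pam = shift_map(pam, dx=1, dy=0)
    # Second cue: votes for the cell that the weaker first vote now occupies.
    votes = empty_map(size)
    votes[centre + 1][centre - 1] = 0.5
    pam = accumulate(pam, votes)
    return best_saccade(pam)
```

In this toy run, the first cue alone would send the eye two cells to the right, whereas after the second fixation the accumulated evidence favours the location supported by both cues, so demo() returns the offset (-1, 1). A system that relies only on the currently foveated object would have no record of the first cue's weaker suggestion.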

In contrast to the majority of models tackling object-localisation tasks, the system proposed here was designed not only to find the target, but also to stay on the target once found. This is accomplished with multiple saccades that keep the eye’s fixation point on the target. This combines the features of the cue-target based systems described above and systems that are more directed toward tracking (e.g. Shibata and Schaal, 2001; Balkenius and Johansson, 2007). The idea underlying this functionality is that vision serves action, in particular that attentional selection is a precursor of action and is intimately related to it (Allport, 1990; Ballard, 1991; Balkenius and Hulth, 1999; Castiello, 1999; Casarotti et al., 2003; Di Ferdinando et al., 2004). In this respect, the system presented here was designed to be used within a future architecture, which will guide a robotic arm engaged in reaching rewarded targets in space. As in previous models (Ognibene et al., 2006; Herbort et al., 2007), within this architecture the targets of the arm’s reaching movements will be selected on the basis of a neural competition fuelled by the information flow coming from perception, in a way similar to what happens in the primate brain (cf. Cisek and Kalaska, 2005). With respect to this mechanism of action selection, the capacity of the attentional system to keep the fixation point on the target will allow the model to bias the competition between alternative goals of the arm’s movements in favour of objects relevant to the system.

The rest of the paper is organised as follows. Section 8.2 will first illustrate in detail the architecture proposed here and the functioning and learning processes of its components, and will then illustrate the tasks used to train and test the system. Section 8.3 will analyse in detail the function of the architecture’s components, in particular how the potential-action map can keep a memory of the information returned by cues and can integrate information on the target returned by several cues. Finally, Section 8.4 will illustrate the strengths of the architecture and its limitations, which will be tackled in future work.


Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Dimitri Ognibene (1, 2)
  • Christian Balkenius (3)
  • Gianluca Baldassarre (1)
  1. ISTC-CNR, Via S. Martino della Battaglia 44, 00185 Roma, Italy
  2. DIST, Via all’Opera Pia 13, 16145 Genova, Italy
  3. Lund University Cognitive Science, Kungshuset, Lundagård, SE-222 22 Lund, Sweden
