Basal Ganglia System as an Engine for Exploration
Keywords: Basal ganglia · Chronic fatigue syndrome · Reinforcement learning · Indirect pathway · Direct pathway
The basal ganglia (BG) system is a deep brain circuit implicated in a wide range of functions. Exploration refers to the sampling of a variety of behaviors not firmly established within a learned repertoire. While the neural source of the variability that drives exploration has not been identified within the subcortex, the hypothesis that the indirect pathway of the BG is the subcortical substrate for exploration leads to explanations for how a range of putative BG functions might be performed.
Reinforcement Learning and the Basal Ganglia
For nearly a century, a certain “mysteriousness” has been attributed to the function of the basal ganglia (BG) system – a deep brain circuit of multiple interconnected nuclei, with rich connections to large parts of the cortex (Kinnier Wilson’s Croonian Lectures of 1925; Marsden 1982). The mystique surrounding BG has its roots perhaps in the multifarious functions of this circuit. Action selection, action gating, sequence generation, motor preparation, reinforcement learning, timing, working memory, goal-directed behavior, and exploratory behavior – the list of putative BG functions is long and, by the current state of knowledge, probably still incomplete. Lesions of this circuit manifest in many forms – affecting simple reaching movements, handwriting, balance and gait, speech and language, eye movements, and force generation, in addition to cognitive and affective functions. A long line of neurological disorders (Parkinson’s disease, Huntington’s disease, athetosis, chronic fatigue syndrome) (DeLong 1990) and neuropsychiatric disorders (schizophrenia, obsessive–compulsive disorder, ADHD, apathy, abulia, insomnia) (Ring and Serra-Mestres 2002) are associated with BG impairment.
A need to go beyond the simple Go/NoGo picture of BG function had its seeds in experiments (Houk et al. 1995; Schultz et al. 1997) on the firing properties of mesencephalic dopamine cells. Although dopaminergic cell activity had long been linked to reward sensing, experiments by Schultz et al. (1997) specifically showed that dopamine neurons of the ventral tegmental area (VTA) respond to unpredicted rewards (food or juice). When a sensory stimulus (such as a sound or a light flash) consistently precedes the reward so that the stimulus becomes predictive of the reward, the dopamine response transfers to the stimulus and diminishes at the time of the now-predicted reward. Such findings led to the insight that dopamine cell activity is analogous to a quantity known as temporal difference error (TD error) that appears in reinforcement learning (RL) theory – a branch of machine learning (Sutton and Barto 1998). Recognition of the analogy between mesencephalic dopamine signals and TD error signals of RL has inspired a much larger effort to draw parallels between other elements of RL theory and anatomical components of BG. Although the effort to explain various functions of BG using RL concepts is a story in the making, it is believed that RL holds the promise to create a comprehensive theory of BG in the long term (Chakravarthy et al. 2010).
RL theory describes how an agent can learn correct stimulus–response (SR) relationships using reward feedback from the environment. For a given stimulus, responses that yield rewards are reinforced, while those that result in punishment are attenuated. The problem is often complicated by the fact that rewards come after a delay following the response, or even after a whole series of responses. The agent then needs a surrogate for reward, which guides its responses in the intervening period. RL theory proposes the value function as such a surrogate; a computational module called the critic computes value. (The module that performs actions is known as the actor.) Value is defined as the total discounted future reward that an agent expects to receive from a given state. Once the value is known, at any instant the agent can choose the response or action that brings about the greatest increase in value, a process known as exploitation. Sometimes it is desirable for the agent to try out actions that are not optimal, by way of adapting to the changing reward patterns of the real world. This selection of suboptimal actions, typically stochastic, is known as exploration. Thus, exploitation and exploration are two key complementary processes, the yin and yang, of RL theory (Sutton and Barto 1998).
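The critic’s value computation and the TD error can be sketched in a few lines. The following is an illustrative tabular TD(0) fragment (the chain of states and the reward schedule are invented for the example); the TD error δ plays the role ascribed above to the phasic dopamine signal:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: delta is the TD error; the critic's value
    table V is nudged toward the reward-predicting estimate."""
    delta = r + gamma * V[s_next] - V[s]  # TD error (dopamine analogue)
    V[s] += alpha * delta                 # critic update
    return delta

# Toy chain of states 0 -> 1 -> 2, with reward only on reaching state 2.
V = [0.0, 0.0, 0.0]
for _ in range(200):
    td0_update(V, 0, 0.0, 1)   # no reward on the first transition
    td0_update(V, 1, 1.0, 2)   # reward on entering the terminal state
# V[1] converges toward 1.0 and V[0] toward gamma * V[1]: the
# "dopamine" burst migrates to the reward-predicting state.
```

As learning progresses, δ at the rewarded transition shrinks toward zero, mirroring the disappearance of the dopamine response to a fully predicted reward.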
The Indirect Pathway and Exploration
In actor/critic (AC) models of BG, the emphasis is usually on the respective substrates for the actor and critic components in the striatum (Joel et al. 2002), and on the dopamine signal for its role in training the AC components. Thus, though exploitation and exploration are complementary processes, exploitation receives most of the attention. This omission is perhaps not surprising: even in the AC framework, the actor and critic are recognized explicitly as modules, while exploration is implemented merely as a “noise term” in the RL equations. Since variability is ubiquitous in the brain – arising from thermal noise or from chaotic neural dynamics – the search for a specific substrate for exploration in BG was perhaps felt to be unnecessary.
Experimental evidence, particularly from functional neuroimaging, seems to support this partial view, with a bias towards cortical substrates over subcortical ones. In fMRI studies, gamblers were asked to choose between slots expected to give the highest rewards (“exploit”) and less familiar slots that might turn out to be more profitable (“explore”), and the brain areas preferentially activated during exploitation or exploration were noted (Daw et al. 2006). While substrates for value computation were found in the orbitofrontal cortex (Knutson et al. 2001), substrates for exploration were found in the anterior frontopolar cortex and intraparietal sulcus (Daw et al. 2006). The anterior cingulate cortex (ACC) is suggested to be involved in balancing exploitation against exploration (Rushworth and Behrens 2008). Yoshida and Ishii (2006) found activation in the prefrontal cortex and ACC when subjects were exploring a maze. In the subcortex, it has been suggested that the ventral and dorsal striata correspond to the critic and actor, respectively (O’Doherty et al. 2004). Thus, though both cortical and subcortical substrates of exploitation have been identified, no corresponding subcortical substrates for exploration have been found.
Can there be subcortical substrates for exploration? Stein et al. (1997) showed that decorticated kittens can exhibit exploratory and goal-oriented behavior. Rats with STN lesions were shown to exhibit perseverative behavior, or reduced exploration of new options with persistent selection of older, unrewarding ones (Baunez et al. 2001). When bicuculline, a GABA antagonist, was injected into the anterior GPe of primates, the animals exhibited stereotypic movements; when it was injected into the dorsolateral GPe, the animals produced hyperactivity that included exploratory or searching movements for food (Grabli et al. 2004). Drawing inspiration from the studies of Usher et al. (1999), Doya (2002) suggested a link between norepinephrine levels and the “inverse temperature” parameter that controls exploration in the RL literature. It is noteworthy that the globus pallidus is reported to have high norepinephrine levels (Russell et al. 1992).
Thus, the possibility appears compelling that the STN–GPe system, constituting the IP of BG, is the subcortical substrate for exploratory behavior. The STN–GPe system and its intriguing oscillatory activity do not occupy a prominent place in the AC modeling literature. On the other hand, an entire line of modeling work presents the STN–GPe system as a pacemaker in the brain, in reference to its oscillatory activity (Gillies et al. 2002; Willshaw and Li 2002). These oscillations have also been linked to Parkinsonian tremor (Hurtado et al. 1999; Terman et al. 2002). Though the aforementioned STN–GPe models explain the behavioral effects of pathological oscillations, they attribute no role to the oscillations within the RL framework that is thought to govern BG processing. Under dopamine-deficient or Parkinsonian conditions, the firing patterns of STN and GPe neurons show dramatically increased correlation without a significant increase in firing rate (Bergman et al. 1994; Brown et al. 2001). Since exploration is driven by noise in RL models, a brain region that drives exploration is expected to be a source of noise, generated perhaps by complex neural dynamics. Considering the low correlation in STN–GPe under normal conditions, and the increased correlation – or loss of complexity – in pathology, it is plausible that the STN–GPe system is a subcortical substrate for exploration.
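The (de)correlation invoked here is straightforward to quantify. Below is an illustrative check that mean pairwise Pearson correlation separates independent (“normal”, complex) activity from synchronized (“Parkinsonian”) activity; the traces are synthetic stand-ins, not recorded data:

```python
import math
import random

def mean_pairwise_corr(traces):
    """Average Pearson correlation over all pairs of activity traces;
    low values indicate the decorrelated firing expected of a noise source."""
    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = math.sqrt(sum((x - ma) ** 2 for x in a))
        sb = math.sqrt(sum((y - mb) ** 2 for y in b))
        return cov / (sa * sb)
    pairs = [(i, j) for i in range(len(traces)) for j in range(i + 1, len(traces))]
    return sum(corr(traces[i], traces[j]) for i, j in pairs) / len(pairs)

rng = random.Random(0)
# "Normal" regime: independent traces -> near-zero mean correlation.
independent = [[rng.random() for _ in range(500)] for _ in range(4)]
# "Parkinsonian" regime: a shared component dominates every trace.
shared = [rng.random() for _ in range(500)]
synchronized = [[s + 0.1 * rng.random() for s in shared] for _ in range(4)]
```

Applied to the two synthetic regimes, the statistic is near zero for the independent traces and near one for the synchronized ones.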
By an extended application of RL concepts, a comprehensive model of BG can be built in which the exploitative dynamics of the DP are combined with STN–GPe oscillations that drive exploratory behavior (Chakravarthy et al. 2010). There thus emerges a view in which the DP supports exploitation while the IP subserves exploration, differing from the classical Go/NoGo view of BG. A recent modeling study showed that the exploitation (DP) versus exploration (IP) view can be reconciled with the Go (DP) versus NoGo (IP) view by inserting a third regime, dubbed the Explore regime, between the classic Go and NoGo regimes (Kalva et al. 2012). A series of BG models based on this view have been developed to account for a wide variety of BG-related motor and cognitive behaviors, such as spatial navigation, saccades, reaching, and reward–punishment learning (Sridharan et al. 2006; Chakravarthy et al. 2010; Krishnan et al. 2011; Kalva et al. 2012; Priyadharsini et al. 2012; Gupta et al. 2013; Muralidharan et al. 2013).
The Basic Model

The basic network model embodying this view incorporates the following elements:
- The nigrostriatal dopamine signal as TD error, used to train the corticostriatal connections
- The action of dopamine in switching between the DP and IP, via its differential action on the D1 and D2 receptors of striatal medium spiny neurons
- Oscillations in the STN–GPe system
- Value computation in the striatum
- The classical Go (DP) and NoGo (IP) regimes, with the added “Explore” regime
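Dopamine’s differential action on D1 and D2 receptors, the pathway-switching element above, is often modeled as a soft switch between the two pathways. A minimal sketch follows; the sigmoidal gain function and its sharpness λ are assumptions of this illustration, not the model’s exact form:

```python
import math

def pathway_gains(delta_v, lam=4.0):
    """Soft switch between pathways: the D1R-MSN (direct pathway) gain
    rises and the D2R-MSN (indirect pathway) gain falls with the
    dopamine signal delta_v; lam (illustrative) sets the sharpness."""
    d1 = 1.0 / (1.0 + math.exp(-lam * delta_v))  # DP gain: high for positive delta_v
    d2 = 1.0 - d1                                # IP gain: high for negative delta_v
    return d1, d2
```

At strongly positive dopamine signals the DP gain dominates, at strongly negative signals the IP gain dominates, and near zero the two pathways contribute comparably.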
The STN–GPe System

Several manipulations are known to induce or enhance low-frequency oscillations in the STN–GPe system:
- Increased striatal input to GPe: this property is corroborated by electrophysiological data from Bergman et al. (1994). Kravitz et al. (2010) observed that increased firing of D2R-MSNs in the striatum induces a Parkinsonian-like state, with motor symptoms such as freezing, bradykinesia, and difficulty in movement initiation.
- Increased cortical input to STN: electrophysiological studies show that ablation of cortical areas projecting to STN largely abolishes low-frequency oscillations in STN–GPe (Magill et al. 2001).
- Reduced dopamine levels in STN–GPe: organotypic culture studies show that the STN–GPe system exhibits low-frequency oscillations under dopamine-deficient conditions, as in Parkinson’s disease (Plenz and Kital 1999). STN and GPe oscillations seem to be triggered by the effect of dopamine loss on D2R, which strengthens the STN–GPe coupling (Steiner and Tseng 2010).
If the STN–GPe system is to serve as a source of exploration, high spatiotemporal complexity in STN activity, manifested as low pairwise correlations among neurons, is expected. Figures 3 and 4 illustrate the ability of the parameter ε_s to control the correlation within STN and show that ε_s can thereby be used to control exploration in the BG model.
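The dynamics of an excitatory–inhibitory pair like STN–GPe can be illustrated with a two-unit rate model. The weights, transmission delay, and cortical drive below are illustrative choices, not fitted parameters; depending on the coupling and delay, such a pair can settle to a fixed point or oscillate:

```python
import math

def simulate_stn_gpe(w_sg, w_gs, drive=1.0, delay_steps=50,
                     steps=4000, dt=0.01, tau=0.5):
    """Rate-model sketch of the STN (excitatory) - GPe (inhibitory) loop.
    STN is driven by cortex and inhibited by delayed GPe activity;
    GPe is excited by STN. Returns the STN activity trace."""
    x_stn, x_gpe = 0.1, 0.0
    gpe_buffer = [0.0] * delay_steps  # crude synaptic transmission delay
    trace = []
    for _ in range(steps):
        gpe_delayed = gpe_buffer.pop(0)
        dx_stn = (-x_stn + math.tanh(drive - w_gs * gpe_delayed)) / tau
        dx_gpe = (-x_gpe + math.tanh(w_sg * x_stn)) / tau
        x_stn += dt * dx_stn
        x_gpe += dt * dx_gpe
        gpe_buffer.append(x_gpe)
        trace.append(x_stn)
    return trace
```

The saturating `tanh` nonlinearity keeps the activities bounded, while the excitatory–inhibitory loop with delay supplies the phase lag needed for rhythmic behavior at strong coupling.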
GPi combines the GABAergic striatal output of the DP with the glutamatergic STN output of the IP. There is evidence that this combination of DP and IP outflows in GPi is modulated by dopamine projections to GPi. Activation of D1Rs in GPi, located primarily on the GABAergic striato-pallidal axonal projections, reduces the firing of GPi neurons (Kliem et al. 2007). Since D1Rs are activated at increased dopamine levels, this facilitation of the DP outflow over the IP at higher dopamine levels is consistent with the dopamine-mediated switching seen in the striatum.
Action Selection in Thalamus
If the primary function of the BG circuit is action selection, where in the circuit is the precise site of such selection? If action salience is computed in the striatum and STN–GPe provides exploration, action selection could be happening downstream in GPi or in the thalamic nuclei receiving afferents from GPi. The competitive dynamics of neurons of the thalamic reticular complex make them ideally suited for implementing action selection (Humphries and Gurney 2002). During binary action selection in the model, the GPi outputs to thalamus converge on two neurons that represent the two action alternatives. These two thalamic neurons integrate the GPi inputs through time: the one that first crosses a preset threshold wins the competition, while the other neuron is immediately reset. Accordingly, if x_i^Thal(t) > x_th for either i (= 1, 2) at time t, action i is selected and the states of all other thalamic neurons are immediately reset: x_j^Thal(t) = 0, j ≠ i. If no x_i^Thal(t) reaches x_th, no action is selected.
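The race-to-threshold mechanism just described can be sketched as follows; the integration rule, time step, and threshold are illustrative stand-ins for the model’s equations:

```python
def select_action(gpi_inputs, x_th=1.0, dt=0.01, max_steps=10000):
    """Race-model sketch of thalamic action selection: each unit
    integrates the disinhibition it receives (1 minus its inhibitory
    GPi input); the first to cross x_th wins and all others are reset.
    Returns the winning index, or None (NoGo) if no unit crosses."""
    x = [0.0] * len(gpi_inputs)
    for _ in range(max_steps):
        for i, g in enumerate(gpi_inputs):
            x[i] += dt * (1.0 - g)  # weaker GPi inhibition -> faster rise
            if x[i] > x_th:
                return i            # winner; x_j = 0 for all j != i
    return None                     # no unit reached threshold: no action
```

The action whose GPi inhibition is weakest integrates fastest and wins; if GPi suppresses all channels equally strongly, no unit reaches threshold and no action is emitted.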
Binary Action Selection
In binary action selection, three outcomes are possible:
- “Go” – the winning neuron has the greater salience.
- “Explore” – the winning neuron has the lesser salience.
- “NoGo” – there is no winner, and therefore no action is selected.
Modeling the N-Armed Bandit Problem
In the n-armed bandit problem, a generalization of the binary action selection problem, the setup consists of n slot machines, each delivering a fixed reward (deterministically or probabilistically) on selection; the objective is to maximize the total reward received by the agent. The features that must be added to binary action selection to simulate the n-armed bandit problem are (1) value computation in the striatum, (2) feedback of the previous action to the striatum, and (3) resolution of the nigrostriatal signal into two dopamine signals, δ_TD and δ_V.
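For concreteness, a bandit environment and a simple softmax learner can be set up as below. The payout probabilities, learning rate, and inverse temperature β are invented for the illustration, and the softmax here is a textbook baseline, not the BG model, which replaces it with the DP/IP machinery:

```python
import math
import random

def make_bandit(payout_probs, seed=0):
    """n-armed bandit: arm k pays 1 with probability payout_probs[k]."""
    rng = random.Random(seed)
    return lambda k: 1.0 if rng.random() < payout_probs[k] else 0.0

def softmax_choice(values, beta, rng):
    """Pick an arm with probability proportional to exp(beta * value);
    higher beta (inverse temperature) means more exploitation."""
    weights = [math.exp(beta * v) for v in values]
    r = rng.random() * sum(weights)
    for k, w in enumerate(weights):
        r -= w
        if r < 0:
            return k
    return len(values) - 1

rng = random.Random(1)
pull = make_bandit([0.2, 0.8])
Q = [0.0, 0.0]                      # running value estimates per arm
for _ in range(2000):
    k = softmax_choice(Q, beta=3.0, rng=rng)
    Q[k] += 0.1 * (pull(k) - Q[k])  # incremental estimate of arm payout
# Q[1] settles near its true payout of 0.8 and well above Q[0].
```

The learner converges on the richer arm while still sampling the poorer one occasionally, the exploitation–exploration trade-off the BG model is meant to implement neurally.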
The difference between the TD error (Eqs. 2 and 3) and the value gradient (Eq. 5) is as follows: the TD error controls learning of the corticostriatal weights (Eq. 4), whereas the value gradient controls the gains of the D1R- and D2R-MSNs. D1R-MSNs are modeled as activated at higher levels of δ_V and D2R-MSNs at lower levels. Thus, δ_V controls exploration by determining the relative contributions of the DP and IP.
A large positive δ_V implies a large increase in value and therefore recommends selecting the same action next time (Go), since the contribution of the DP to GPi dominates that of STN. A large negative δ_V implies a large reduction in value; since strongly negative δ_V engages the IP, the IP contribution dominates that of the DP at GPi, thereby suppressing action (NoGo). For small magnitudes of δ_V, the DP contribution is still reduced and, driven by the complex dynamics of STN, a random action is selected next time (Explore).
Climbing the Value Gradient Using δ_V
It can be observed in the network model used for the n-armed bandit problem that the value function increases gradually as the selected action is iterated through the thalamo-striatal loop of Fig. 2. The network dynamics thus implements a form of stochastic hill-climbing over the value function, thanks to the complex dynamics of the STN–GPe system. In fact, the aforementioned effect of δ_V on value change (selecting the previous/random/no action for large positive/intermediate/large negative values of δ_V, respectively) is strongly reminiscent of simulated annealing – a form of stochastic optimization (Kirkpatrick et al. 1983). The BG model thus exhibits three behaviors depending on dopamine: (1) the Go regime (“repeat the previous action”) for large positive δ_V, (2) the Explore regime (“try random actions”) for intermediate values of δ_V, and (3) the NoGo regime (“no action”) for large negative values of δ_V. These regimes inspire a simple mechanism for hill-climbing in continuous action spaces, as follows.
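This three-regime mechanism can be caricatured as a stochastic hill-climber. The code below is a sketch under stated assumptions: the thresholds, step size, and test function are invented, and it is not the model’s Eq. 9:

```python
import random

def gen_hill_climb(value, x0, steps=500, step_size=0.1,
                   nogo_th=-0.05, seed=0):
    """Go/Explore/NoGo caricature of stochastic hill-climbing over a
    scalar value function: repeat a move that raised the value (Go),
    accept but re-randomize after a small drop (Explore), and reject
    the move after a large drop (NoGo)."""
    rng = random.Random(seed)
    x, v_prev = x0, value(x0)
    dx = step_size * (2 * rng.random() - 1)
    for _ in range(steps):
        x_new = x + dx
        delta_v = value(x_new) - v_prev  # value-gradient signal
        if delta_v > 0:                  # Go: accept and keep direction
            x, v_prev = x_new, v_prev + delta_v
        elif delta_v > nogo_th:          # Explore: accept, random new step
            x, v_prev = x_new, v_prev + delta_v
            dx = step_size * (2 * rng.random() - 1)
        else:                            # NoGo: reject, random new step
            dx = step_size * (2 * rng.random() - 1)
    return x

# Climb the single-peaked value function V(x) = -(x - 2)^2 from x = 0.
x_star = gen_hill_climb(lambda x: -(x - 2.0) ** 2, x0=0.0)
```

The Go branch repeats a successful direction (rapid ascent), the Explore branch supplies small stochastic detours akin to an annealing schedule, and the NoGo branch vetoes large losses, so the state hovers near the value peak.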
Equation 9, known as the GEN (Go/Explore/NoGo) rule, is an abstract, summarized representation of how the BG select actions based on dopamine signals. The GEN rule, and the approach it is based on, has been applied successfully to model a range of BG functions.
Application of RL concepts to BG function (Houk et al. 1995; Schultz et al. 1997; Hollerman and Schultz 1998) calls for a revision of the Go/NoGo picture, as such models may not have given sufficient attention to exploration, the complementary process to exploitation. The Go/NoGo picture of BG can explain how the BG circuit learns simple binary action selection using RL, but it is inadequate to explain how the circuit can solve more challenging RL problems in continuous state and action spaces.
The binary Go/NoGo view of BG function is supported by a simplistic interpretation of the functional neurochemistry of the two BG pathways (Albin et al. 1989; Contreras-Vidal and Stelmach 1995). But the presence of feedback from STN to GPe allows the possibility of complex dynamics in the STN–GPe loop, adding a complication to our functional understanding of the BG pathways. The STN–GPe loop has also been dubbed the “pacemaker” of BG, in view of its role in generating the pathological oscillations associated with Parkinsonian tremor (Hurtado et al. 1999; Terman et al. 2002). The STN–GPe system is an excitatory–inhibitory pair capable of exhibiting oscillations and other forms of complex dynamics (Brunel 2000). The fact that neurons in this system fire in an uncorrelated fashion under normal conditions, and in a highly correlated, synchronized fashion under dopamine-deficient pathological conditions, offers an important clue to the possible role of this circuit in exploration. From the above studies, it can be concluded that the STN–GPe system is well placed to serve as an explorer, supplying the missing piece in the RL machinery of BG. Behavioral strategies of reaching, saccades, spatial navigation, gait, and willed action are better modeled using this theory, as outlined below.
A model of reaching movements highlighting the role of BG was described by Magdoom et al. (2011), in which a neural network representing motor cortex is trained to drive a two-joint arm to a target. The output of the BG, which is predominant in the early stages of learning, is combined with that of the motor cortex, whose relative contribution grows with learning. The BG dynamics, governed by the GEN rule described earlier, discovers the desired activations, which the motor cortex then uses for learning. When the dopamine signal was clamped to reflect the dopamine deficiency of Parkinsonian conditions, the model exhibited Parkinsonian features of reaching such as bradykinesia, undershoot, and tremor.
The idea that the DP and IP subserve exploitation and exploration, respectively, was used in a model of saccade generation by the BG (Krishnan et al. 2011), applied to standard visual search tasks such as feature and conjunction search, directional saccades, and sequential saccades. On simulating Parkinsonian conditions by diminishing BG output, the model exhibited impaired visual search with longer reaction times – a characteristic symptom in patients with Parkinson’s disease (PD).
The GEN approach (Eqs. 6, 8, and 9) was used to model the relative contributions of the BG and hippocampus to spatial navigation (Sukumar et al. 2012). The model combines two navigational systems: a cue-based system subserved by the BG and a place-based system subserved by the hippocampus. The two navigational systems are associated with their respective value functions, which are combined by a softmax policy (Sutton and Barto 1998) to select the next move. The model reproduces the results of an experimental study of the competition between cue-based and place-based navigation (Devan and White 1999). Under dopamine-deficient conditions, the model exhibited longer escape latencies, similar to PD-model rats (Miyoshi et al. 2002).
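The softmax arbitration between the two value functions can be sketched as follows; the inverse temperature β and the value numbers are illustrative:

```python
import math
import random

def choose_system(v_cue, v_place, beta=2.0, rng=None):
    """Pick the cue-based (BG) or place-based (hippocampal) system with
    softmax probabilities over their current values; beta is the
    inverse temperature controlling how decisive the choice is."""
    rng = rng or random.Random(0)
    # Two-option softmax reduces to a logistic in the value difference.
    p_cue = 1.0 / (1.0 + math.exp(-beta * (v_cue - v_place)))
    return "cue" if rng.random() < p_cue else "place"

# With a clearly better cue value, the cue system is chosen most of the time.
rng = random.Random(0)
picks = [choose_system(2.0, 0.0, rng=rng) for _ in range(1000)]
```

Lowering β softens the arbitration, letting the weaker system be sampled more often, which is how the policy retains some exploration across navigational strategies.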
The GEN approach to BG was also applied to model impaired gait patterns in PD (Muralidharan et al. 2013). Cowie et al. (2010) investigated gait changes as PD patients walked through a narrow doorway and observed a strong dip in velocity a short distance from the doorway. In the model of Muralidharan et al. (2013), the simulated agent passed through the doorway without any significant velocity dip under control conditions, but exhibited a significant reduction in velocity close to the doorway under PD conditions.
Clinical literature shows that the BG play a role in willed action, and impairment of willed action is seen in BG lesions and in diseases, such as Parkinson’s disease, that affect the BG (Mink 2003). It was recently suggested that the BG circuit amplifies presumably weak will signals by a stochastic resonance process (Chakravarthy 2013). That study shows that the GEN policy, a combination of a deterministic hill-climbing process and a stochastic process, may be reinterpreted as a form of stochastic resonance. Applying the model to a simple reaching task, it was shown that the arm reaches the target with probability close to unity at optimal noise levels. The arm dynamics for subthreshold noise is reminiscent of Parkinsonian akinesia, whereas for superthreshold noise the arm shows uncontrolled movements resembling Parkinsonian dyskinesias.
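The threshold-crossing intuition behind this stochastic-resonance account can be illustrated numerically. The signal strength, threshold, and noise levels below are invented numbers, and a full resonance curve would additionally penalize excessive noise (the dyskinetic regime):

```python
import random

def crossing_probability(signal, noise_sd, threshold=1.0,
                         trials=2000, seed=0):
    """Fraction of trials on which a subthreshold 'will' signal plus
    Gaussian noise exceeds the motor threshold."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials)
               if signal + rng.gauss(0.0, noise_sd) > threshold)
    return hits / trials

# A signal of 0.8 never crosses the threshold of 1.0 on its own,
# but moderate noise lets it cross on a substantial fraction of trials.
p_silent = crossing_probability(0.8, 0.0)
p_noisy = crossing_probability(0.8, 0.5)
```

With zero noise the subthreshold signal never elicits movement (the akinesia-like regime), while moderate noise produces frequent crossings, capturing the noise-assisted amplification of weak will signals.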
The perspective that the BG circuit is an exploration engine, carved out of the popular RL approach to BG modeling, can thus be substantiated.
- Gillies A, Willshaw D et al (2002) Functional interactions within the subthalamic nucleus. In: The basal ganglia VII. Springer, New York, pp 359–368
- Houk JC, Davis JL et al (1995) Models of information processing in the basal ganglia. MIT Press, Cambridge, MA
- Priyadharsini BP, Ravindran B et al (2012) Understanding the role of serotonin in basal ganglia through a unified model. In: Artificial neural networks and machine learning – ICANN 2012. Springer, Berlin, pp 467–473
- Stein PS, Grillner S et al (1997) Neurons, networks, and behavior. MIT Press, Cambridge, MA
- Steiner H, Tseng KY (2010) Handbook of basal ganglia structure and function: a decade of progress. Academic Press, San Diego
- Sutton R, Barto A (1998) Reinforcement learning: an introduction. Adaptive computation and machine learning. MIT Press/Bradford, Cambridge, MA