# Exploiting submodular value functions for scaling up active perception


## Abstract

In *active perception* tasks, an agent aims to select sensory actions that reduce its uncertainty about one or more hidden variables. For example, a mobile robot takes sensory actions to efficiently navigate in a new environment. While partially observable Markov decision processes (POMDPs) provide a natural model for such problems, *reward functions* that directly penalize uncertainty in the agent’s belief can remove the piecewise-linear and convex (PWLC) property of the *value function* required by most POMDP planners. Furthermore, as the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially with it, making POMDP planning infeasible with traditional methods. In this article, we address a twofold challenge of modeling and planning for active perception tasks. We analyze \(\rho \)POMDP and POMDP-IR, two frameworks for modeling active perception tasks, that restore the PWLC property of the value function. We show the mathematical equivalence of these two frameworks by showing that given a \(\rho \)POMDP along with a policy, they can be reduced to a POMDP-IR and an equivalent policy (and vice-versa). We prove that the value function for the given \(\rho \)POMDP (and the given policy) and the reduced POMDP-IR (and the reduced policy) is the same. To efficiently plan for active perception tasks, we identify and exploit the independence properties of POMDP-IR to reduce the computational cost of solving POMDP-IR (and \(\rho \)POMDP). We propose greedy point-based value iteration (PBVI), a new POMDP planning method that uses *greedy maximization* to greatly improve scalability in the action space of an active perception POMDP. Furthermore, we show that, under certain conditions, including *submodularity*, the value function computed using greedy PBVI is guaranteed to have bounded error with respect to the optimal value function. 
We establish the conditions under which the value function of an active perception POMDP is guaranteed to be *submodular*. Finally, we present a detailed empirical analysis on a dataset collected from a multi-camera tracking system employed in a shopping mall. Our method achieves similar performance to existing methods but at a fraction of the computational cost, leading to better scalability for solving active perception tasks.

### Keywords

Sensor selection, Long-term planning, Mobile sensors, Submodularity, POMDP

## 1 Introduction

*Multi-sensor systems* are becoming increasingly prevalent in a wide range of settings. For example, multi-camera systems are now routinely used for security, surveillance and tracking (Kreucher et al. 2005; Natarajan et al. 2012; Spaan et al. 2015). A key challenge in the design of these systems is the efficient allocation of scarce resources such as the bandwidth required to communicate the collected data to a central server, the CPU cycles required to process that data, and the energy costs of the entire system (Kreucher et al. 2005; Williams et al. 2007; Spaan and Lima 2009). For example, state-of-the-art human activity recognition algorithms require high-resolution video streams coupled with significant computational resources. When a human operator must monitor many camera streams, displaying only a small number of them can reduce the operator’s cognitive load. IP-cameras connected directly to a local area network need to share the available bandwidth. Such constraints give rise to the *dynamic sensor selection* problem where an agent at each time step must select *K* out of the *N* available sensors to allocate these resources to, where *K* is the maximum number of sensors allowed given the resource constraints (Satsangi et al. 2015).^{1}

For example, consider the *surveillance task*, in which a mobile robot aims to minimize its future uncertainty about the state of the environment but can use only *K* of its *N* sensors at each time step. Surveillance is an example of an *active perception* task, where an agent takes actions to reduce uncertainty about one or more hidden variables, while reasoning about various resource constraints (Bajcsy 1988). When the state of the environment is static, a *myopic* approach that always selects actions that maximize the immediate expected reduction in uncertainty is typically sufficient. However, when the state changes over time, a *non-myopic* approach that reasons about the long term effects of action selection performed at each time step can be better. For example, in the surveillance task, as the robot moves and the state of the environment changes, it becomes essential to reason about the long term consequences of the robot’s actions to minimize the future uncertainty.

A natural decision-theoretic model for such an approach is the *partially observable Markov decision process* (POMDP) (Sondik 1971; Kaelbling et al. 1998; Kochenderfer 2015). POMDPs provide a comprehensive and powerful framework for planning under uncertainty. They can model the dynamic and partially observable state and express the goals of the systems in terms of rewards associated with state-action pairs. This model of the world can be used to compute closed-loop, long term policies that can help the agent to decide what actions to take given a belief about the state of the environment (Burgard et al. 1997; Kurniawati et al. 2011).

In a typical POMDP reducing uncertainty about the state is only *a means to an end*. For example, a robot whose goal is to reach a particular location may take sensing actions that reduce its uncertainty about its current location because doing so helps it determine what future actions will bring it closer to its goal. By contrast, in active perception problems reducing uncertainty is *an end in itself*. For example, in the surveillance task, the system’s goal is typically to ascertain the state of its environment, not use that knowledge to achieve a goal. While perception is arguably always performed to aid decision-making, in an active perception problem that decision is made by another agent such as a human, that is not modeled as a part of the POMDP. For example, in the surveillance task, the robot might be able to detect a suspicious activity but only the human users of the system may decide how to react to such an activity.

One way to formulate uncertainty reduction as an end in itself is to define a *reward function* whose additive inverse is some measure of the agent’s uncertainty about the hidden state, e.g., the *entropy* of its *belief*. However, this formulation leads to a reward function that conditions on the belief rather than the state, and the resulting *value function* is not PWLC, which makes many traditional POMDP solvers inapplicable. There exist *online planning methods* (Silver and Veness 2010; Bonet and Geffner 2009) that generate policies on the fly and do not require the PWLC property of the value function. However, many of these methods require multiple ‘hypothetical’ belief updates to compute the optimal policy, which makes them unsuitable for sensor selection where the optimal policy must be computed in a fraction of a second. There exist other online planning methods that do not require hypothetical belief updates (Silver and Veness 2010), but since we are dealing with belief-based rewards, they cannot be directly applied here. Here, we address the case of *offline planning* where the policy is computed before the execution of the task.

Thus, to efficiently solve active perception problems, we must (a) model the problem with minimizing uncertainty as the objective while maintaining a PWLC value function and (b) use this model to solve the POMDP efficiently. Recently, two frameworks have been proposed, \(\rho \)*POMDP* (Araya-López et al. 2010) and *POMDP with Information Reward* (POMDP-IR) (Spaan et al. 2015) to efficiently model active perception tasks, such that the PWLC property of the value function is maintained. The idea behind \(\rho \)POMDP is to find a PWLC approximation to the “true” continuous belief-based reward function, and then solve it with the traditional solvers. POMDP-IR, on the other hand, allows the agent to make predictions about the hidden state and the agent is rewarded for accurate predictions via a state-based reward function. There is no research that examines the relationship between these two frameworks, their pros and cons, or their efficacy in realistic tasks, thus it is not clear how to choose between these two frameworks to model the active perception problems.

In this article, we address the problem of efficient modeling and planning for active perception tasks. First, we study the relationship between \(\rho \)POMDP and POMDP-IR. Specifically, we establish *equivalence* between them by showing that any \(\rho \)POMDP can be reduced to a POMDP-IR (and vice-versa) that preserves the value function for equivalent policies. Having established the theoretical relationship between \(\rho \)POMDP and POMDP-IR, we model the surveillance task as a POMDP-IR and propose a new method to solve it efficiently by exploiting a simple insight that lets us decompose the maximization over prediction actions and normal actions while computing the value function.

Although POMDPs are computationally difficult to solve, recent methods (Littman 1996; Hauskrecht 2000; Pineau et al. 2006; Spaan and Vlassis 2005; Poupart 2005; Ji et al. 2007; Kurniawati et al. 2008; Shani et al. 2013) have proved successful in solving POMDPs with large state spaces. Solving active perception POMDPs poses a different challenge: as the number of sensors grows, the size of the action space \(\left( {\begin{array}{c}N\\ K\end{array}}\right) \) grows exponentially with it. Current POMDP solvers fail to address the scalability in the action space of a POMDP. We propose a new *point-based* planning method that scales much better in the number of sensors for such POMDPs. The main idea is to replace the maximization operator in the Bellman optimality equation with *greedy maximization* in which a subset of sensors is constructed iteratively by adding the sensor that gives the largest marginal increase in value.

We present theoretical results bounding the error in the value functions computed by this method. We prove that, under certain conditions including *submodularity*, the value function computed using POMDP backups based on greedy maximization has bounded error. We achieve this by extending the existing results (Nemhauser et al. 1978) for the greedy algorithm, which are valid only for a single time step, to a full sequential decision making setting where the greedy operator is employed multiple times over multiple time steps. In addition, we show that the conditions required for such a guarantee to hold are met, or approximately met, if the reward is defined using negative belief entropy.

Finally, we present a detailed empirical analysis on a real-life dataset from a multi-camera tracking system installed in a shopping mall. We identify and study the critical factors relevant to the performance and behavior of the agent in active perception tasks. We show that our proposed planner outperforms a myopic baseline and nearly matches the performance of existing point-based methods while incurring only a fraction of the computational cost, leading to much better scalability in the number of cameras.

## 2 Related work

Sensor selection as an active perception task has been studied in many contexts. Most work focuses on either open-loop or myopic solutions, e.g., Kreucher et al. (2005), Spaan and Lima (2009), Williams et al. (2007), Joshi and Boyd (2009). Kreucher et al. (2005) propose a Monte-Carlo approach that mainly focuses on a myopic solution. Williams et al. (2007) and Joshi and Boyd (2009) developed planning methods that can provide long-term but open-loop policies. By contrast, a POMDP-based approach enables a closed-loop, non-myopic approach that can lead to better performance when the underlying state of the world changes over time. Spaan (2008), Spaan and Lima (2009), Spaan et al. (2010) and Natarajan et al. (2012) also consider a POMDP-based approach to active perception and cooperative active perception. However, they consider an objective function that conditions on the state and not on the belief, as belief-dependent rewards in a POMDP break the PWLC property of the value function. They use point-based methods (Spaan and Vlassis 2005) for solving the POMDPs. While recent point-based methods (Shani et al. 2013) for solving POMDPs scale reasonably in the state space of the POMDPs, they do not address the scalability in the action and observation space of a POMDP.

In recent years, applying greedy maximization to submodular functions has become a popular and effective approach to sensor placement/selection (Krause and Guestrin 2005, 2007; Kumar and Zilberstein 2009; Satsangi et al. 2016). However, such work focuses on myopic or fully observable settings and thus does not enable the long-term planning required to cope with the dynamic state in a POMDP.

*Adaptive submodularity* (Golovin and Krause 2011) is a recently developed extension that addresses these limitations by allowing action selection to condition on previous observations. However, it assumes a static state and thus cannot model the dynamics of a POMDP across timesteps. Therefore, in a POMDP, adaptive submodularity is only applicable *within* a timestep, during which state does not change but the agent can sequentially add sensors to a set. In principle, adaptive submodularity could enable this intra-timestep sequential process to be adaptive, i.e., the choice of later sensors could condition on the observations generated by earlier sensors. However, this is not possible in our setting because (a) we assume that, due to computational costs, all sensors must be selected simultaneously; (b) information gain is not known to be adaptive submodular (Chen et al. 2015). Consequently, our analysis considers only classic, non-adaptive submodularity.

To our knowledge, our work is the first to establish the sufficient conditions for the submodularity of POMDP value functions for active perception POMDPs and thus leverage greedy maximization to scalably compute bounded approximate policies for dynamic sensor selection modeled as a full POMDP.

## 3 Background

In this section, we provide background on POMDPs, active perception POMDPs and solution methods for POMDPs.

### 3.1 Partially observable Markov decision processes

At each time step, the environment is in a state \(s \in S\), the agent takes an action \(a \in A\) and receives a reward *R*(*s*, *a*), and the system transitions to a new state \(s' \in S\) according to the transition function \(T(s,a,s') = \Pr (s'|s,a)\). Then, the agent receives an observation \(z \in \varOmega \) according to the observation function \(O(s',a,z) = \Pr (z|s',a)\). Starting from an initial belief \(b_0\), the agent maintains a *belief* *b*(*s*) about the state, which is a probability distribution over all the possible states. The number of time steps for which the decision process lasts, i.e., the horizon, is denoted by *h*. If the agent takes an action *a* in belief *b* and gets an observation *z*, then the updated belief \(b^{a,z}(s)\) can be computed using Bayes rule. A policy \(\pi \) specifies how the agent acts in each belief. Given *b*(*s*) and *R*(*s*, *a*), one can compute a belief-based reward, \(\rho (b,a)\) as:

\[\rho (b,a) = \sum _{s} b(s) R(s,a). \tag{1}\]

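The *t*-step *value function* of a policy \(V^{\pi }_{t}\) is defined as the expected future discounted reward the agent can gather by following \(\pi \) for next *t* steps. \(V^{\pi }_{t}\) can be characterized recursively using the *Bellman equation*:

\[V^{\pi }_{t}(b) = \rho (b,a_{\pi }) + \sum _{z \in \varOmega } \Pr (z|a_{\pi },b)\, V^{\pi }_{t-1}(b^{a_{\pi },z}),\]

where \(a_{\pi } = \pi (b)\) and \(V^{\pi }_{0}(b) = 0\). The action-value function \(Q^{\pi }_{t}(b,a)\) is the value of taking action *a* and following \(\pi \) thereafter:

\[Q^{\pi }_{t}(b,a) = \rho (b,a) + \sum _{z \in \varOmega } \Pr (z|a,b)\, V^{\pi }_{t-1}(b^{a,z}).\]

The policy that maximizes \(V^{\pi }_{t}\) is called the *optimal policy* \(\pi ^{*}\) and the corresponding value function is called the *optimal value function* \(V^{*}_{t}\). The *optimal value function* \(V^{*}_{t}(b)\) can be characterized recursively as:

\[V^{*}_{t}(b) = \max _{a} \Big [ \rho (b,a) + \sum _{z \in \varOmega } \Pr (z|a,b)\, V^{*}_{t-1}(b^{a,z}) \Big ].\]

This recursion can also be written as \(V^{*}_{t} = \mathfrak {B}^* V^{*}_{t-1}\) using the *Bellman optimality operator* \(\mathfrak {B}^*\):

\[(\mathfrak {B}^* V_{t-1})(b) = \max _{a} \Big [ \rho (b,a) + \sum _{z \in \varOmega } \Pr (z|a,b)\, V_{t-1}(b^{a,z}) \Big ].\]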

An important consequence of these equations is that the value function is *piecewise-linear and convex* (PWLC), as shown in Fig. 1, a property exploited by most POMDP planners. Sondik (1971) showed that a PWLC value function at any finite time step *t* can be expressed as a set of vectors: \(\varGamma _{t} = \{ \alpha _{0}, \alpha _{1}, \dots , \alpha _{m} \}\). Each \(\alpha _i\) represents an |*S*|-dimensional hyperplane defining the value function over a bounded region of the belief space. The value of a given belief point can be computed from the vectors as: \(V_{t}^{*}(b) = \max _{\alpha _{i} \in \varGamma _t} \sum _{s} b(s)\alpha _{i}(s)\).
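The quantities in this subsection can be made concrete with a small sketch. The code below is only an illustration under assumed data structures (dictionaries keyed by state, action and observation names); the function names and the two-state example are hypothetical and not taken from the article. It shows the Bayes-rule belief update, the belief-based reward as a belief-weighted sum of *R*(*s*, *a*), and the evaluation of a PWLC value function as a maximum over \(\alpha \)-vectors.

```python
from typing import Dict, List, Tuple

Belief = Dict[str, float]


def belief_update(b: Belief, a: str, z: str,
                  T: Dict[Tuple[str, str, str], float],
                  O: Dict[Tuple[str, str, str], float]) -> Belief:
    """Bayes-rule update: b'(s') is proportional to O(s',a,z) * sum_s T(s,a,s') b(s)."""
    unnormalized = {
        s2: O[(s2, a, z)] * sum(T[(s, a, s2)] * b[s] for s in b)
        for s2 in b
    }
    prob_z = sum(unnormalized.values())  # Pr(z | b, a)
    if prob_z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: p / prob_z for s2, p in unnormalized.items()}


def belief_reward(b: Belief, a: str, R: Dict[Tuple[str, str], float]) -> float:
    """Belief-based reward: the belief-weighted sum of the state-based reward."""
    return sum(b[s] * R[(s, a)] for s in b)


def value(b: Belief, alpha_vectors: List[Dict[str, float]]) -> float:
    """PWLC value: maximum over alpha vectors of the inner product with b."""
    return max(sum(b[s] * alpha[s] for s in b) for alpha in alpha_vectors)


# Hypothetical example: a static target that is either left ("L") or right ("R"),
# observed by a sensor that reports the correct side with probability 0.8.
STATES = ["L", "R"]
T = {(s, "look", s2): float(s == s2) for s in STATES for s2 in STATES}
O = {("L", "look", "zL"): 0.8, ("L", "look", "zR"): 0.2,
     ("R", "look", "zL"): 0.2, ("R", "look", "zR"): 0.8}
b0 = {"L": 0.5, "R": 0.5}
b1 = belief_update(b0, "look", "zL", T, O)
```

In this toy setting, a single "zL" reading moves the uniform belief towards "L", and `value` can then be applied to `b1` with any set of \(\alpha \)-vectors.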

### 3.2 POMDP solvers

Exact methods like Monahan’s enumeration algorithm (Monahan 1982) compute the value function for all possible belief points by computing the optimal \(\varGamma _{t}\). Point-based planners (Pineau et al. 2006; Shani et al. 2013; Spaan and Vlassis 2005), on the other hand, avoid the expense of solving for all belief points by computing \(\varGamma _{t}\) only for a set of sampled beliefs *B*. Since the exact POMDP solvers (Sondik 1971; Monahan 1982) are intractable for all but the smallest POMDPs, we focus on point-based methods here. Point-based methods compute \(\varGamma _{t}\) using the following recursive algorithm.

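First, for each action *a* and observation *z*, an intermediate \(\varGamma ^{a,z}_{t}\) is computed from \(\varGamma _{t-1}\):

\[\varGamma ^{a,z}_{t} = \{ \alpha ^{a,z}_{i} : \alpha _{i} \in \varGamma _{t-1} \}, \quad \alpha ^{a,z}_{i}(s) = \sum _{s'} T(s,a,s')\, O(s',a,z)\, \alpha _{i}(s').\]

These backprojections are then combined into one vector per action and sampled belief, and the best vector for each \(b \in B\) is retained. Each iteration *t* generates \(|A||\varOmega ||\varGamma _{t-1}|\) alpha vectors in \(\mathcal {O}(|S|^2 |A||\varOmega ||\varGamma _{t-1}|)\) time and then reduces them to |*B*| vectors in \(\mathcal {O}(|S||B||A||\varOmega ||\varGamma _{t-1}|)\) (Pineau et al. 2006).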
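A point-based backup of this kind can be sketched as follows. This is a minimal, undiscounted illustration of a standard point-based backup, not the authors' implementation: states, actions and observations are assumed to be integer indices, the tables `T[s][a][s2]`, `O[s2][a][z]` and `R[s][a]` are nested lists, and all names are hypothetical.

```python
from math import inf
from typing import List

Vector = List[float]


def dot(b: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(b, v))


def point_based_backup(B: List[Vector], gamma_prev: List[Vector],
                       n_states: int, actions: List[int], observations: List[int],
                       T, O, R) -> List[Vector]:
    """Return one alpha vector per sampled belief in B."""
    states = range(n_states)
    # Step 1: backproject every vector of the previous stage for each (a, z).
    back = {
        (a, z): [
            [sum(T[s][a][s2] * O[s2][a][z] * alpha[s2] for s2 in states) for s in states]
            for alpha in gamma_prev
        ]
        for a in actions for z in observations
    }
    new_gamma = []
    for b in B:
        best_vec, best_val = None, -inf
        for a in actions:
            # Step 2: for this belief, add the best backprojection per observation
            # to the immediate state-based reward.
            vec = [R[s][a] for s in states]
            for z in observations:
                best_az = max(back[(a, z)], key=lambda v: dot(b, v))
                vec = [x + y for x, y in zip(vec, best_az)]
            # Step 3: keep the vector of the best action for this belief.
            val = dot(b, vec)
            if val > best_val:
                best_vec, best_val = vec, val
        new_gamma.append(best_vec)
    return new_gamma


# Hypothetical example: static two-state world; action 0 yields an uninformative
# observation, action 1 reveals the state exactly. Rewards are zero at this stage.
T = [[[float(s == s2) for s2 in range(2)] for _a in range(2)] for s in range(2)]
O = [[[0.5, 0.5], [float(z == s2) for z in range(2)]] for s2 in range(2)]
R = [[0.0, 0.0] for _s in range(2)]
gamma_prev = [[1.0, 0.0], [0.0, 1.0]]
gamma_new = point_based_backup([[0.5, 0.5], [0.9, 0.1]], gamma_prev, 2, [0, 1], [0, 1], T, O, R)
```

In this example the informative action is preferred at both sampled beliefs, because after a perfect observation the agent can always pick the previous-stage vector that matches the true state.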

## 4 Active perception POMDP

The goal in an active perception POMDP is to reduce uncertainty about a *feature of interest* that is not directly observable. In general, the feature of interest may be only a part of the state, e.g., if a surveillance system cares only about people’s positions, not their velocities, or higher-level features derived from the state. However, for simplicity, we focus on the case where the feature of interest is just the state *s*^{2} of the POMDP. For simplicity, we also focus on *pure* active perception tasks in which the agent’s only goal is to reduce uncertainty about the state, as opposed to hybrid tasks where the agent may also have other goals. For such cases, *hybrid* rewards (Eck and Soh 2012), which combine the advantage of belief-based and state-based rewards, are appropriate. Although not covered in this article, it is straightforward to extend our results to hybrid tasks (Spaan et al. 2015).

Actions \(\mathbf {a} = \langle a_{1} \dots a_{N} \rangle \) are vectors of *N* binary *action features*, each of which specifies whether a given sensor is selected or not. For each \(\mathbf {a}\), we also define its set equivalent \(\mathfrak {a} = \{i : a_i = 1\}\), i.e., the set of indices of the selected sensors. Due to the resource constraints, the set of all actions \(A = \{\mathfrak {a} : |\mathfrak {a}| \le K \}\) contains only sensor subsets of size *K* or less. \(A^+=\{1,\ldots ,N\}\) indicates the set of all sensors.

Observations \(\mathbf {z} = \langle z_{1} \dots z_{N} \rangle \) are vectors of *N* *observation features*, each of which specifies the sensor reading obtained by the given sensor. If sensor *i* is not selected, then \(z_i = \emptyset \). The set equivalent of \(\mathbf {z}\) is \(\mathfrak {z} = \{z_i : z_i \ne \emptyset \}\). To prevent ambiguity about which sensor generated which observation in \(\mathfrak {z}\), we assume that, for all *i* and *j*, the domains of \(z_i\) and \(z_j\) share only \(\emptyset \). This assumption is only made for notational convenience and does not restrict the applicability of our methods in any way.

In the dynamic sensor selection setting considered here, each action only selects *K* out of *N* sensors. The transition function is thus independent of the actions, as selecting sensors cannot change the state. However, as we outline in Sect. 7.4, it is possible to extend our results to general active perception POMDPs with arbitrary transition functions, that can model, e.g., mobile sensors that, by moving, change the state.
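The action representation described above can be illustrated with a short sketch. The helper names `action_sets` and `to_vector` are hypothetical; the code simply enumerates all sensor subsets of size at most *K* and converts a subset back to its binary feature vector.

```python
from itertools import combinations
from math import comb
from typing import FrozenSet, List, Tuple


def action_sets(n_sensors: int, k: int) -> List[FrozenSet[int]]:
    """All sensor subsets of size at most k, with sensors indexed 1..n_sensors."""
    sensors = range(1, n_sensors + 1)
    return [frozenset(c) for size in range(k + 1) for c in combinations(sensors, size)]


def to_vector(action_set: FrozenSet[int], n_sensors: int) -> Tuple[int, ...]:
    """Binary action features: entry i is 1 if and only if sensor i is selected."""
    return tuple(1 if i in action_set else 0 for i in range(1, n_sensors + 1))


A = action_sets(4, 2)
```

With four sensors and *K* = 2, this enumeration contains the empty set, four singletons and six pairs, which already hints at how quickly the action space grows with *N*.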

A challenge in these settings is properly formalizing the reward function. Because the goal is to reduce the uncertainty, reward is a direct function of the belief, not the state, i.e., the agent has no preference for one state over another, so long as it knows what that state is. Hence, there is no meaningful way to define a state-based reward function \(R(s,\mathbf {a})\). Directly defining \(\rho (b,\mathbf {a})\) using, e.g., negative *belief entropy*: \(- H_{b}(s) = \sum _s b(s) \log (b(s))\) results in a value function that is not piecewise-linear. Since \(\rho (b,\mathbf {a})\) is no longer a convex combination of a state-based reward function, it is no longer guaranteed to be PWLC, a property most POMDP solvers rely on. In the following subsections, we describe two recently proposed frameworks designed to address this problem.

### 4.1 \(\rho \)POMDPs

A \(\rho \)POMDP (Araya-López et al. 2010) is a POMDP whose reward \(\rho \) is defined directly on the belief. Here, \(\rho \) is represented by a set of vectors \(\varGamma _{\rho }\); in our setting it depends only on *b*, not on \(\mathbf {a}\), and can be written as \(\rho (b)\). Given \(\varGamma _{\rho }\), \(\rho (b)\) can be computed as: \(\rho (b) = \max _{\alpha \in \varGamma _{\rho }} \sum _{s} b(s)\alpha (s)\). If the true reward function is not PWLC, e.g., negative belief entropy, it can be approximated by defining \(\varGamma _{\rho }\) as a set of vectors, each of which is a tangent to the true reward function. Figure 3 illustrates approximating negative belief entropy with different numbers of tangents.
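The tangent construction can be sketched in a few lines. For negative entropy \(\sum _s b(s)\log b(s)\), the tangent hyperplane at an interior belief \(b_0\) works out to the vector \(\alpha (s) = \log b_0(s)\), so a set of tangent points directly yields \(\varGamma _{\rho }\). The code below is an illustration under that derivation, using natural logarithms; the function names and the chosen tangent points are assumptions, not the article's.

```python
import math
from typing import List

Vector = List[float]


def neg_entropy(b: Vector) -> float:
    """Negative belief entropy, with 0 * log 0 treated as 0."""
    return sum(p * math.log(p) for p in b if p > 0.0)


def tangent_alpha(b0: Vector) -> Vector:
    """Tangent to negative entropy at an interior belief b0: alpha(s) = log b0(s)."""
    return [math.log(p) for p in b0]


def rho(b: Vector, gamma_rho: List[Vector]) -> float:
    """PWLC reward: maximum over the tangent vectors of their inner product with b."""
    return max(sum(p * a for p, a in zip(b, alpha)) for alpha in gamma_rho)


# Hypothetical tangent points for a two-state problem.
gamma_rho = [tangent_alpha(b0) for b0 in ([0.1, 0.9], [0.5, 0.5], [0.9, 0.1])]
approx = rho([0.3, 0.7], gamma_rho)
exact = neg_entropy([0.3, 0.7])
```

Because negative entropy is convex, every tangent lies below it, so `approx` never exceeds `exact` and the two coincide at the tangent points; adding more tangent points tightens the approximation.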

Solving a \(\rho \)POMDP^{3} requires a minor change to the existing algorithms. In particular, since \(\varGamma _{\rho }\) is a set of vectors, instead of a single vector, an additional cross-sum is required to compute \(\varGamma _{t}^{\mathbf {a}}\): \(\varGamma _{t}^{\mathbf {a}} = \varGamma _{\rho } \oplus \varGamma _{t}^{\mathbf {a},\mathbf {z}_{1}} \oplus \varGamma _{t}^{\mathbf {a},\mathbf {z}_{2}} \oplus \dots \). Araya-López et al. (2010) showed that the error in the value function computed by this approach, relative to the true reward function, whose tangents were used to define \(\varGamma _{\rho }\), is bounded. However, the additional cross-sum increases the computational complexity of computing \(\varGamma _{t}^{\mathbf {a}}\) to \(\mathcal {O}(|S||A||\varGamma _{t-1}||\varOmega ||B||\varGamma _{\rho }|)\) with point-based methods.

Though \(\rho \)POMDP does not put any constraints on the definition of \(\rho \), we restrict the definition of \(\rho \) for an active perception POMDP to be a set of vectors ensuring that \(\rho \) is PWLC, which in turn ensures that the value function is PWLC. This is not a severe restriction because solving a \(\rho \)POMDP using *offline planning* requires a PWLC approximation of \(\rho \) anyway.

### 4.2 POMDPs with information rewards

Spaan et al. (2015) proposed *POMDPs with information rewards* (POMDP-IR), an alternative framework for modeling active perception tasks that relies only on the standard POMDP. Instead of directly rewarding low uncertainty in the belief, the agent is given the chance to make predictions about the hidden state and rewarded, via a standard state-based reward function, for making accurate predictions. Formally, a POMDP-IR is a POMDP in which each action \(\mathrm {a} \in A\) is a tuple \(\langle \mathbf {a}_n, a_p \rangle \) where \(\mathbf {a}_n \in A_n\) is a *normal action*, e.g., moving a robot or turning on a camera (in our case \(\mathbf {a}_{n}\) is \(\mathbf {a}\)), and \(a_p \in A_p\) is a *prediction action*, which expresses predictions about the state. The joint action space is thus the Cartesian product of \(A_n\) and \(A_p\), i.e., \(A = A_n \times A_p\).

While there are many ways to define \(A_p\) and *R*, a simple approach is to create one prediction action for each state, i.e., \(A_p = S\), and give the agent positive reward if and only if it correctly predicts the true state:

\[R(s, \langle \mathbf {a}_n, a_p \rangle ) = \begin{cases} 1, & \text {if } s = a_p, \\ 0, & \text {otherwise.} \end{cases}\]

Since the reward function is state-based, \(\rho \) can be defined as a convex combination of *R*, as in (1), guaranteeing a PWLC value function, as in a regular POMDP. Thus, a POMDP-IR can be solved with standard POMDP planners. However, the introduction of prediction actions leads to a blowup in the size of the joint action space \(|A| = |A_n||A_p|\) of POMDP-IR. Replacing |*A*| with \(|A_n||A_p|\) in the analysis yields a complexity of computing \(\varGamma _{t}^{\mathbf {a}}\) for POMDP-IR of \(\mathcal {O}(|S||A_n||\varGamma _{t-1}||\varOmega ||B||A_p|)\) for point-based methods.
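The effect of the one-prediction-per-state scheme can be checked with a small sketch. It assumes, for illustration only, that the positive reward for a correct prediction is 1 and that an incorrect prediction earns 0; the function names and the three-state belief are hypothetical. Under that assumption, the expected reward of predicting state \(a_p\) in belief *b* is simply \(b(a_p)\), so the best prediction earns \(\max _s b(s)\).

```python
from typing import Dict, Tuple


def prediction_reward(s: str, a_p: str) -> float:
    """Assumed reward: 1 for predicting the true state, 0 otherwise."""
    return 1.0 if s == a_p else 0.0


def best_prediction(b: Dict[str, float]) -> Tuple[str, float]:
    """Return the prediction action with the highest expected reward under b."""
    expected = {
        a_p: sum(b[s] * prediction_reward(s, a_p) for s in b)
        for a_p in b  # one prediction action per state, i.e. A_p = S
    }
    best = max(expected, key=expected.get)
    return best, expected[best]


guess, expected_reward = best_prediction({"A": 0.2, "B": 0.7, "C": 0.1})
```

The resulting expected reward is a maximum over linear functions of the belief, which is why beliefs with low uncertainty are indirectly rewarded while the value function stays PWLC.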

Note that, though not made explicit by Spaan et al. (2015), several independence properties are inherent to the POMDP-IR framework, as shown in Fig. 4. Specifically, the two important properties are (a) in our setting the reward function is independent of the normal actions; (b) the transition and the observation function are independent of the prediction actions. Although POMDP-IR can model *hybrid rewards*, where in addition to prediction actions, normal actions can reward the agent as well (Spaan et al. 2015), in this article, because we focus on pure active perception, the reward function *R* is independent of the normal actions. Furthermore, state transitions and observations are independent of the prediction actions. In Sect. 6, we introduce a new technique to show that these independence properties can be exploited to solve a POMDP-IR much more efficiently and thus avoid the blowup in the size of the action space caused by the introduction of the prediction actions. Although the reward function in our setting is independent of the normal actions, the main results we present in this article are not dependent on this property and can be easily extended or applied to cases where the reward is dependent on the normal actions.

## 5 \(\rho \)POMDP and POMDP-IR equivalence

\(\rho \)POMDP and POMDP-IR offer two perspectives on modeling active perception tasks. \(\rho \)POMDP starts from a “true” belief-based reward function such as the negative entropy and then seeks to find a PWLC approximation via a set of tangents to the curve. In contrast, POMDP-IR starts from the queries that the user of the system will pose, e.g., “What is the position of everyone in the room?” or “How many people are in the room?” and creates prediction actions that reward the agent for correctly answering such queries. In this section we establish the relationship between these two frameworks by proving the *equivalence* of \(\rho \)POMDP and POMDP-IR. By equivalence of \(\rho \)POMDP and POMDP-IR, we mean that given a \(\rho \)POMDP and a policy, we can construct a corresponding POMDP-IR and a policy such that the value function for both policies is exactly the same. We show this equivalence by starting with a \(\rho \)POMDP and a policy and introducing a *reduction* procedure for both the \(\rho \)POMDP and the policy (and vice-versa). Using the reduction procedure, we reduce the \(\rho \)POMDP to a POMDP-IR and the policy for the \(\rho \)POMDP to an equivalent policy for the POMDP-IR. We then show that the value function \(V^{\pi }_{t}\) for the \(\rho \)POMDP we started with and the reduced POMDP-IR is the same for the given and the reduced policy. To complete our proof, we repeat the same process by starting with a POMDP-IR and then reducing it to a \(\rho \)POMDP. We show that the value function \(V^{\pi }_{t}\) for the POMDP-IR and the corresponding \(\rho \)POMDP is the same.

### Definition 1

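Given a \(\rho \)POMDP \(\mathbf {M}_{\rho }\), the reduction constructs a POMDP-IR \(\mathbf {M}_{ IR }\) as follows.

The set of states, set of observations, initial belief and horizon remain unchanged. Since the set of states remains unchanged, the set of all possible beliefs is also the same for \(\mathbf {M}_{ IR }\) and \(\mathbf {M}_{\rho }\).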

The set of normal actions in \(\mathbf {M}_{ IR }\) is equal to the set of actions in \(\mathbf {M}_{\rho }\), i.e., \(A_{n, IR } = A_{\rho }\).

The set of prediction actions \(A_{p, IR }\) in \(\mathbf {M}_{ IR }\) contains one prediction action for each \(\alpha _{\rho }^{a_p} \in \varGamma _{\rho }\).

The transition and observation functions in \(\mathbf {M}_{ IR }\) behave the same as in \(\mathbf {M}_{\rho }\) for each \(\mathbf {a}_n\) and ignore the \(a_p\), i.e., for all \(\mathbf {a}_n \in A_{n,{ IR}}\): \(T_{ IR }(s, \mathbf {a}_n, s') = T_{\rho }(s,\mathbf {a},s')\) and \(O_{ IR }(s', \mathbf {a}_n, \mathbf {z}) = O_{\rho }(s',\mathbf {a},\mathbf {z})\), where \(\mathbf {a} \in A_{\rho }\) corresponds to \(\mathbf {a}_n\).

The reward function in \(\mathbf {M}_{ IR }\) is defined such that \(\forall a_p \in A_p\), \(R_{ IR }(s,a_p) = \alpha _{\rho }^{a_p}(s)\), where \(\alpha _{\rho }^{a_p}\) is the \(\alpha \)-vector corresponding to \(a_p\).
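The reward part of this reduction can be illustrated with a short sketch. It covers only the construction of prediction actions and the state-based reward from \(\varGamma _{\rho }\), and it assumes that the prediction action is chosen to maximize expected immediate reward; the function names and the example vectors are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def reduce_reward(gamma_rho: List[Vector]) -> Dict[int, Vector]:
    """Create one prediction action per alpha vector, with R_IR(s, a_p) = alpha^{a_p}(s)."""
    return {a_p: list(alpha) for a_p, alpha in enumerate(gamma_rho)}


def rho(b: Vector, gamma_rho: List[Vector]) -> float:
    """Belief-based reward of the rho-POMDP: max over alpha vectors."""
    return max(sum(p * a for p, a in zip(b, alpha)) for alpha in gamma_rho)


def best_expected_ir_reward(b: Vector, R_ir: Dict[int, Vector]) -> float:
    """Expected state-based reward of the best prediction action in belief b."""
    return max(sum(p * r for p, r in zip(b, R_ir[a_p])) for a_p in R_ir)


# Hypothetical Gamma_rho with three vectors over two states.
gamma_rho = [[0.0, -2.0], [-0.7, -0.7], [-2.0, 0.0]]
R_ir = reduce_reward(gamma_rho)
belief = [0.8, 0.2]
```

For any belief, `rho(belief, gamma_rho)` and `best_expected_ir_reward(belief, R_ir)` evaluate the same maximum over the same linear functions, which is the one-step intuition behind the equivalence.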

### Definition 2

*b*,

Using these definitions, we prove that solving \(\mathbf {M}_{\rho }\) is the same as solving \(\mathbf {M}_{ IR }\).

### Theorem 1

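Let \(\mathbf {M}_{\rho }\) be a \(\rho \)POMDP with policy \(\pi _{\rho }\), and let \(\mathbf {M}_{ IR }\) and \(\pi _{ IR }\) be the POMDP-IR and policy obtained by reducing them. Then, for all beliefs *b*, \(V_{t}^{ IR }(b) = V_{t}^{\rho }(b)\), where \(V_{t}^{ IR }\) is the *t*-step value function for \(\pi _{ IR }\) and \(V_{t}^{\rho }\) is the *t*-step value function for \(\pi _{\rho }\).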

### Proof

See Appendix. \(\square \)

### Definition 3

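Given a POMDP-IR \(\mathbf {M}_{ IR }\), the reduction constructs a \(\rho \)POMDP \(\mathbf {M}_{\rho }\) as follows.

The set of states, set of observations, initial belief and horizon remain unchanged. Since the set of states remains unchanged, the set of all possible beliefs is also the same for \(\mathbf {M}_{ IR }\) and \(\mathbf {M}_{\rho }\).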

The set of actions in \(\mathbf {M}_{\rho }\) is equal to the set of normal actions in \(\mathbf {M}_{ IR }\), i.e., \(A_{\rho } = A_{n, IR }\).

The transition and observation functions in \(\mathbf {M}_{\rho }\) behave the same as in \(\mathbf {M}_{ IR }\) for each \(\mathbf {a}_{n}\) and ignore the \(a_p\), i.e., for all \(\mathbf {a} \in A_{\rho }\): \(T_{\rho }(s,\mathbf {a},s') = T_{ IR }(s,\mathbf {a}_n,s')\) and \(O_{\rho }(s',\mathbf {a},\mathbf {z}) = O_{ IR }(s', \mathbf {a}_n, \mathbf {z})\) where \(\mathbf {a}_n \in A_{n, IR }\) is the action corresponding to \(\mathbf {a} \in A_{\rho }\).

The \(\varGamma _{\rho }\) in \(\mathbf {M}_{\rho }\) is defined such that, for each prediction action in \(A_{p, IR }\), there is a corresponding \(\alpha \) vector in \(\varGamma _{\rho }\), i.e., \(\varGamma _{\rho } = \{\alpha _{\rho }^{a_p}(s) : \alpha _{\rho }^{a_p}(s) = R(s,a_{p}) \text{ for } \text{ each } a_{p} \in A_{p, IR } \}\). Consequently, by definition, \(\rho \) is defined as: \(\rho (b) = \max _{\alpha _{\rho }^{a_p}}[\sum _{s}b(s)\alpha _{\rho }^{a_p}(s)]\).

### Definition 4

*b*,

### Theorem 2

*b*,

### Proof

See Appendix. \(\square \)

The main implication of these theorems is that any result that holds for either \(\rho \)POMDP or POMDP-IR also holds for the other framework. For example, the results presented in Theorem 4.3 in Araya-López et al. (2010) that bound the error in the value function of \(\rho \)POMDP also hold for POMDP-IR. Furthermore, with this equivalence, the computational complexity of solving \(\rho \)POMDP and POMDP-IR comes out to be the same, since POMDP-IR can be converted into \(\rho \)POMDP (and vice-versa) trivially, without any significant blow-up in representation. Although we have proved the equivalence of \(\rho \)POMDP and POMDP-IR only for pure active perception tasks where the reward is solely conditioned on the belief, it is straightforward to extend it to hybrid active perception tasks, where the reward is conditioned both on the belief and the state. Although the resulting active perception POMDP for dynamic sensor selection is such that the action does not affect the state, the results from this section do not use that property at all and thus are valid for active perception POMDPs where an agent might take an action that affects the state in the next time step.

## 6 Decomposed maximization for POMDP-IR

The POMDP-IR framework enables us to formulate uncertainty as an objective, but it does so at the cost of additional computations, as adding prediction actions enlarges the action space. The computational complexity of performing a point-based backup for solving POMDP-IR is \(\mathcal {O}(|S|^2|A_n||A_p||\varOmega ||\varGamma _{t-1}|) + \mathcal {O}(|S||B||A_n||\varGamma _{t-1}||\varOmega ||A_p|)\). In this section, we present a new technique that exploits the independence properties of POMDP-IR, mainly that the transition function and the observation function are independent of the prediction actions, to reduce the computational costs. We also show that the same principle is applicable to \(\rho \)POMDPs.

At each timestep *t*, this approach generates \(|A_n||\varOmega ||\varGamma _{t-1}| + |A_p|\) backprojections in \(\mathcal {O}(|S|^{2}|A_n||\varOmega ||\varGamma _{t-1}| + |S||A_p|)\) time and then prunes them to |*B*| vectors, with a computational complexity of \(\mathcal {O}(|S||B|(|A_p| + |A_n||\varGamma _{t-1}||\varOmega |))\).
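To get a feel for the size of the saving, the two complexity expressions above can be evaluated directly. The sketch below simply plugs numbers into the stated operation counts (ignoring constant factors); the problem sizes are hypothetical and not taken from the experiments.

```python
def naive_backup_cost(S, A_n, A_p, Omega, Gamma, B):
    """Operation count of a point-based backup over joint (normal, prediction) actions."""
    backprojection = S**2 * A_n * A_p * Omega * Gamma
    maximization = S * B * A_n * Gamma * Omega * A_p
    return backprojection + maximization


def decomposed_backup_cost(S, A_n, A_p, Omega, Gamma, B):
    """Operation count when prediction and normal actions are maximized separately."""
    backprojection = S**2 * A_n * Omega * Gamma + S * A_p
    maximization = S * B * (A_p + A_n * Gamma * Omega)
    return backprojection + maximization


# Hypothetical problem sizes, chosen only for illustration.
sizes = dict(S=100, A_n=10, A_p=100, Omega=10, Gamma=50, B=200)
naive = naive_backup_cost(**sizes)
decomposed = decomposed_backup_cost(**sizes)
print(f"naive: {naive:.3e}  decomposed: {decomposed:.3e}  ratio: {naive / decomposed:.1f}")
```

In the naive count, \(|A_p|\) multiplies every term, whereas in the decomposed count it only appears additively, so the gap widens as the number of prediction actions grows.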

## 7 Greedy PBVI

The previous sections allow us to model the active perception task efficiently, such that the PWLC property of the value function is maintained. Thus, we can now directly employ traditional POMDP solvers that exploit this property to compute the optimal value function \(V^{*}_{t}\). However, while point-based methods scale better in the size of the state space, they are still not practical for our needs, as they do not scale in the size of the action space of active perception POMDPs.

While the computational complexity of one iteration of PBVI is linear in the size of the action space |*A*| of a POMDP, for an active perception POMDP, the action space is modeled as selecting *K* out of the *N* available sensors, yielding |*A*| = \(\left( {\begin{array}{c}N\\ K\end{array}}\right) \). For fixed *K*, as the number of sensors *N* grows, the size of the action space and the computational cost of PBVI grow exponentially with it, making the use of traditional POMDP solvers infeasible for solving active perception POMDPs.
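As a quick illustration of how fast the action space grows, the number of size-*K* sensor subsets can be computed with the standard library. The values of *N* and *K* below are illustrative (the first two pairs match settings mentioned in the experiments; the others are hypothetical).

```python
from math import comb


def action_space_size(n_sensors, k):
    """Number of ways to select exactly k out of n_sensors sensors."""
    return comb(n_sensors, k)


for n_sensors, k in [(5, 2), (11, 3), (20, 3), (50, 10)]:
    print(f"N={n_sensors:2d}, K={k:2d}: |A| = {action_space_size(n_sensors, k)}")
```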

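We propose *greedy PBVI*, a new point-based planner for solving active perception POMDPs which scales much better in the size of the action space. To facilitate the explication of greedy PBVI, we now present the final step of PBVI, described earlier in (7) and (8), in a different way. For each \(b \in B\), and \(\mathfrak {a} \in A\), we must find the best \(\alpha ^{\mathfrak {a}}_{b} \in \varGamma ^{\mathfrak {a}}_{t}\), i.e., the vector that attains the highest value at *b*, and record that value as \(Q(b,\mathfrak {a}) = \sum _{s} \alpha ^{\mathfrak {a}}_{b}(s)\,b(s)\). Then, for each *b* we find the best vector across all actions: \(\alpha _{b} = \alpha ^{\mathfrak {a}^{*}}_{b}\), where

\[ \mathfrak {a}^{*} = \mathop {\mathrm {arg\,max}}_{\mathfrak {a} \in A} Q(b,\mathfrak {a}). \qquad (17) \]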

Greedy PBVI builds on *greedy maximization* (Nemhauser et al. 1978), an algorithm that operates on a set function \(Q : 2^{X} \rightarrow \mathbb {R}\). Greedy maximization is much faster than full maximization as it avoids going over the \(\left( {\begin{array}{c}N\\ K\end{array}}\right) \) choices and instead constructs a subset of *K* elements iteratively. Thus, we replace the maximization operator in the Bellman optimality equation with greedy maximization. Algorithm 1 shows the argmax variant, which constructs a subset \(Y \subseteq X\) of size *K* by iteratively adding elements of *X* to *Y*. At each iteration, it adds the element that maximally increases \(Q\), as measured by the *marginal gain* \(\Delta _{Q}(e|\mathfrak {a})\) of adding a sensor *e* to a subset of sensors \(\mathfrak {a}\):

\[ \Delta _{Q}(e|\mathfrak {a}) = Q(\mathfrak {a} \cup \{e\}) - Q(\mathfrak {a}). \]

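To use greedy maximization within PBVI, we must replace an argmax over *A* with \(\mathtt {greedy}\hbox {-}\mathtt {argmax}\). Our alternative description of PBVI above makes this straightforward: (17) contains such an argmax and \(Q(b,\cdot )\) has been intentionally formulated to be a set function over \(A^{+}\). Thus, implementing greedy PBVI requires only replacing the argmax in (17) with \(\mathtt {greedy}\hbox {-}\mathtt {argmax}\), which yields the greedy action \(\mathfrak {a}^{G}\) in place of \(\mathfrak {a}^{*}\).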

Using point-based methods as a starting point is essential to our approach. Algorithms like Monahan’s enumeration algorithm (Monahan 1982) that rely on pruning operations to compute \(V^{*}\) instead of performing an explicit argmax, cannot directly use \(\mathtt {greedy}\hbox {-}\mathtt {argmax}\). Thus, it is precisely because PBVI operates on a finite set of beliefs that an explicit argmax is performed, opening the door to using \(\mathtt {greedy}\hbox {-}\mathtt {argmax}\) instead.
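The greedy-argmax procedure described above can be sketched in a few lines. This is a minimal illustration rather than a reproduction of Algorithm 1: the function names, the camera layout and the coverage-style utility are all hypothetical stand-ins. In greedy PBVI the set function would be \(Q(b,\cdot )\) for a fixed belief *b*.

```python
from itertools import combinations


def greedy_argmax(q, ground_set, k):
    """Build a subset of at most k elements by repeatedly adding the
    element with the largest marginal gain q(Y + e) - q(Y)."""
    selected = frozenset()
    for _ in range(k):
        candidates = [e for e in ground_set if e not in selected]
        if not candidates:
            break
        base = q(selected)
        best = max(candidates, key=lambda e: q(selected | {e}) - base)
        selected = selected | {best}
    return selected


def full_argmax(q, ground_set, k):
    """Exhaustive maximization over all subsets of size k, for comparison."""
    return max((frozenset(c) for c in combinations(ground_set, k)), key=q)


# Hypothetical stand-in for Q(b, .): number of distinct cells covered.
CAMERA_CELLS = {
    "cam0": {1, 2, 3},
    "cam1": {3, 4},
    "cam2": {4, 5, 6, 7},
    "cam3": {1, 7},
}


def coverage(subset):
    covered = set()
    for cam in subset:
        covered |= CAMERA_CELLS[cam]
    return len(covered)


greedy_choice = greedy_argmax(coverage, list(CAMERA_CELLS), 2)
optimal_choice = full_argmax(coverage, list(CAMERA_CELLS), 2)
print(sorted(greedy_choice), coverage(greedy_choice), coverage(optimal_choice))
```

The greedy variant evaluates the set function roughly \(NK\) times instead of once per subset of size *K*, which is where the scalability in the action space comes from.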

### 7.1 Bounds given submodular value function

In the following subsections, we present the highlights of the theoretical guarantees associated with greedy PBVI. The detailed analysis can be found in the appendix. Specifically, we show that a value function computed by greedy PBVI is guaranteed to have bounded error with respect to the optimal value function under *submodularity*, a property of set functions that formalizes the notion of diminishing returns. Then, we establish the conditions under which the value function of a POMDP is guaranteed to be submodular. To establish the submodularity of the value function, we define \(\rho (b)\) as the negative belief entropy, \(\rho (b) = -H_{b}(s)\). Both \(\rho \)POMDP and POMDP-IR approximate \(\rho (b)\) with tangents. Thus, in the last subsection, we show that even if belief entropy is approximated using tangents, the value function computed by greedy PBVI is guaranteed to have bounded error with respect to the optimal value function.

### Theorem 3

*b*,

Theorem 3 gives a bound only for a single application of \(\mathtt {greedy}\hbox {-}\mathtt {argmax}\), not for applying it within each backup, as greedy PBVI does.
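For reference, the classical single-application guarantee of Nemhauser et al. (1978) can be written in generic notation as follows. This is a statement of the well-known result for an arbitrary set function, not a verbatim restatement of Theorem 3; the symbols \(Y^{G}\) and \(Y^{*}\) are introduced here only for illustration.

```latex
% Assumptions: Q : 2^X -> R is non-negative, monotone and submodular;
% Y^G is the size-K set returned by greedy maximization and
% Y^* is an optimal set of size at most K.
\[
  Q(Y^{G}) \;\ge\; \Bigl(1 - \bigl(1 - \tfrac{1}{K}\bigr)^{K}\Bigr)\, Q(Y^{*})
           \;\ge\; \bigl(1 - e^{-1}\bigr)\, Q(Y^{*})
           \;\approx\; 0.632\, Q(Y^{*}).
\]
```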

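Let the *greedy Bellman operator* \(\mathfrak {B}^{G}\) be the Bellman operator in which the maximization over actions is performed by \(\mathtt {greedy}\hbox {-}\mathtt {argmax}\), so that \((\mathfrak {B}^{G}V^{\pi }_{t-1})(b) = Q^\pi _t(b,\mathfrak {a}^G)\).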

### Corollary 1

*b*,

### Proof

From Theorem 3 since \((\mathfrak {B}^{G}V^{\pi }_{t-1})(b) = Q^\pi _t(b,\mathfrak {a}^G)\) and \((\mathfrak {B}^{*}V^{\pi }_{t-1})(b) = Q^\pi _t(b,\mathfrak {a}^*)\). \(\square \)

Next, we define the *greedy Bellman equation*: \(V^{G}_{t}(b) = (\mathfrak {B}^{G}V^{G}_{t-1})(b)\), where \(V^{G}_{0} = \rho (b)\). Note that \(V^{G}_{t}\) is the true value function obtained by greedy maximization, without any point-based approximations. Using Corollary 1, we can bound the error of \(V^G\) with respect to \(V^*\).

### Theorem 4

*b*,

### Proof

See Appendix. \(\square \)

Theorem 4 extends Nemhauser’s result to a full sequential decision-making setting in which multiple applications of greedy maximization are employed over multiple time steps. This theorem gives a theoretical guarantee on the performance of greedy PBVI. Given a POMDP with a submodular value function, greedy PBVI is guaranteed to have bounded error with respect to the optimal value function. Moreover, this performance comes at a computational cost that is much less than that of solving the same POMDP with traditional solvers. Thus, greedy PBVI scales much better in the size of the action space of active perception POMDPs, while still retaining bounded error.

The results presented in this subsection are applicable only if the value function for a POMDP is submodular. In the following subsections, we establish the submodularity of the value function for the active perception POMDP under certain conditions.

### 7.2 Submodularity of value functions

The previous subsection showed that the value function computed by greedy PBVI is guaranteed to have bounded error as long as it is non-negative, monotone and submodular. In this subsection, we establish sufficient conditions for these properties to hold. Specifically, we show that, if the belief-based reward is the shifted negative entropy, i.e., \(\rho (b) = -H_{b}(s) + \log |S|\), then under certain conditions \(Q^{\pi }_{t}(b,\mathfrak {a})\) is submodular, non-negative and monotone, as required by Theorem 4. The second term, \(\log |S|\), is only required (and is sufficient) to guarantee non-negativity, since \(H_{b}(s) \le \log |S|\) for every belief; it is independent of the actual beliefs and actions. For the sake of conciseness, in the remainder of this paper we omit this term.

We first consider the expected reward at *k* steps to go, conditioned on the belief and action with *t* steps to go and assuming policy \(\pi \) is followed after timestep *t*. Here, \(\mathfrak {z}^{t:k}\) denotes the observations received in the interval from *t* steps to go to *k* steps to go, \(b^{t}\) is the belief at *t* steps to go, \(\mathfrak {a}^{t}\) is the action taken at *t* steps to go, and \(\rho (b^{k}) = -H_{b^k}(s^{k})\), where \(s^{k}\) is the state at *k* steps to go. To show that \(Q^{\pi }_{t}(b,\mathfrak {a})\) is submodular, the main condition is *conditional independence*, as defined below:

### Definition 5

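The observation \(\mathfrak {z}\) is conditionally independent given *s* if any pair of observation features are conditionally independent given the state, i.e., \(\Pr (z_{i}, z_{j} \mid s) = \Pr (z_{i} \mid s)\Pr (z_{j} \mid s)\) for all \(i \ne j\).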

Using the above definition, the submodularity of \(Q(b,\mathfrak {a})\) can be established as:

### Theorem 5

If \(\mathfrak {z}^{t:k}\) is conditionally independent given \({s}^{k}\) and \(\rho (b) = - H_b(s)\), then \(Q^{\pi }_{t}(b,\mathfrak {a})\) is submodular in \(\mathfrak {a}\), for all \(\pi \).

### Proof

See Appendix. \(\square \)
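The role of conditional independence can be checked numerically on a toy problem. The sketch below is a one-step illustration only, not the multi-step argument of the proof: it assumes a hypothetical three-state belief and three binary sensors whose readings are independent given the state, computes the expected negative posterior entropy for every sensor subset, and tests the diminishing-returns inequality by brute force. All names and numbers are assumptions made for the example.

```python
from itertools import combinations, product
from math import log

PRIOR = [0.5, 0.3, 0.2]  # hypothetical belief b(s)
# P_ONE[i][s] = Pr(z_i = 1 | s); sensors are independent given s (hypothetical values)
P_ONE = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.2], [0.1, 0.3, 0.7]]


def neg_posterior_entropy(sensors):
    """Expected negative entropy of s after observing the chosen sensors."""
    sensors = sorted(sensors)
    total = 0.0
    for z in product([0, 1], repeat=len(sensors)):
        joint = []  # joint[s] = Pr(s, z)
        for s, p_s in enumerate(PRIOR):
            lik = 1.0
            for i, z_i in zip(sensors, z):
                lik *= P_ONE[i][s] if z_i else 1.0 - P_ONE[i][s]
            joint.append(p_s * lik)
        p_z = sum(joint)
        total += sum(j * log(j / p_z) for j in joint if j > 0.0)
    return total


def is_submodular(q, ground_set, tol=1e-12):
    """Brute-force check: gain of e given a small set >= gain given any superset."""
    elems = list(ground_set)
    subsets = [frozenset(c) for r in range(len(elems) + 1)
               for c in combinations(elems, r)]
    for small in subsets:
        for large in subsets:
            if not small <= large:
                continue
            for e in elems:
                if e in large:
                    continue
                gain_small = q(small | {e}) - q(small)
                gain_large = q(large | {e}) - q(large)
                if gain_small < gain_large - tol:
                    return False
    return True


print(is_submodular(neg_posterior_entropy, range(3)))
```

Brute-force checking is only feasible for a handful of sensors, but it is a convenient sanity test when designing observation models for which the conditional-independence assumption may or may not hold.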

### Theorem 6

*b*,

### Proof

See Appendix. \(\square \)

In this subsection we showed that if the immediate belief-based reward \(\rho (b)\) is defined as negative belief entropy, then the value function of an active perception POMDP is guaranteed to be submodular under certain conditions. However, as mentioned earlier, to solve active perception POMDPs, we approximate the belief entropy with tangent vectors. This might interfere with the submodularity of the value function. In the next subsection, we show that, even though the PWLC approximation of belief entropy might interfere with the submodularity of the value function, the value function computed by greedy PBVI is still guaranteed to have bounded error.

### 7.3 Bounds given approximated belief entropy

While Theorem 6 bounds the error in \(V^{G}_{t}(b)\), it does so only on the condition that \(\rho (b) = - H_b(s)\). However, as discussed earlier, our definition of active perception POMDPs instead defines \(\rho \) using a set of vectors \(\varGamma ^{\rho } = \{\alpha ^{\rho }_1,\ldots ,\alpha ^\rho _m\}\), each of which is a tangent to \(- H_{b}({s})\), as suggested by Araya-López et al. (2010), in order to preserve the PWLC property. While this can interfere with the submodularity of \(Q^{\pi }_{t}(b,\mathfrak {a})\), here we show that the error generated by this approximation is still bounded in this case.
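The tangent construction can be made concrete with a short sketch. On the probability simplex, the tangent hyperplane to \(-H_{b}(s) = \sum _{s} b(s)\log b(s)\) at an interior reference belief \(b'\) reduces to the vector \(\alpha (s) = \log b'(s)\), so the PWLC reward is the maximum over such vectors and never exceeds the true negative entropy. The reference beliefs and function names below are hypothetical.

```python
from math import log


def neg_entropy(b):
    """Exact negative entropy -H_b(s) = sum_s b(s) log b(s)."""
    return sum(p * log(p) for p in b if p > 0.0)


def tangent_vector(b_ref):
    """Tangent to -H at an interior belief b_ref: alpha(s) = log b_ref(s)."""
    return [log(p) for p in b_ref]


def pwlc_reward(b, tangents):
    """PWLC approximation: maximum over the tangent vectors of alpha . b."""
    return max(sum(a * p for a, p in zip(alpha, b)) for alpha in tangents)


# Hypothetical set of belief points at which tangents are drawn.
REF_BELIEFS = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [1 / 3, 1 / 3, 1 / 3]]
GAMMA_RHO = [tangent_vector(b_ref) for b_ref in REF_BELIEFS]

b = [0.6, 0.3, 0.1]
print(neg_entropy(b), pwlc_reward(b, GAMMA_RHO))
```

The approximation is exact at the reference beliefs and a lower bound elsewhere; adding more reference beliefs tightens it at the cost of a larger \(\varGamma ^{\rho }\).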

The resulting error bound depends on the *density* of the set of belief points at which tangents are drawn to the belief entropy, and on a constant *C*.

### Theorem 7

For all beliefs, the error between \(\tilde{V}^{G}_{t}(b)\) and \(\tilde{V}^{*}_{t}(b)\) is bounded, if \(\rho (b) = -H_{b}(s)\), and \(\mathfrak {z}^{t:k}\) is conditionally independent given \(s^{k}\).

### Proof

See Appendix. \(\square \)

In this subsection we showed that if the negative entropy is approximated using tangent vectors, greedy PBVI still computes a value function that has bounded error. In the next subsection we outline how greedy PBVI can be extended to general active perception tasks.

### 7.4 General active perception POMDPs

The results presented in this section apply to the active perception POMDP in which the evolution of the state over time is independent of the actions of the agent. Here, we outline how these results can be extended to general active perception POMDPs without many changes. The main application for such an extension is in tasks involving a mobile robot coordinating with sensors to intelligently take actions to perceive its environment. In such cases, the robot’s actions, by causing it to move, can change the state of the world.

The algorithms we proposed can be extended to such settings by making small modifications to the greedy maximization operator. The greedy algorithm can be run for \(K+1\) iterations and in each iteration the algorithm would choose to add either a sensor (only if fewer than *K* sensors have been selected), or a movement action (if none has been selected so far). Formally, using the work of Fisher et al. (1978), which extends that of Nemhauser et al. (1978) on submodularity to combinatorial structures such as *matroids*, the action space of a POMDP involving a mobile robot can be modeled as a *partition matroid* and greedy maximization subject to matroid constraints (Fisher et al. 1978) can be used to maximize the value function approximately.
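The modified greedy operator outlined above can be sketched as greedy selection under a partition matroid: the ground set is split into parts (here sensors and movement actions), each with its own capacity, and at every iteration the feasible element with the largest marginal gain is added. The names, the capacities and the coverage-style utility are hypothetical; no claim is made that this matches any particular implementation.

```python
def greedy_partition_matroid(q, parts, capacities):
    """Greedy maximization of q subject to per-part capacities.

    parts:      dict mapping part name -> list of elements
    capacities: dict mapping part name -> maximum number of picks
    """
    selected = frozenset()
    used = {name: 0 for name in parts}
    for _ in range(sum(capacities.values())):
        feasible = [
            (name, e)
            for name, elems in parts.items() if used[name] < capacities[name]
            for e in elems if e not in selected
        ]
        if not feasible:
            break
        base = q(selected)
        name, best = max(feasible, key=lambda item: q(selected | {item[1]}) - base)
        selected = selected | {best}
        used[name] += 1
    return selected


# Hypothetical utility: number of distinct cells observed.
CELLS = {
    "cam0": {1, 2}, "cam1": {2, 3}, "cam2": {4},
    "left": {5, 6, 7}, "right": {1, 5},
}


def coverage(subset):
    covered = set()
    for name in subset:
        covered |= CELLS[name]
    return len(covered)


PARTS = {"sensor": ["cam0", "cam1", "cam2"], "move": ["left", "right"]}
CAPACITIES = {"sensor": 2, "move": 1}  # K = 2 sensors plus one movement action
choice = greedy_partition_matroid(coverage, PARTS, CAPACITIES)
print(sorted(choice), coverage(choice))
```

With \(K\) sensors and one movement action, the loop runs for \(K+1\) iterations, matching the description in the text.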

The guarantees associated with greedy maximization subject to matroid constraints (Fisher et al. 1978) can then be used to bound the error of greedy PBVI. However, deriving exact theoretical guarantees for greedy PBVI for such tasks is beyond the scope of this article. Assuming that the reward function is still defined as the negative belief entropy, the submodularity of such POMDPs still holds under the conditions mentioned in Sect. 7.2.

In this section, we presented greedy PBVI, which uses greedy maximization to improve the scalability in the action space of an active perception POMDP. We also showed that, if the value function of an active perception POMDP is submodular, then greedy PBVI computes a value function that is guaranteed to have bounded error. We established that if the belief-based reward is defined as the negative belief entropy, then the value function of an active perception POMDP is guaranteed to be submodular. We showed that if the negative belief entropy is approximated by tangent vectors, as is required to solve active perception POMDPs efficiently, greedy PBVI still computes a value function that has bounded error. Finally, we outlined how greedy PBVI and the associated theoretical bounds can be extended to general active perception POMDPs.

## 8 Experiments

We compare the performance of POMDP-IR with decomposed maximization to a naive POMDP-IR that does not decompose the maximization. Thanks to Theorems 1 and 2, these approaches have performance equivalent to their \(\rho \)POMDP counterparts. We also compare against two baselines. The first is a weak baseline we call the *rotate policy* in which the agent simply keeps switching between cameras on a turn-by-turn basis. The second is a stronger baseline we call the *coverage policy*, which was developed in earlier work on active perception (Spaan 2008; Spaan and Lima 2009). The coverage policy is obtained after solving a POMDP that rewards the agent for observing the person, i.e., the agent is encouraged to select the cameras that are most likely to generate positive observations. Thanks to the decomposed maximization, the computational cost of solving for the coverage policy and belief-based rewards is the same.

### 8.1 Simulated setting

We start with experiments conducted in a simulated setting, first considering the task of tracking a single person with a multi-camera system and then considering the more challenging task of tracking multiple people.

#### 8.1.1 Single-person tracking

We start by considering the task of tracking one person walking in a grid-world composed of |*S*| cells and *N* cameras as shown in Fig. 5. At each timestep, the agent can select only *K* cameras, where \(K \le N\). Each selected camera generates a noisy observation of the person’s location. The agent’s goal is to minimize its uncertainty about the person’s state. In the experiments in this section, we fixed \(K = 1\) and \(N=10\). The problem setup and the POMDP model are shown and described in Fig. 5.

Figures 6c, d illustrate the qualitative difference between POMDP-IR and the coverage policy. The blue lines mark the points in the trajectory at which the agent selected the camera that observes the person’s location; if the selected camera does not cover the person’s location, no blue line is drawn at that point. The agent has to select one out of *N* cameras and does not have the option of selecting none. The red line plots the max of the agent’s belief. The main difference between the two policies is that once POMDP-IR gets a good estimate of the state, it proactively observes neighboring cells to which the person might transition. This helps it to find the person more quickly when she moves. By contrast, the coverage policy always looks at the cell where it believes her to be. Hence, it takes longer to find her again when she moves. This is evidenced by the fluctuations in the max of the belief, which often drops below 0.5 for the coverage policy but rarely does so for POMDP-IR.

Next, we compare the performance of POMDP-IR to a myopic variant that seeks only to maximize immediate reward, i.e., \(h=1\). We perform this comparison in three variants of the task. In the *highly static* variant, the state changes very slowly: the probability of staying in the same state is 0.9. In the *moderately dynamic* variant, the state changes more frequently, with a same-state transition probability of 0.7. In the *highly dynamic* variant, the state changes rapidly, with a same-state transition probability of 0.5. Figure 8 (top) shows the results of these comparisons. In each setting, non-myopic POMDP-IR outperforms myopic POMDP-IR. In the highly static variant, the difference is marginal. However, as the task becomes more dynamic, the importance of look-ahead planning grows. Because the myopic planner focuses only on immediate reward, it ignores what might happen to its belief when the state changes, which happens more often in dynamic settings.

We also compare the performance of myopic and non-myopic planning in a *budget-constrained* environment. This specifically corresponds to an energy constrained environment, where cameras can be employed only a few times over the entire trajectory. This is augmented with resource constraints, so that the agent has to plan not only when to use the cameras, but also decide which camera to select. Specifically, the agent can only employ the multi-camera system a total of 15 times across all 50 timesteps and the agent can select which camera (out of the multi-camera system) to employ at each of the 15 instances. On the other timesteps, it must select an action that generates only a null observation. Figure 8 (bottom) shows that non-myopic planning is of critical importance in this setting. Whereas myopic planning greedily consumes the budget as quickly as possible, thus earning more reward in the beginning, non-myopic planning saves the budget for situations in which it is highly uncertain about the state.

Performance is quantified as the total number of times the correct location of the person is predicted by the system. Figure 9, which shows the performance of myopic and non-myopic policies for this task, demonstrates that when planning non-myopically the agent is able to utilize the accurate sensors more effectively than when planning myopically.

#### 8.1.2 Multi-person tracking

Because |*S*| grows exponentially in the number of people, the resulting POMDP quickly becomes intractable. Therefore, we compute instead a factored value function \(V_{t}(b) = \sum _{i} V_{t}^{i}(b^{i})\), where \(V_{t}^{i}(b^{i})\) is the value of the agent’s current belief \(b^{i}\) about the *i*-th person. Thus, \(V_{t}^{i}(b^{i})\) needs to be computed only once, by solving a POMDP of the same size as that in the single-person setting. During action selection, \(V_{t}(b)\) is computed using the current \(b^{i}\) for each person. This kind of factorization corresponds to the assumption that each person’s movement and observations are independent of those of other people. Although violated in practice, such an assumption can nonetheless yield good approximations.
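The factored evaluation can be sketched as follows, assuming the factored value function is the sum of per-person values and that every person shares the same single-person set of \(\alpha \)-vectors. The vectors and beliefs below are hypothetical placeholders, not values from the experiments.

```python
def value(belief, alpha_vectors):
    """PWLC value of one belief: maximum over alpha-vectors of alpha . b."""
    return max(sum(a * p for a, p in zip(alpha, belief)) for alpha in alpha_vectors)


def factored_value(beliefs, alpha_vectors):
    """Sum of single-person values, one term per tracked person."""
    return sum(value(b_i, alpha_vectors) for b_i in beliefs)


# Hypothetical single-person value function over three cells.
ALPHAS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# One belief per person.
beliefs = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
print(factored_value(beliefs, ALPHAS))
```

Because the single-person value function is solved once and reused, the cost of evaluating a joint belief grows linearly with the number of people rather than exponentially.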

Figure 10 (top), which compares POMDP-IR to the coverage policy with one, two, and three people, shows that the advantage of POMDP-IR grows substantially as the number of people increases. Whereas POMDP-IR tries to maintain a good estimate of everyone’s position, the coverage policy just tries to look at the cells where the maximum number of people might be present, ignoring other cells completely.

### 8.2 Real data

Finally, we extended our analysis to a real-life dataset collected in a shopping mall. This dataset was gathered over 4 hours using 13 CCTV cameras located in a shopping mall (Bouma et al. 2013). Each camera uses an FPDW (Dollar et al. 2010) pedestrian detector to detect people in each camera image and in-camera tracking (Bouma et al. 2013) to generate tracks of the detected people’s movements over time.

The dataset consists of 9915 tracks, each specifying one person’s *x*–*y* position over time. Figure 11 shows sample tracks from all of the cameras.

Because the cameras have many overlapping regions (see Fig. 11), we were able to manually match tracks of the same person recorded individually by each camera. The “ground truth” was then constructed by taking a weighted mean of the matched tracks. Finally, this ground truth was used to estimate noise parameters for each cell (assuming zero-mean Gaussian noise), which were used as the observation function. Figure 12 shows that, as before, POMDP-IR substantially outperforms the coverage policy for various numbers of cameras. In addition to the reasons mentioned before, the high overlap between the cameras contributes to POMDP-IR’s superior performance. The coverage policy has difficulty ascertaining people’s exact locations because it is rewarded only for observing them somewhere in a camera’s large overlapping region, whereas POMDP-IR is rewarded for deducing their exact locations.

### 8.3 Greedy PBVI

To empirically evaluate greedy PBVI, we tested it on the problem of tracking either one person or multiple people using a multi-camera system.

The reward function is described as a set of |*S*| vectors, \(\varGamma ^{\rho } = \{\alpha _{1} \dots \alpha _{|S|} \}\), with \(\alpha _{i}(s) = 1\) if \(s=i\) and \(\alpha _{i}(s) = 0\) otherwise. The initial belief is uniform across all states. We planned for horizon \(h = 10\) with a discount factor \(\gamma = 0.99\).
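With this choice of \(\varGamma ^{\rho }\), the belief-based reward is the maximum over the unit vectors of \(\alpha _{i} \cdot b\), which equals the largest belief mass on any single state. A minimal sketch (function names are hypothetical):

```python
def unit_reward_vectors(n_states):
    """One vector per state i, with alpha_i(s) = 1 if s == i and 0 otherwise."""
    return [[1.0 if s == i else 0.0 for s in range(n_states)] for i in range(n_states)]


def belief_reward(b, gamma_rho):
    """rho(b) = max over alpha in Gamma^rho of sum_s alpha(s) b(s)."""
    return max(sum(a * p for a, p in zip(alpha, b)) for alpha in gamma_rho)


gamma_rho = unit_reward_vectors(4)
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.7, 0.1, 0.1, 0.1]
print(belief_reward(uniform, gamma_rho), belief_reward(peaked, gamma_rho))
```

The reward is lowest at the uniform initial belief and grows as the belief concentrates on one state, so it rewards the agent for being able to predict the person’s location.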

As baselines, we use regular PBVI and *myopic* versions of both greedy and regular PBVI that compute a policy assuming \(h=1\) and use it at each timestep. Figure 13 shows runtimes under different values of *N* and *K*. Since multi-person tracking uses the value function obtained by solving a single-person POMDP, single and multi-person tracking have the same runtimes. These results demonstrate that greedy PBVI requires only a fraction of the computational cost of regular PBVI. In addition, the difference in the runtime grows quickly as the action space gets larger: for \(N=5\) and \(K=2\) greedy PBVI is twice as fast, while for \(N=11\) and \(K=3\) it is approximately nine times as fast. Thus, greedy PBVI enables much better scalability in the action space. Figure 14, which shows the cumulative reward under different values of *N* and *K* for single-person (top) and multi-person (bottom) tracking, verifies that greedy PBVI’s speedup does not come at the expense of performance, as greedy PBVI accumulates nearly as much reward as regular PBVI. These results also show that both PBVI and greedy PBVI benefit from non-myopic planning. While the performance advantage of non-myopic planning is relatively modest, it increases with the number of cameras and people, which suggests that non-myopic planning is important to making active perception scalable.

## 9 Discussion and conclusions

In this article, we addressed the problem of active perception, in which an agent must take actions to reduce uncertainty about a hidden variable while reasoning about various constraints. Specifically, we modeled the task of surveillance with multi-camera tracking systems in large urban spaces as an active perception task. Since the state of the environment is dynamic, we model this task as a POMDP to compute closed-loop non-myopic policies that can reason about the long-term consequences of selecting a subset of sensors.

Formulating uncertainty reduction as an end in itself is a challenging task, as it breaks the PWLC property of the value function, which is imperative for solving POMDPs efficiently. \(\rho \)POMDP and POMDP-IR are two frameworks that allow formulating uncertainty reduction as an end in itself and do not break the PWLC property.

We showed that \(\rho \)POMDP and POMDP-IR are two equivalent frameworks for modeling active perception tasks. Thus, results that apply to one framework are also applicable to the other one. While \(\rho \)POMDP does not restrict the definition of \(\rho \) to a PWLC function, in this work we restrict the definition of \(\rho \)POMDP to a case where \(\rho \) is approximated with a PWLC function, as it is not feasible to efficiently solve a \(\rho \)POMDP where \(\rho \) is not a PWLC function.

We model the action space of the active perception POMDP as selecting *K* out of *N* sensors, where *K* is the maximum number of sensors allowed by the resource constraints. Recent POMDP solvers enable scalability in the state space. However, for active perception, as the number of sensors grows, the action space grows exponentially. We proposed greedy PBVI, a POMDP planning method that improves scalability in the action space of a POMDP. While we do not directly address the scaling in the observation space, we believe recent ideas on the factorization of the observation space (Veiga et al. 2014) can be combined with our approach to improve scalability in the state, action and observation spaces to solve active perception POMDPs.

By leveraging the theory of submodularity, we showed that the value function computed by greedy PBVI is guaranteed to have bounded error. Specifically, we extend Nemhauser’s result on greedy maximization of submodular functions to long-term planning. To apply these results to the active perception task, we showed that under certain conditions the value function of an active perception POMDP is submodular. One such condition requires that the future observations be independent of each other given the state. While this is a strong condition, it is only a sufficient condition and may not be a necessary one. Thus, one line of future work is to attempt to relax this condition for proving the submodularity of the value function. Finally, we showed that, even with a PWLC approximation to the true value function, which is submodular, the error in the value function computed by greedy PBVI remains bounded, thus enabling us to efficiently compute value functions for active perception POMDPs.

Greedy PBVI is ideally suited for active perception POMDPs for which the value function is submodular. However, in real-life situations, submodularity of the value function might not always hold. For example, when there is occlusion in our setting, it is possible for a combination of sensors to yield higher utility when selected together than the sum of their utilities when selected individually. Similar cases can arise when a mobile robot is trying to find the best point of view from which to observe a scene that is occluded. Thus, in cases like these, greedy PBVI might not return the best solution.

Our empirical analysis established the critical factors involved in the performance of active perception tasks. We showed that a belief-based formulation of uncertainty reduction beats a corresponding state-based reward baseline as well as other simple policies. While the non-myopic policy beats the myopic one, in certain cases the gain is marginal. However, in cases involving mobile sensors and budget constraints, non-myopic policies become critically important. Finally, experiments on a real-world dataset showed that the performance of greedy PBVI is similar to that of the existing methods but requires only a fraction of the computational cost, leading to much better scalability for solving active perception tasks.

## Footnotes

1. This article extends the research already presented by Satsangi et al. (2015) at AAAI 2015. In this article, we present additional theoretical results on equivalence of POMDP-IR and \(\rho \)POMDP, a new technique that exploits the independence properties of POMDP-IR to solve it more efficiently, and we present a detailed empirical analysis of belief-based rewards for POMDPs in active perception tasks.

2. We make this assumption without loss of generality. The following sections clarify that none of our results require this assumption.

3. Arguably, there is a counter-intuitive relation between the general class of POMDPs and the sub-class of pure active perception problems: on the one hand, the class of POMDPs is a more general set of problems, and it is intuitive to assume that there might be harder problems in the class. On the other hand, many POMDP problems admit a representation of the value function using a finite set of vectors. In contrast, the use of entropy would require an infinite number of vectors to merely represent the reward function. Therefore, even though we consider a specific sub-class of POMDPs, this class has properties that make it difficult to address using existing methods.

## Notes

### Acknowledgements

We thank Henri Bouma and TNO for providing us with the dataset used in our experiments. We also thank the STW User Committee for its advice regarding active perception for multi-camera tracking systems. This research is supported by the Dutch Technology Foundation STW (project #12622), which is part of the Netherlands Organisation for Scientific Research (NWO), and which is partly funded by the Ministry of Economic Affairs. Frans Oliehoek is funded by NWO Innovational Research Incentives Scheme Veni #639.021.336.

### References

- Araya-López, M., Thomas, V., Buffet, O., & Charpillet, F. (2010). A POMDP extension with belief-dependent rewards. In *Advances in neural information processing systems* (pp. 64–72). MIT Press.
- Aström, K. J. (1965). Optimal control of Markov decision processes with incomplete state estimation. *Journal of Mathematical Analysis and Applications*, *10*, 174–205.
- Bajcsy, R. (1988). Active perception. *Proceedings of the IEEE*, *76*(8), 966–1005.
- Bertsekas, D. P. (2007). *Dynamic programming and optimal control* (3rd ed., Vol. II). Belmont: Athena Scientific.
- Bonet, B., & Geffner, H. (2009). Solving POMDPs: RTDP-Bel vs. point-based algorithms. In *Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, IJCAI’09* (pp. 1641–1646).
- Bouma, H., Baan, J., Landsmeer, S., Kruszynski, C., van Antwerpen, G., & Dijk, J. (2013). Real-time tracking and fast retrieval of persons in multiple surveillance cameras of a shopping mall. In *Proceedings of SPIE* (Vol. 8756, pp. 87560A–1).
- Burgard, W., Fox, D., & Thrun, S. (1997). Active mobile robot localization by entropy minimization. In *Proceedings of the Second EUROMICRO Workshop on Advanced Mobile Robots 1997* (pp. 155–162). IEEE.
- Chen, Y., Javdani, S., Karbasi, A., Bagnell, J. A., Srinivasa, S., & Krause, A. (2015). Submodular surrogates for value of information. In *Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence* (pp. 3511–3518).
- Cheng, H. T. (1988). *Algorithms for partially observable Markov decision processes*. Ph.D. thesis, University of British Columbia.
- Cover, T. M., & Thomas, J. A. (1991). Entropy, relative entropy and mutual information. In *Elements of information theory* (pp. 12–49). Wiley.
- Dollar, P., Belongie, S., & Perona, P. (2010). The fastest pedestrian detector in the west. In *Proceedings of the British Machine Vision Conference, BMVA Press* (pp. 68.1–68.11).
- Eck, A., & Soh, L. K. (2012). Evaluating POMDP rewards for active perception. In *Proceedings of the Eleventh International Conference on Autonomous Agents and Multiagent Systems* (pp. 1221–1222).
- Fisher, M. L., Nemhauser, G. L., & Wolsey, L. A. (1978). *An analysis of approximations for maximizing submodular set functions—II*. Berlin: Springer.
- Gilbarg, D., & Trudinger, N. (2001). *Elliptic partial differential equations of second order*. Washington: U.S. Government Printing Office.
- Golovin, D., & Krause, A. (2011). Adaptive submodularity: Theory and applications in active learning and stochastic optimization. *Journal of Artificial Intelligence Research (JAIR)*, *42*, 427–486.
- Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. *Journal of Artificial Intelligence Research*, *13*, 33–94.
- Ji, S., Parr, R., & Carin, L. (2007). Nonmyopic multiaspect sensing with partially observable Markov decision processes. *IEEE Transactions on Signal Processing*, *55*, 2720–2730.
- Joshi, S., & Boyd, S. (2009). Sensor selection via convex optimization. *IEEE Transactions on Signal Processing*, *57*, 451–462.
- Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. *Artificial Intelligence*, *101*, 99–134.
- Kochenderfer, M. J. (2015). *Decision making under uncertainty: Theory and application*. Cambridge: MIT Press.
- Krause, A., & Golovin, D. (2014). Submodular function maximization. In L. Bordeaux, Y. Hamadi, & P. Kohli (Eds.), *Tractability: Practical approaches to hard problems*. Cambridge: Cambridge University Press.
- Krause, A., & Guestrin, C. (2005). Optimal nonmyopic value of information in graphical models—efficient algorithms and theoretical limits. In *Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence* (pp. 1339–1345).
- Krause, A., & Guestrin, C. (2007). Near-optimal observation selection using submodular functions. In *Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence* (pp. 481–492).
- Krause, A., & Guestrin, C. (2009). Optimal value of information in graphical models. *Journal of Artificial Intelligence Research*, *35*, 557–591.
- Kreucher, C., Kastella, K., & Hero, A. O., III. (2005). Sensor management using an active sensing approach. *Signal Processing*, *85*, 607–624.
- Krishnamurthy, V., & Djonin, D. V. (2007). Structured threshold policies for dynamic sensor scheduling—a partially observed Markov decision process approach. *IEEE Transactions on Signal Processing*, *55*(10), 4938–4957.
- Kumar, A., & Zilberstein, S. (2009). Event-detecting multi-agent MDPs: Complexity and constant-factor approximation. In *Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence* (pp. 201–207).
- Kurniawati, H., Hsu, D., & Lee, W. S. (2008). SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In *Proceedings robotics: Science and systems*.
- Kurniawati, H., Du, Y., Hsu, D., & Lee, W. S. (2011). Motion planning under uncertainty for robotic tasks with long time horizons. *The International Journal of Robotics Research*, *30*(3), 308–323.
- Littman, M. L. (1996).
*Algorithms for sequential decision making*. Ph.D. thesis, Brown University.Google Scholar - Lovejoy, W. S. (1991). Computationally feasible bounds for partially observed Markov decision processes.
*Operations Research*,*39*, 162–175.MathSciNetCrossRefMATHGoogle Scholar - Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory, models, and algorithms.
*Management Science*,*28*, 1–16.MathSciNetCrossRefMATHGoogle Scholar - Natarajan, P., Hoang, T. N., Low, K. H., Kankanhalli, M. (2012). Decision-theoretic approach to maximizing observation of multiple targets in multi-camera surveillance. In
*Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems*(pp. 155–162).Google Scholar - Nemhauser, G., Wolsey, L., & Fisher, M. (1978). An analysis of approximations for maximizing submodular set functions—I.
*Mathematical Programming*,*14*, 265–294.MathSciNetCrossRefMATHGoogle Scholar - Oliehoek, F. A., Whiteson, S., & Spaan, M. T. J. (2013). Approximate solutions for factored Dec-POMDPs with many agents. In
*Proceedings of the Twelfth International Joint Conference on Autonomous Agents and Multiagent Systems*(pp. 563–570).Google Scholar - Pineau, J., & Gordon, G. J. (2007). POMDP planning for robust robot control. In S. Thrun, R. Brooks, & H. Durrant-Whyte (Eds.),
*Robotics Research*(pp. 69–82). Springer.Google Scholar - Pineau, J., Gordon, G. J., & Thrun, S. (2006). Anytime point-based approximations for large POMDPs.
*Journal of Artificial Intelligence Research*,*27*, 335–380.MATHGoogle Scholar - Poupart, P. (2005).
*Exploiting structure to efficiently solve large scale partially observable Markov decision processes*. Ph.D. thesis, University of Toronto.Google Scholar - Raphael, C., & Shani, G. (2012). The skyline algorithm for POMDP value function pruning.
*Annals of Mathematics and Artificial Intelligence*,*65*(1), 61–77.MathSciNetCrossRefMATHGoogle Scholar - Ross, S., Pineau, J., Paquet, S., & Chaib-Draa, B. (2008). Online planning algorithms for POMDPs.
*Journal of Artificial Intelligence Research*,*32*, 663–704.MathSciNetMATHGoogle Scholar - Satsangi, Y., Whiteson, S., & Oliehoek, F. (2015). Exploiting submodular value functions for faster dynamic sensor selection. In
*AAAI 2015: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence*(pp. 3356–3363).Google Scholar - Satsangi, Y., Whiteson, S., & Oliehoek, F. A. (2016). PAC greedy maximization with efficient bounds on information gain for sensor selection. In
*Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16*(pp. 3220–3227). AAAI Press.Google Scholar - Shani, G., Pineau, J., & Kaplow, R. (2013). A survey of point-based POMDP solvers.
*Autonomous Agents and Multi-Agent Systems*,*27*(1), 1–51.CrossRefGoogle Scholar - Silver, D., Veness, J. (2010). Monte-carlo planning in large POMDPs. In
*Advances in neural information processing systems*(pp. 2164–2172).Google Scholar - Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov processes over a finite horizon.
*Operations Research*,*21*, 1071–1088.CrossRefMATHGoogle Scholar - Sondik, E. J. (1971). The optimal control of partially observable Markov processes. Ph.D. thesis, Stanford University, California, United States.Google Scholar
- Spaan, M. T. J. (2008). Cooperative active perception using POMDPs. In *AAAI Conference on Artificial Intelligence 2008: Workshop on Advancements in POMDP Solvers*.
- Spaan, M. T. J. (2012). Partially observable Markov decision processes. In M. Wiering & M. van Otterlo (Eds.), *Reinforcement learning: State of the art* (pp. 387–414). Berlin: Springer.
- Spaan, M. T. J., & Lima, P. U. (2009). A decision-theoretic approach to dynamic sensor selection in camera networks. In *International Conference on Automated Planning and Scheduling* (pp. 279–304).
- Spaan, M. T. J., & Vlassis, N. (2005). Perseus: Randomized point-based value iteration for POMDPs. *Journal of Artificial Intelligence Research*, *24*, 195–220.
- Spaan, M. T. J., Veiga, T. S., & Lima, P. U. (2010). Active cooperative perception in network robot systems using POMDPs. In *International Conference on Intelligent Robots and Systems* (pp. 4800–4805).
- Spaan, M. T. J., Veiga, T. S., & Lima, P. U. (2015). Decision-theoretic planning under uncertainty with information rewards for active cooperative perception. *Autonomous Agents and Multi-Agent Systems*, *29*, 1157–1185.
- Veiga, T., Spaan, M. T. J., & Lima, P. U. (2014). Point-based POMDP solving with factored value function approximation. In *AAAI 2014: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence*.
- White, C. C. (1991). A survey of solution techniques for the partially observed Markov decision process. *Annals of Operations Research*, *32*(1), 215–230.
- Williams, J., Fisher, J., & Willsky, A. (2007). Approximate dynamic programming for communication-constrained sensor network management. *IEEE Transactions on Signal Processing*, *55*, 4300–4311.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.