Exploiting Submodular Value Functions For Scaling Up Active Perception

In active perception tasks, an agent aims to select sensory actions that reduce its uncertainty about one or more hidden variables. While partially observable Markov decision processes (POMDPs) provide a natural model for such problems, reward functions that directly penalize uncertainty in the agent's belief can remove the piecewise-linear and convex (PWLC) property of the value function required by most POMDP planners. Furthermore, as the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially with it, making POMDP planning infeasible with traditional methods. In this article, we address a twofold challenge of modeling and planning for active perception tasks. We show the mathematical equivalence of $\rho$POMDP and POMDP-IR, two frameworks for modeling active perception tasks that restore the PWLC property of the value function. To efficiently plan for active perception tasks, we identify and exploit the independence properties of POMDP-IR to reduce the computational cost of solving POMDP-IR (and $\rho$POMDP). We propose greedy point-based value iteration (PBVI), a new POMDP planning method that uses greedy maximization to greatly improve scalability in the action space of an active perception POMDP. Furthermore, we show that, under certain conditions, including submodularity, the value function computed using greedy PBVI is guaranteed to have bounded error with respect to the optimal value function. We establish the conditions under which the value function of an active perception POMDP is guaranteed to be submodular. Finally, we present a detailed empirical analysis on a dataset collected from a multi-camera tracking system employed in a shopping mall. Our method achieves similar performance to existing methods but at a fraction of the computational cost, leading to better scalability for solving active perception tasks.


Introduction
Multi-sensor systems are becoming increasingly prevalent in a wide range of settings. For example, multi-camera systems are now routinely used for security, surveillance and tracking (Kreucher et al, 2005; Natarajan et al, 2012; Spaan et al, 2015). A key challenge in the design of these systems is the efficient allocation of scarce resources such as the bandwidth required to communicate the collected data to a central server, the CPU cycles required to process that data, and the energy costs of the entire system (Kreucher et al, 2005; Williams et al, 2007; Spaan and Lima, 2009). For example, state-of-the-art human activity recognition algorithms require high resolution video streams coupled with significant computational resources. When a human operator must monitor many camera streams, displaying only a small number of them can reduce the operator's cognitive load. IP-cameras connected directly to a local area network need to share bandwidth. Such constraints give rise to the dynamic sensor selection problem (Satsangi et al, 2015), in which an agent, at each time step, must select $K$ out of the $N$ available sensors to allocate these resources to, where $K$ is the maximum number of sensors allowed given the resource constraints.
For example, consider the surveillance task, in which a mobile robot aims to minimize its future uncertainty about the state of the environment but can use only $K$ of its $N$ sensors at each time step. Surveillance is an example of an active perception (Bajcsy, 1988) task, in which an agent takes actions to reduce uncertainty about one or more hidden variables, while reasoning about various resource constraints. When the state of the environment is static, a myopic approach that always selects actions that maximize the immediate expected reduction in uncertainty is typically sufficient. However, when the state changes over time, a non-myopic approach that reasons about the long-term effects of the action selection performed at each time step can be better. For example, in the surveillance task, as the robot moves and the state of the environment changes, it becomes essential to reason about the long-term consequences of the robot's actions to minimize the future uncertainty.
A natural decision-theoretic model for such an approach is the partially observable Markov decision process (POMDP) (Sondik, 1971; Kaelbling et al, 1998; Kochenderfer, 2015). POMDPs provide a comprehensive and powerful framework for planning under uncertainty. They can model the dynamic and partially observable state and express the goals of the system in terms of rewards associated with state-action pairs. This model of the world can be used to compute closed-loop, long-term policies that help the agent decide what actions to take given a belief about the state of the environment (Burgard et al, 1997; Kurniawati et al, 2010).
In a typical POMDP, reducing uncertainty about the state is only a means to an end. For example, a robot whose goal is to reach a particular location may take sensing actions that reduce its uncertainty about its current location because doing so helps it determine what future actions will bring it closer to its goal. By contrast, in active perception problems reducing uncertainty is an end in itself. For example, in the surveillance task, the system's goal is typically to ascertain the state of its environment, not to use that knowledge to achieve another goal. While perception is arguably always performed to aid decision-making, in an active perception problem that decision is made by another agent, such as a human, who is not modeled as part of the POMDP. For example, in the surveillance task, the robot might be able to detect suspicious activity, but only the human users of the system decide how to react to it.
One way to formulate uncertainty reduction as an end in itself is to define a reward function whose additive inverse is some measure of the agent's uncertainty about the hidden state, e.g., the entropy of its belief. However, this formulation leads to a reward function that conditions on the belief, rather than the state, and the resulting value function is not piecewise-linear and convex (PWLC), which makes many traditional POMDP solvers inapplicable. There exist online planning methods (Silver and Veness, 2010; Bonet and Geffner, 2009), which generate a policy on the fly, that do not require the PWLC property of the value function. However, many of these methods require multiple 'hypothetical' belief updates to compute the optimal policy, which makes them unsuitable for sensor selection, where the optimal policy must be computed in a fraction of a second. There exist other online planning methods that do not require hypothetical belief updates (Silver and Veness, 2010), but since we are dealing with belief-based rewards, they cannot be directly applied here. Here, we address the case of offline planning, where the policy is computed before execution of the task.
Thus, to efficiently solve active perception problems, we must (a) model the problem with minimizing uncertainty as the objective while maintaining a PWLC value function and (b) use this model to solve the POMDP efficiently. Recently, two frameworks have been proposed, ρPOMDP (Araya-López et al, 2010) and POMDP with Information Reward (POMDP-IR) (Spaan et al, 2015), to efficiently model active perception tasks such that the PWLC property of the value function is maintained. The idea behind ρPOMDP is to find a PWLC approximation to the "true" continuous belief-based reward function, and then solve it with traditional solvers. POMDP-IR, on the other hand, allows the agent to make predictions about the hidden state, and the agent is rewarded for accurate predictions via a state-based reward function. There is no research that examines the relationship between these two frameworks, their pros and cons, or their efficacy in realistic tasks; thus, it is not clear how to choose between these two frameworks to model active perception problems.
In this article, we address the problem of efficient modeling and planning for active perception tasks. First, we study the relationship between ρPOMDP and POMDP-IR. Specifically, we establish equivalence between them by showing that any ρPOMDP can be reduced to a POMDP-IR (and vice-versa) that preserves the value function for equivalent policies. Having established the theoretical relationship between ρPOMDP and POMDP-IR, we model the surveillance task as a POMDP-IR and propose a new method to solve it efficiently by exploiting a simple insight that lets us decompose the maximization over prediction actions and normal actions while computing the value function.
Although POMDPs are computationally difficult to solve, recent methods (Littman, 1996; Hauskrecht, 2000; Pineau et al, 2006; Spaan and Vlassis, 2005; Poupart, 2005; Ji et al, 2007; Kurniawati et al, 2008; Shani et al, 2012) have proved successful in solving POMDPs with large state spaces. Solving active perception POMDPs poses a different challenge: as the number of sensors grows, the size of the action space, $\binom{N}{K}$, grows exponentially with it. Current POMDP solvers fail to address scalability in the action space of a POMDP. We propose a new point-based planning method that scales much better in the number of sensors for such POMDPs. The main idea is to replace the maximization operator in the Bellman optimality equation with greedy maximization, in which a subset of sensors is constructed iteratively by adding the sensor that gives the largest marginal increase in value.
We present theoretical results bounding the error in the value functions computed by this method. We prove that, under certain conditions including submodularity, the value function computed using POMDP backups based on greedy maximization has bounded error. We achieve this by extending existing results for the greedy algorithm, which are valid only for a single time step, to a full sequential decision-making setting where the greedy operator is employed multiple times over multiple time steps. In addition, we show that the conditions required for such a guarantee to hold are met, or approximately met, if the reward is defined using negative belief entropy.
Finally, we present a detailed empirical analysis on a real-life dataset from a multi-camera tracking system installed in a shopping mall. We identify and study the critical factors relevant to the performance and behavior of the agent in active perception tasks. We show that our proposed planner outperforms a myopic baseline and nearly matches the performance of existing point-based methods while incurring only a fraction of the computational cost, leading to much better scalability in the number of cameras.

Related Work
Sensor selection as an active perception task has been studied in many contexts. Most work focuses on either open-loop or myopic solutions, e.g., Kreucher et al (2005), Spaan and Lima (2009), Williams et al (2007), and Joshi and Boyd (2009). Kreucher et al (2005) propose a Monte-Carlo approach that mainly focuses on a myopic solution. Williams et al (2007) and Joshi and Boyd (2009) developed planning methods that can provide long-term but open-loop policies. By contrast, a POMDP-based approach enables a closed-loop, non-myopic approach that can lead to better performance when the underlying state of the world changes over time. Spaan (2008), Spaan and Lima (2009), Spaan et al (2010) and Natarajan et al (2012) also consider a POMDP-based approach to active and cooperative active perception. However, they consider an objective function that conditions on the state and not on the belief, as belief-dependent rewards in a POMDP break the PWLC property of the value function. They use point-based methods (Spaan and Vlassis, 2005) for solving the POMDPs. While recent point-based methods (Shani et al, 2012) for solving POMDPs scale reasonably in the state space of POMDPs, they do not address scalability in the action and observation spaces of a POMDP.
Greedy PBVI focuses specifically on scalability in the action space of an active perception POMDP and provides better scalability by leveraging greedy maximization. Traditionally, POMDPs require the reward function to be defined as a function of the state. However, for active perception POMDPs, the objective is to reduce the uncertainty in the belief of the agent.
In recent years, applying greedy maximization to submodular functions has become a popular and effective approach to sensor placement/selection (Krause and Guestrin, 2005, 2007; Kumar and Zilberstein, 2009). However, such work focuses on myopic or fully observable settings and thus does not enable the long-term planning required to cope with dynamic state in a POMDP.
Adaptive submodularity (Golovin and Krause, 2011) is a recently developed extension that addresses these limitations by allowing action selection to condition on previous observations. However, it assumes a static state and thus cannot model the dynamics of a POMDP across timesteps. Therefore, in a POMDP, adaptive submodularity is only applicable within a timestep, during which state does not change but the agent can sequentially add sensors to a set. In principle, adaptive submodularity could enable this intra-timestep sequential process to be adaptive, i.e., the choice of later sensors could condition on the observations generated by earlier sensors. However, this is not possible in our setting because (a) we assume that, due to computational costs, all sensors must be selected simultaneously; (b) information gain is not known to be adaptive submodular (Chen et al, 2015). Consequently, our analysis considers only classic, non-adaptive submodularity.
To our knowledge, our work is the first to establish the sufficient conditions for the submodularity of POMDP value functions for active perception POMDPs and thus leverage greedy maximization to scalably compute bounded approximate policies for dynamic sensor selection modeled as a full POMDP.

Background
In this section, we provide background on POMDPs, active perception POMDPs and solution methods for POMDPs.

Partially Observable Markov Decision Processes
POMDPs provide a decision-theoretic framework for modeling partial observability and dynamic environments. Formally, a POMDP is defined by a tuple $\langle S, A, \Omega, T, O, R, b_0, h \rangle$. At each time step, the environment is in a state $s \in S$, the agent takes an action $a \in A$ and receives a reward whose expected value is $R(s, a)$, and the system transitions to a new state $s' \in S$ according to the transition function $T(s, a, s') = \Pr(s'|s, a)$. Then, the agent receives an observation $z \in \Omega$ according to the observation function $O(s', a, z) = \Pr(z|s', a)$. Starting from an initial belief $b_0$, the agent maintains a belief $b(s)$ about the state, which is a probability distribution over all the possible states. The number of time steps for which the decision process lasts, i.e., the horizon, is denoted by $h$. If the agent takes action $a$ in belief $b$ and observes $z$, the updated belief $b^{a,z}(s)$ can be computed using Bayes' rule. A policy $\pi$ specifies how the agent acts in each belief. Given $b(s)$ and $R(s, a)$, one can compute a belief-based reward $\rho(b, a)$ as:
$$\rho(b, a) = \sum_s b(s)\, R(s, a). \tag{1}$$
The $t$-step value function of a policy, $V^\pi_t$, is defined as the expected future discounted reward the agent can gather by following $\pi$ for the next $t$ steps. $V^\pi_t$ can be characterized recursively using the Bellman equation:
$$V^\pi_t(b) = \rho(b, a_\pi) + \sum_{z \in \Omega} \Pr(z|a_\pi, b)\, V^\pi_{t-1}(b^{a_\pi, z}),$$
where $a_\pi = \pi(b)$ and $V^\pi_0(b) = 0$. The action-value function $Q^\pi_t(b, a)$ is the value of taking action $a$ and following $\pi$ thereafter:
$$Q^\pi_t(b, a) = \rho(b, a) + \sum_{z \in \Omega} \Pr(z|a, b)\, V^\pi_{t-1}(b^{a, z}).$$
The policy that maximizes $V^\pi_t$ is called the optimal policy $\pi^*$ and the corresponding value function is called the optimal value function $V^*_t$. The optimal value function $V^*_t(b)$ can be characterized recursively as:
$$V^*_t(b) = \max_{a} \Big[ \rho(b, a) + \sum_{z \in \Omega} \Pr(z|a, b)\, V^*_{t-1}(b^{a, z}) \Big]. \tag{4}$$
We can also define the Bellman optimality operator $\mathfrak{B}^*$:
$$(\mathfrak{B}^* V)(b) = \max_{a} \Big[ \rho(b, a) + \sum_{z \in \Omega} \Pr(z|a, b)\, V(b^{a, z}) \Big],$$
and write (4) as $V^*_t(b) = (\mathfrak{B}^* V^*_{t-1})(b)$. An important consequence of these equations is that the value function is piecewise-linear and convex (PWLC), as shown in Figure 1, a property exploited by most POMDP planners. Sondik (1971) showed that a PWLC value function at any finite time step $t$ can be expressed as a set of vectors: $\Gamma_t = \{\alpha_0, \alpha_1, \dots, \alpha_m\}$. Each $\alpha_i$ represents an $|S|$-dimensional hyperplane defining the value function over a bounded region of the belief space. The value of a given belief point can be computed from the vectors as:
$$V^*_t(b) = \max_{\alpha \in \Gamma_t} \sum_s b(s)\, \alpha(s).$$
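To make the notation concrete, here is a minimal sketch (ours, not from the original text) of the Bayes-rule belief update and the belief-based reward in (1), assuming the model is given as tabular numpy arrays T, O and R:

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes-rule update: b'(s') is proportional to O(s',a,z) * sum_s T(s,a,s') b(s).

    b: belief vector of length |S|; T: |S|x|A|x|S| transition tensor;
    O: |S|x|A|x|Z| observation tensor indexed by the *next* state s'."""
    predicted = b @ T[:, a, :]              # sum_s b(s) T(s,a,s')
    unnormalized = O[:, a, z] * predicted   # weight by observation likelihood
    return unnormalized / unnormalized.sum()

def rho(b, a, R):
    """Belief-based reward rho(b,a) = sum_s b(s) R(s,a), as in (1)."""
    return b @ R[:, a]
```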

POMDP Solvers
Exact methods like Monahan's enumeration algorithm (Monahan, 1982) compute the value function for all possible belief points by computing the optimal $\Gamma_t$. Point-based planners (Pineau et al, 2006; Shani et al, 2012; Spaan and Vlassis, 2005), on the other hand, avoid the expense of solving for all belief points by computing $\Gamma_t$ only for a set of sampled beliefs $B$. Since exact POMDP solvers (Sondik, 1971; Monahan, 1982) are intractable for all but the smallest POMDPs, we focus on point-based methods here. Point-based methods compute $\Gamma_t$ using the following recursive algorithm. At each iteration (starting from $t = 1$), for each action $a$ and observation $z$, the $\alpha$-vectors from the previous iteration are back-projected:
$$\Gamma^{a,z}_t = \{\alpha^{a,z}_i : \alpha_i \in \Gamma_{t-1}\}, \qquad \alpha^{a,z}_i(s) = \sum_{s'} O(s', a, z)\, T(s, a, s')\, \alpha_i(s').$$
Next, $\Gamma^a_t$ is computed only for the sampled beliefs, i.e.,
$$\Gamma^a_t = \{\alpha^a_b : b \in B\}, \qquad \alpha^a_b = r_a + \sum_{z \in \Omega} \operatorname*{argmax}_{\alpha \in \Gamma^{a,z}_t} \sum_s b(s)\, \alpha(s),$$
where $r_a(s) = R(s, a)$. Finally, the best $\alpha$-vector for each $b \in B$ is selected:
$$\alpha_b = \operatorname*{argmax}_{\alpha^a_b,\, a \in A} \sum_s b(s)\, \alpha^a_b(s), \qquad \Gamma_t = \{\alpha_b : b \in B\}.$$
At each timestep $t$, the above algorithm generates $|A||\Omega||\Gamma_{t-1}|$ back-projections in $O(|S|^2|A||\Omega||\Gamma_{t-1}|)$ time and then reduces them to $|B|$ vectors in $O(|S||B||A||\Omega||\Gamma_{t-1}|)$ time (Pineau et al, 2006).
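The backup just described can be sketched in a few lines of code. The following is an illustrative implementation (ours) of one point-based backup over the sampled belief set B, using the tabular arrays from the previous sketch and omitting the pruning refinements of specific planners:

```python
import numpy as np

def point_based_backup(Gamma_prev, B, T, O, R):
    """One point-based backup: returns one alpha-vector per sampled belief in B.

    Gamma_prev: list of alpha-vectors (length-|S| arrays) from the previous step."""
    num_actions, num_obs = T.shape[1], O.shape[2]
    Gamma_t = []
    for b in B:
        best_value, best_alpha = -np.inf, None
        for a in range(num_actions):
            alpha_b_a = R[:, a].copy()                     # immediate reward vector r_a
            for z in range(num_obs):
                # Back-project each previous alpha-vector through (a, z).
                backproj = [T[:, a, :] @ (O[:, a, z] * alpha) for alpha in Gamma_prev]
                # Keep the back-projection that is best at this belief point.
                alpha_b_a += max(backproj, key=lambda alpha: b @ alpha)
            if b @ alpha_b_a > best_value:
                best_value, best_alpha = b @ alpha_b_a, alpha_b_a
        Gamma_t.append(best_alpha)
    return Gamma_t
```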

Active Perception POMDP
The goal in an active perception POMDP is to reduce uncertainty about a feature of interest that is not directly observable. In general, the feature of interest may be only part of the state, e.g., if a surveillance system cares only about people's positions, not their velocities, or a higher-level feature derived from the state. However, for simplicity, we focus on the case where the feature of interest is just the state $s$ of the POMDP. For simplicity, we also focus on pure active perception tasks in which the agent's only goal is to reduce uncertainty about the state, as opposed to hybrid tasks where the agent may also have other goals. For such cases, hybrid rewards (Eck and Soh, 2012), which combine the advantages of belief-based and state-based rewards, are appropriate. Although not covered in this article, it is straightforward to extend our results to hybrid tasks (Spaan et al, 2015).
We model the active perception task as a POMDP in which an agent must choose a subset of available sensors at each time step. We assume that all selected sensors must be chosen simultaneously, i.e., it is not possible within a timestep to condition the choice of one sensor on the observations generated by another sensor. This corresponds to the common setting where generating each sensor's observation is time consuming, e.g., in the surveillance task, because it requires applying expensive computer vision algorithms, and thus all the observations from the selected cameras must be generated in parallel. Formally, an active perception POMDP has the same components as a regular POMDP, except that each action $a = \langle a_1, \dots, a_N \rangle$ is a vector of binary action features, one per sensor, indicating whether that sensor is selected; equivalently, $a$ can be viewed as a subset of size at most $K$ of the set $A^+$ of all $N$ sensors. Each observation $z = \langle z_1, \dots, z_N \rangle$ is a vector of observation features, one per sensor, where $z_i$ is null if sensor $i$ was not selected. To prevent ambiguity about which sensor generated which observation in $z$, we assume that, for all $i$ and $j$, the domains of $z_i$ and $z_j$ share only the null observation. This assumption is made only for notational convenience and does not restrict the applicability of our methods in any way.
For example, in the surveillance task, $a$ indicates the set of cameras that are active and $z$ contains the observations received from the cameras in $a$. The model for the sensor selection problem in the surveillance task is shown in Figure 2. Here, we assume that the actions involve only selecting $K$ out of $N$ sensors. The transition function is thus independent of the actions, as selecting sensors cannot change the state. However, as we outline in a later subsection (6.4), it is possible to extend our results to general active perception POMDPs with arbitrary transition functions, which can model, e.g., mobile sensors that, by moving, change the state.
A challenge in these settings is properly formalizing the reward function. Because the goal is to reduce uncertainty, the reward is a direct function of the belief, not the state, i.e., the agent has no preference for one state over another, so long as it knows what that state is. Hence, there is no meaningful way to define a state-based reward function $R(s, a)$. Directly defining $\rho(b, a)$ using, e.g., the negative belief entropy $-H_b(s) = \sum_s b(s) \log(b(s))$ results in a value function that is not piecewise-linear: since $\rho(b, a)$ is no longer a convex combination of a state-based reward function, the value function is no longer guaranteed to be PWLC, a property most POMDP solvers rely on. In the following subsections, we describe two recently proposed frameworks designed to address this problem.

ρPOMDPs
A ρPOMDP (Araya-López et al, 2010), defined by a tuple $\langle S, A, T, \Omega, O, \Gamma_\rho, b_0, h \rangle$, is a normal POMDP except that the state-based reward function $R(s, a)$ has been omitted and $\Gamma_\rho$ has been added. $\Gamma_\rho$ is a set of vectors that defines the immediate reward of the ρPOMDP. Since we consider only pure active perception tasks, ρ depends only on $b$, not on $a$, and can be written as $\rho(b)$. Given $\Gamma_\rho$, $\rho(b)$ can be computed as:
$$\rho(b) = \max_{\alpha_\rho \in \Gamma_\rho} \sum_s b(s)\, \alpha_\rho(s).$$
If the true reward function is not PWLC, e.g., negative belief entropy, it can be approximated by defining $\Gamma_\rho$ as a set of vectors, each of which is tangent to the true reward function. Figure 3 illustrates approximating negative belief entropy with different numbers of tangents. Solving a ρPOMDP requires a minor change to existing algorithms. In particular, since $\Gamma_\rho$ is a set of vectors, instead of a single vector, an additional cross-sum is required to compute $\Gamma^a_t$:
$$\Gamma^a_t = \Gamma_\rho \oplus \Gamma^{a,z_1}_t \oplus \Gamma^{a,z_2}_t \oplus \cdots \oplus \Gamma^{a,z_{|\Omega|}}_t.$$
Araya-López et al (2010) showed that the error in the value function computed by this approach, relative to the true reward function whose tangents were used to define $\Gamma_\rho$, is bounded. However, the additional cross-sum increases the computational complexity of computing $\Gamma^a_t$ to $O(|S||A||\Gamma_{t-1}||\Omega||B||\Gamma_\rho|)$ for point-based methods.
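As an illustration of how $\Gamma_\rho$ can be built, the following sketch (ours) constructs tangent vectors to the negative belief entropy at a few chosen belief points; on the belief simplex, the tangent at $b_0$ reduces to $\alpha(s) = \log b_0(s)$, and $\rho(b)$ is then the maximum over these vectors:

```python
import numpy as np

def tangent_vector(b0):
    """Tangent hyperplane to the negative belief entropy sum_s b(s) log b(s)
    at belief point b0; restricted to the simplex it is alpha(s) = log b0(s)."""
    return np.log(b0)

def make_Gamma_rho(tangent_beliefs):
    """Gamma_rho: one tangent alpha-vector per chosen belief point."""
    return [tangent_vector(b0) for b0 in tangent_beliefs]

def rho_pwlc(b, Gamma_rho):
    """PWLC approximation: rho(b) = max_alpha sum_s b(s) alpha(s)."""
    return max(b @ alpha for alpha in Gamma_rho)

# Example: two tangents for a two-state problem, drawn at b(s1) = 0.3 and 0.7.
Gamma_rho = make_Gamma_rho([np.array([0.3, 0.7]), np.array([0.7, 0.3])])
print(rho_pwlc(np.array([0.5, 0.5]), Gamma_rho))  # lower bound on -H_b = log(0.5) ~ -0.69
```

Because the tangents lower-bound the convex function $-H_b(s)$, adding more tangent points tightens the approximation, which is exactly the effect illustrated in Figure 3.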
Though ρPOMDPs do not put any constraints on the definition of ρ, we restrict ρ for an active perception POMDP to be defined by a set of vectors, ensuring that ρ is PWLC, which in turn ensures that the value function is PWLC. This is not a severe restriction because solving a ρPOMDP using offline planning requires a PWLC approximation of ρ anyway.

POMDPs with Information Rewards
Spaan et al (2015) proposed POMDPs with information rewards (POMDP-IR), an alternative framework for modeling active perception tasks that relies only on the standard POMDP. Instead of directly rewarding low uncertainty in the belief, the agent is given the chance to make predictions about the hidden state and is rewarded, via a standard state-based reward function, for making accurate predictions. Formally, a POMDP-IR is a POMDP in which each action $a \in A$ is a tuple $\langle a_n, a_p \rangle$, where $a_n \in A_n$ is a normal action, e.g., moving a robot or turning on a camera (in our case $a_n$ is the sensor-selection action described earlier), and $a_p \in A_p$ is a prediction action, which expresses a prediction about the state. The joint action space is thus the Cartesian product of $A_n$ and $A_p$, i.e., $A = A_n \times A_p$.
Prediction actions have no effect on states or observations but can trigger rewards via the standard state-based reward function $R(s, a_n, a_p)$. While there are many ways to define $A_p$ and $R$, a simple approach is to create one prediction action for each state, i.e., $A_p = S$, and give the agent a positive reward if and only if it correctly predicts the true state:
$$R(s, a_n, a_p) = \begin{cases} 1 & \text{if } a_p = s, \\ 0 & \text{otherwise.} \end{cases}$$
Thus, POMDP-IR indirectly rewards beliefs with low uncertainty, since these enable more accurate predictions and thus more expected reward. Furthermore, since a state-based reward function is explicitly defined, ρ can be defined as a convex combination of $R$, as in (1), guaranteeing a PWLC value function, as in a regular POMDP. Thus, a POMDP-IR can be solved with standard POMDP planners. However, the introduction of prediction actions leads to a blowup in the size of the joint action space $|A| = |A_n||A_p|$ of POMDP-IR. Replacing $|A|$ with $|A_n||A_p|$ in the earlier analysis yields a complexity of computing $\Gamma^a_t$ for POMDP-IR of $O(|S||A_n||\Gamma_{t-1}||\Omega||B||A_p|)$ for point-based methods.
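For concreteness, a minimal sketch (ours) of this prediction-reward scheme; with one prediction action per state, the expected immediate reward of the best prediction action at a belief reduces to the maximum belief entry:

```python
import numpy as np

def prediction_reward(s, a_p):
    """R(s, a_n, a_p): +1 for correctly predicting the true state, 0 otherwise
    (independent of the normal action a_n in the pure active perception setting)."""
    return 1.0 if a_p == s else 0.0

def best_prediction_value(b):
    """Expected immediate reward of the best prediction action:
    max_{a_p} sum_s b(s) R(s, a_p), which equals max_s b(s) for this scheme."""
    num_states = len(b)
    return max(sum(b[s] * prediction_reward(s, a_p) for s in range(num_states))
               for a_p in range(num_states))

print(best_prediction_value(np.array([0.1, 0.7, 0.2])))  # 0.7
```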

Note that, though not made explicit in Spaan et al (2015), several independence properties are inherent to the POMDP-IR framework, as shown in Figure 4. The two important properties are (a) in our setting, the reward function is independent of the normal actions, and (b) the transition and observation functions are independent of the prediction actions. Although POMDP-IR can model hybrid rewards, where normal actions can reward the agent in addition to prediction actions (Spaan et al, 2015), in this article, because we focus on pure active perception, the reward function $R$ is independent of the normal actions. Furthermore, state transitions and observations are independent of the prediction actions. In Section 6, we introduce a new technique that exploits these independence properties to solve a POMDP-IR much more efficiently and thus avoid the blowup in the size of the action space caused by the introduction of prediction actions. Although the reward function in our setting is independent of the normal actions, the main results we present in this article do not depend on this property and can easily be extended or applied to cases where the reward depends on the normal actions.

ρPOMDP and POMDP-IR Equivalence
ρPOMDP and POMDP-IR offer two perspectives on modeling active perception tasks. ρPOMDP starts from a "true" belief-based reward function, such as the negative entropy, and then seeks a PWLC approximation via a set of tangents to the curve. By contrast, POMDP-IR starts from the queries that the user of the system will pose, e.g., "What is the position of everyone in the room?" or "How many people are in the room?", and creates prediction actions that reward the agent for answering such queries correctly. In this section, we establish the relationship between these two frameworks by proving the equivalence of ρPOMDP and POMDP-IR. By equivalence we mean that, given a ρPOMDP and a policy, we can construct a corresponding POMDP-IR and a policy such that the value functions of the two policies are exactly the same. We show this equivalence by starting with a ρPOMDP and a policy and introducing a reduction procedure for both the ρPOMDP and the policy (and vice-versa). Using the reduction procedure, we reduce the ρPOMDP to a POMDP-IR and the policy for the ρPOMDP to an equivalent policy for the POMDP-IR. We then show that the value function $V^\pi_t$ is the same for the ρPOMDP we started with and the reduced POMDP-IR, under the given and the reduced policy. To complete our proof, we repeat the same process by starting with a POMDP-IR and reducing it to a ρPOMDP, and show that the value function $V^\pi_t$ for the POMDP-IR and the corresponding ρPOMDP is the same.
Definition 1 Given a ρPOMDP $M_\rho$, the reduce-pomdp-ρ-IR($M_\rho$) procedure produces a POMDP-IR $M_{IR}$ via the following procedure.
- The set of states, set of observations, initial belief and horizon remain unchanged. Since the set of states remains unchanged, the set of all possible beliefs is also the same for $M_{IR}$ and $M_\rho$.
- The set of normal actions in $M_{IR}$ equals the set of actions in $M_\rho$, i.e., $A_{n,IR} = A_\rho$, and one prediction action is created for each vector $\alpha_\rho \in \Gamma_\rho$, with the reward function defined as $R_{IR}(s, \langle a_n, a_p \rangle) = \alpha_\rho^{a_p}(s)$, where $\alpha_\rho^{a_p}$ is the vector corresponding to $a_p$.
- The transition and observation functions in $M_{IR}$ behave the same as in $M_\rho$ for each $a_n$ and ignore the $a_p$, i.e., for all $a_n \in A_{n,IR}$: $T_{IR}(s, a_n, s') = T_\rho(s, a, s')$ and $O_{IR}(s', a_n, z) = O_\rho(s', a, z)$, where $a \in A_\rho$ corresponds to $a_n$.
For example, consider a ρPOMDP with two states in which ρ is defined using tangents to the negative belief entropy at $b(s_1) = 0.3$ and $b(s_1) = 0.7$. When reduced to a POMDP-IR, the resulting reward function gives a small negative reward for correct predictions and a larger one for incorrect predictions, with the magnitudes determined by the values of the tangents at $b(s_1) = 0$ and $b(s_1) = 1$. This is illustrated in Figure 3 (top).
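To make the numbers concrete, consider the tangent drawn at $b(s_1) = 0.3$ (a worked example of ours, using natural logarithms). Its values at the corners of the belief simplex, which become the reward entries of the corresponding prediction action, are:

```latex
% Tangent to \rho(b) = b\log b + (1-b)\log(1-b) at b(s_1) = 0.3 (natural log):
\rho(0.3) \approx -0.61, \qquad \rho'(0.3) = \log\tfrac{0.3}{0.7} \approx -0.85,
\\
\text{value at } b(s_1)=0:\quad \rho(0.3) - 0.3\,\rho'(0.3) = \log 0.7 \approx -0.36,
\\
\text{value at } b(s_1)=1:\quad \rho(0.3) + 0.7\,\rho'(0.3) = \log 0.3 \approx -1.20.
```

This tangent leans towards $s_2$, so the corresponding prediction action receives the smaller penalty ($\log 0.7$) when the prediction is correct and the larger penalty ($\log 0.3$) when it is incorrect.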
Definition 2 Given a policy $\pi_\rho$ for a ρPOMDP $M_\rho$, the reduce-policy-ρ-IR($\pi_\rho$) procedure produces a policy $\pi_{IR}$ for a POMDP-IR as follows. For all $b$,
$$\pi_{IR}(b) = \Big\langle \pi_\rho(b),\ \operatorname*{argmax}_{a_p \in A_p} \sum_s b(s)\, R(s, a_p) \Big\rangle.$$
That is, $\pi_{IR}$ selects the same normal action as $\pi_\rho$ and the prediction action that maximizes expected immediate reward.
Using these definitions, we prove that solving M ρ is the same as solving M IR .
Theorem 1 Let $M_\rho$ be a ρPOMDP and $\pi_\rho$ an arbitrary policy for $M_\rho$. Furthermore, let $M_{IR}$ = reduce-pomdp-ρ-IR($M_\rho$) and $\pi_{IR}$ = reduce-policy-ρ-IR($\pi_\rho$). Then, for all $b$,
$$V^{IR}_t(b) = V^\rho_t(b),$$
where $V^{IR}_t$ is the $t$-step value function for $\pi_{IR}$ and $V^\rho_t$ is the $t$-step value function for $\pi_\rho$.
Proof See Appendix.
Definition 3 Given a POMDP-IR $M_{IR}$, the reduce-pomdp-IR-ρ($M_{IR}$) procedure produces a ρPOMDP $M_\rho$ via the following procedure.
- The set of states, set of observations, initial belief and horizon remain unchanged. Since the set of states remains unchanged, the set of all possible beliefs is also the same for $M_{IR}$ and $M_\rho$.
- The set of actions in $M_\rho$ is equal to the set of normal actions in $M_{IR}$, i.e., $A_\rho = A_{n,IR}$, and $\Gamma_\rho$ contains one vector per prediction action, defined as $\alpha_\rho^{a_p}(s) = R_{IR}(s, \langle a_n, a_p \rangle)$.
- The transition and observation functions in $M_\rho$ behave the same as in $M_{IR}$ for each $a_n$ and ignore the $a_p$, i.e., for all $a \in A_\rho$: $T_\rho(s, a, s') = T_{IR}(s, a_n, s')$ and $O_\rho(s', a, z) = O_{IR}(s', a_n, z)$, where $a_n \in A_{n,IR}$ is the action corresponding to $a \in A_\rho$.
Definition 4 Given a policy $\pi_{IR}(b) = \langle a_n, a_p \rangle$ for a POMDP-IR $M_{IR}$, the reduce-policy-IR-ρ($\pi_{IR}$) procedure produces a policy $\pi_\rho$ for a ρPOMDP as follows. For all $b$,
$$\pi_\rho(b) = a_n, \quad \text{where } \pi_{IR}(b) = \langle a_n, a_p \rangle.$$
Theorem 2 Let $M_{IR}$ be a POMDP-IR and $\pi_{IR}(b) = \langle a_n, a_p \rangle$ a policy for $M_{IR}$, such that $a_p = \operatorname*{argmax}_{a'_p} \sum_s b(s)\, R(s, a'_p)$. Furthermore, let $M_\rho$ = reduce-pomdp-IR-ρ($M_{IR}$) and $\pi_\rho$ = reduce-policy-IR-ρ($\pi_{IR}$). Then, for all $b$,
$$V^{IR}_t(b) = V^\rho_t(b),$$
where $V^{IR}_t$ is the value of following $\pi_{IR}$ in $M_{IR}$ and $V^\rho_t$ is the value of following $\pi_\rho$ in $M_\rho$.
Proof See Appendix.
The main implication of these theorems is that any result that holds for either ρPOMDP or POMDP-IR also holds for the other framework. For example, the results presented in Theorem 4.3 of Araya-López et al (2010) that bound the error in the value function of a ρPOMDP also hold for POMDP-IR. Furthermore, with this equivalence, the computational complexity of solving ρPOMDP and POMDP-IR comes out to be the same, since a POMDP-IR can be converted into a ρPOMDP (and vice-versa) trivially, without any significant blow-up in representation. Although we have proved the equivalence of ρPOMDP and POMDP-IR only for pure active perception tasks, where the reward is conditioned solely on the belief, it is straightforward to extend it to hybrid active perception tasks, where the reward is conditioned both on the belief and the state. Although the resulting active perception POMDP for dynamic sensor selection is such that the actions do not affect the state, the results in this section do not use that property at all and thus remain valid for active perception POMDPs in which an agent may take actions that affect the state at the next time step.

Decomposed Maximization for POMDP-IR
The POMDP-IR framework enables us to formulate uncertainty reduction as an objective, but it does so at the cost of additional computation, as adding prediction actions enlarges the action space and thereby the cost of performing a point-based backup. In this section, we present a new technique that exploits the independence properties of POMDP-IR, namely that the transition function and the observation function are independent of the prediction actions, to reduce the computational cost. We also show that the same principle is applicable to ρPOMDPs.
The increased computational cost of solving POMDP-IR arises from the size of the action space, $|A_n||A_p|$. However, as shown in Figure 4, prediction actions only affect the reward function and normal actions only affect the observation and transition functions. We exploit this independence to decompose the maximization in the Bellman optimality equation:
$$V^*_t(b) = \max_{a_n \in A_n} \Big[ \max_{a_p \in A_p} \sum_s b(s)\, R(s, a_p) + \sum_{z \in \Omega} \Pr(z|a_n, b)\, V^*_{t-1}(b^{a_n, z}) \Big].$$
This decomposition can be exploited by point-based methods by computing $\Gamma^{a,z}_t$ only for normal actions $a_n$ and $\alpha^{a_p}$ only for prediction actions. That is, the back-projection step is changed to:
$$\Gamma^{a_n,z}_t = \{\alpha^{a_n,z}_i : \alpha_i \in \Gamma_{t-1}\}, \qquad \alpha^{a_n,z}_i(s) = \sum_{s'} O(s', a_n, z)\, T(s, a_n, s')\, \alpha_i(s').$$
For each prediction action, we compute the vector specifying the immediate reward for performing that prediction action in each state:
$$\alpha^{a_p}(s) = R(s, a_p).$$
The next step is to modify the computation of $\Gamma^a_t$ so that the vector maximizing the expected reward induced by the prediction actions and the vector maximizing the expected return induced by the normal action are computed separately:
$$\alpha^{a_n}_b = \operatorname*{argmax}_{\alpha^{a_p}} \sum_s b(s)\, \alpha^{a_p}(s) + \sum_{z \in \Omega} \operatorname*{argmax}_{\alpha \in \Gamma^{a_n,z}_t} \sum_s b(s)\, \alpha(s).$$
By decomposing the maximization, this approach avoids iterating over all $|A_n||A_p|$ joint actions. At each timestep $t$, it generates $|A_n||\Omega||\Gamma_{t-1}| + |A_p|$ back-projections in $O(|S|^2|A_n||\Omega||\Gamma_{t-1}| + |S||A_p|)$ time and then prunes them to $|B|$ vectors, with a computational complexity of $O(|S||B|(|A_p| + |A_n||\Gamma_{t-1}||\Omega|))$.
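The decomposed backup can be sketched as follows (our own illustration, reusing the tabular arrays from the earlier sketches; R_p[s, a_p] denotes the prediction-action reward):

```python
import numpy as np

def decomposed_backup(Gamma_prev, B, T, O, R_p):
    """Point-based backup for POMDP-IR with decomposed maximization.

    Normal actions only drive T and O; prediction actions only drive R_p,
    so the two maximizations are carried out independently."""
    num_normal, num_obs = T.shape[1], O.shape[2]
    # Immediate-reward vectors, one per prediction action: alpha_{a_p}(s) = R_p(s, a_p).
    pred_vectors = [R_p[:, a_p] for a_p in range(R_p.shape[1])]
    Gamma_t = []
    for b in B:
        best_value, best_alpha = -np.inf, None
        # Best prediction vector at this belief, chosen once, independent of a_n.
        best_pred = max(pred_vectors, key=lambda alpha: b @ alpha)
        for a_n in range(num_normal):
            alpha = best_pred.copy()
            for z in range(num_obs):
                backproj = [T[:, a_n, :] @ (O[:, a_n, z] * g) for g in Gamma_prev]
                alpha += max(backproj, key=lambda g: b @ g)
            if b @ alpha > best_value:
                best_value, best_alpha = b @ alpha, alpha
        Gamma_t.append(best_alpha)
    return Gamma_t
```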
The same principle can be applied to ρPOMDPs by changing the computation of $\Gamma^a_t$ so that it maximizes over the immediate reward independently from the future return:
$$\alpha^a_b = \operatorname*{argmax}_{\alpha_\rho \in \Gamma_\rho} \sum_s b(s)\, \alpha_\rho(s) + \sum_{z \in \Omega} \operatorname*{argmax}_{\alpha \in \Gamma^{a,z}_t} \sum_s b(s)\, \alpha(s).$$
The computational complexity of solving a ρPOMDP with this approach is $O(|S|^2|A||\Omega||\Gamma_{t-1}| + |S||\Gamma_\rho|) + O(|S||B|(|\Gamma_\rho| + |A||\Gamma_{t-1}||\Omega|))$. Thus, even though both POMDP-IR and ρPOMDP use extra actions or vectors to formulate belief-based rewards, they can both be solved at only minimal additional computational cost.

Greedy PBVI
The previous sections allow us to model the active perception task efficiently, such that the PWLC property of the value function is maintained. Thus, we can now directly employ traditional POMDP solvers that exploit this property to compute the optimal value function $V^*_t$. While point-based methods scale better in the size of the state space, they are still not practical for our needs, as they do not scale in the size of the normal action space of active perception POMDPs.
While the computational complexity of one iteration of PBVI is linear in the size of the action space $|A|$ of a POMDP, for an active perception POMDP the action space consists of selecting $K$ out of the $N$ available sensors, yielding $|A| = \binom{N}{K}$. For fixed $K$, as the number of sensors $N$ grows, the size of the action space and the computational cost of PBVI grow exponentially with it, making traditional POMDP solvers infeasible for solving active perception POMDPs.
In this section, we propose greedy PBVI, a new point-based planner for solving active perception POMDPs that scales much better in the size of the action space. To facilitate the explanation of greedy PBVI, we first present the final step of PBVI, described earlier, in a different way. For each $b \in B$ and $a \in A$, we must find the best $\alpha^a_b \in \Gamma^a_t$:
$$\alpha^{a,*}_b = \operatorname*{argmax}_{\alpha^a_b \in \Gamma^a_t} \sum_s b(s)\, \alpha^a_b(s),$$
and simultaneously record its value $Q(b, a) = \sum_s \alpha^{a,*}_b(s)\, b(s)$. Then, for each $b$, we find the best vector across all actions:
$$\alpha_b = \alpha^{a^*_b,*}_b, \quad \text{where } a^*_b = \operatorname*{argmax}_{a \in A} Q(b, a).$$
The main idea of greedy PBVI is to exploit greedy maximization, an algorithm that operates on a set function $Q: 2^X \to \mathbb{R}$. Greedy maximization is much faster than full maximization, as it avoids iterating over all $\binom{N}{K}$ choices and instead constructs a subset of $K$ elements iteratively. Thus, we replace the maximization operator in the Bellman optimality equation with greedy maximization. Algorithm 1 shows the argmax variant, which constructs a subset $Y \subseteq X$ of size $K$ by iteratively adding elements of $X$ to $Y$. At each iteration, it adds the element $e$ that maximizes the marginal gain $\Delta_Q(e|a)$ of adding $e$ to a subset of sensors $a$:
$$\Delta_Q(e|a) = Q(a \cup \{e\}) - Q(a).$$
To exploit greedy maximization in PBVI, we need to replace an argmax over $A$ with greedy-argmax.
Algorithm 1 greedy-argmax(Q, X, K). Our alternative description of PBVI above makes this straightforward: the final step contains such an argmax, and $Q(b, \cdot)$ has been intentionally formulated as a set function over $A^+$. Thus, implementing greedy PBVI requires only replacing that argmax with:
$$a^G_b = \text{greedy-argmax}(Q(b, \cdot), A^+, K).$$
Since the complexity of greedy-argmax is only $O(NK)$, the complexity of greedy PBVI for computing $\Gamma^a_t$ is only $O(|S||B|NK|\Gamma_{t-1}|)$, as compared to $O\big(|S||B|\binom{N}{K}\big)$ for traditional PBVI. Using point-based methods as a starting point is essential to our approach. Algorithms like Monahan's enumeration algorithm (Monahan, 1982) that rely on pruning operations to compute $V^*$, instead of performing an explicit argmax, cannot directly use greedy-argmax. Thus, it is precisely because PBVI operates on a finite set of beliefs that an explicit argmax is performed, opening the door to using greedy-argmax instead.
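A minimal rendering (ours) of the greedy-argmax routine described above; a sensor subset is represented as a frozenset and Q is any set function, e.g., the action-value $Q(b,\cdot)$ at a fixed belief:

```python
def greedy_argmax(Q, X, K):
    """Greedily build a subset Y of X with |Y| = K by repeatedly adding the
    element with the largest marginal gain Delta_Q(e | Y) = Q(Y + e) - Q(Y).

    Q: set function mapping a frozenset of sensors to a value;
    X: iterable of candidate sensors (the ground set A+)."""
    Y = frozenset()
    for _ in range(K):
        remaining = [e for e in X if e not in Y]
        e_best = max(remaining, key=lambda e: Q(Y | {e}) - Q(Y))
        Y = Y | {e_best}
    return Y

# Usage with a hypothetical action-value function at a fixed belief b:
# a_G = greedy_argmax(lambda a: Q_b(a), all_sensors, K)
```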

Bounds given submodular value function
In the following subsections, we present the highlights of the theoretical guarantees associated with greedy PBVI; the detailed analysis can be found in the appendix. Specifically, we show that a value function computed by greedy PBVI is guaranteed to have bounded error with respect to the optimal value function under submodularity, a property of set functions that formalizes the notion of diminishing returns. We then establish the conditions under which the value function of a POMDP is guaranteed to be submodular. To establish the submodularity of the value function, we define $\rho(b)$ as the negative belief entropy, $\rho(b) = -H_b(s)$. Both ρPOMDP and POMDP-IR approximate $\rho(b)$ with tangents. Thus, in the last subsection, we show that even if the belief entropy is approximated using tangents, the value function computed by greedy PBVI is guaranteed to have bounded error with respect to the optimal value function.
Submodularity is a property of set functions that corresponds to diminishing returns, i.e., adding an element to a set increases the value of the set function by a smaller or equal amount than adding that same element to a subset. In our notation, this is formalized as follows. Given a policy $\pi$, the set function $Q^\pi_t(b, a)$ is submodular in $a$ if, for every $a_M \subseteq a_N \subseteq A^+$ and $a_e \in A^+ \setminus a_N$,
$$Q^\pi_t(b, a_M \cup \{a_e\}) - Q^\pi_t(b, a_M) \;\geq\; Q^\pi_t(b, a_N \cup \{a_e\}) - Q^\pi_t(b, a_N).$$
Equivalently,
$$Q^\pi_t(b, a_M \cup a_N) + Q^\pi_t(b, a_M \cap a_N) \;\leq\; Q^\pi_t(b, a_M) + Q^\pi_t(b, a_N).$$
Submodularity is an important property because of the following result:
Theorem 3 (Nemhauser et al, 1978) If $Q^\pi_t(b, a)$ is non-negative, monotone and submodular in $a$, then for all $b$,
$$Q^\pi_t(b, a^G) \;\geq\; (1 - e^{-1})\, Q^\pi_t(b, a^*),$$
where $a^G$ = greedy-argmax($Q^\pi_t(b, \cdot), A^+, K$) and $a^* = \operatorname*{argmax}_{a \in A} Q^\pi_t(b, a)$.
Theorem 3 gives a bound only for a single application of greedy-argmax, not for applying it within each backup, as greedy PBVI does. In this subsection, we establish such a bound. Let the greedy Bellman operator $\mathfrak{B}^G$ be:
$$(\mathfrak{B}^G V)(b) = \max^G_{a} \Big[ \rho(b, a) + \sum_{z \in \Omega} \Pr(z|a, b)\, V(b^{a, z}) \Big],$$
where $\max^G_a$ refers to greedy maximization. This immediately implies the following corollary to Theorem 3:
Corollary 1 Given any policy $\pi$, if $Q^\pi_t(b, a)$ is non-negative, monotone, and submodular in $a$, then for all $b$,
$$(\mathfrak{B}^G V^\pi_{t-1})(b) \;\geq\; (1 - e^{-1})\, (\mathfrak{B}^* V^\pi_{t-1})(b).$$
Next, we define the greedy Bellman equation: $V^G_t(b) = (\mathfrak{B}^G V^G_{t-1})(b)$, where $V^G_t$ is the true value function obtained by greedy maximization, without any point-based approximations. Using Corollary 1, we can bound the error of $V^G$ with respect to $V^*$.
Theorem 4 If, for all policies $\pi$, $Q^\pi_t(b, a)$ is non-negative, monotone and submodular in $a$, then for all $b$,
$$V^G_t(b) \;\geq\; (1 - e^{-1})^t\, V^*_t(b).$$
Proof See Appendix.
Theorem 4 extends Nemhauser's result to a full sequential decision-making setting in which multiple applications of greedy maximization are employed over multiple time steps. This theorem gives a theoretical guarantee on the performance of greedy PBVI: given a POMDP with a submodular value function, greedy PBVI is guaranteed to have bounded error with respect to the optimal value function. Moreover, this performance comes at a computational cost that is much less than that of solving the same POMDP with traditional solvers. Thus, greedy PBVI scales much better in the size of the action space of active perception POMDPs, while still retaining bounded error.
The results presented in this subsection are applicable only if the value function of the POMDP is submodular. In the following subsections, we establish the submodularity of the value function of an active perception POMDP under certain conditions.

Submodularity of value functions
The previous subsection showed that the value function computed by greedy PBVI is guaranteed to have bounded error as long as it is non-negative, monotone and submodular. In this subsection, we establish sufficient conditions for these properties to hold. Specifically, we show that, if the belief-based reward is the negative entropy, i.e., $\rho(b) = -H_b(s) + \log(|S|)$, then under certain conditions $Q^\pi_t(b, a)$ is submodular, non-negative and monotone, as required by Theorem 4. We point out that the second term, $\log(|S|)$, is only required (and sufficient) to guarantee non-negativity, and is independent of the actual beliefs or actions. For conciseness, in the remainder of this paper we omit this term.
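To see why this constant shift suffices (a short check of ours): the negative entropy is smallest at the uniform belief and largest at a deterministic belief, so

```latex
-\log|S| \;=\; \min_b \big({-H_b(s)}\big) \;\le\; -H_b(s) \;\le\; 0,
\qquad\text{hence}\qquad
\rho(b) \;=\; -H_b(s) + \log|S| \;\in\; \big[0,\; \log|S|\big].
```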
We start by observing that $Q^\pi_t(b, a)$ can be written in terms of the expected immediate reward with $k$ steps to go, conditioned on the belief and action with $t$ steps to go and assuming policy $\pi$ is followed after timestep $t$:
$$\mathbb{E}\big[\rho(b_k) \mid b_t, a_t, \pi\big] = \sum_{z_{t:k}} \Pr(z_{t:k} \mid b_t, a_t, \pi)\, \rho(b_k),$$
where $z_{t:k}$ is a vector of observations received in the interval from $t$ steps to go to $k$ steps to go, $b_t$ is the belief at $t$ steps to go, $a_t$ is the action taken at $t$ steps to go, and $\rho(b_k) = -H_{b_k}(s_k)$, where $s_k$ is the state at $k$ steps to go. To show that $Q^\pi_t(b, a)$ is submodular, the main condition is conditional independence, as defined below:
Definition 5 The observation set $z$ is conditionally independent given $s$ if any pair of observation features are conditionally independent given the state, i.e.,
$$\Pr(z_i, z_j | s) = \Pr(z_i | s)\, \Pr(z_j | s), \quad \forall z_i, z_j \in z.$$
Using the above definition, the submodularity of $Q^\pi_t(b, a)$ can be established as follows:
Theorem 5 If $z_{t:k}$ is conditionally independent given $s_k$ and $\rho(b) = -H_b(s)$, then $Q^\pi_t(b, a)$ is submodular in $a$, for all $\pi$.
Proof See Appendix.
Theorem 6 If $z_{t:k}$ is conditionally independent given $s_k$, $V^\pi_t$ is convex over the belief space for all $t, \pi$, and $\rho(b) = -H_b(s)$, then for all $b$,
$$V^G_t(b) \;\geq\; (1 - e^{-1})^t\, V^*_t(b).$$
Proof See Appendix.
In this subsection we showed that if the immediate belief-based reward ρ(b) is defined as negative belief entropy, then the value function of an active perception POMDP is guaranteed to be submodular under certain conditions. However, as mentioned earlier, to solve active perception POMDP, we approximate the belief entropy with vector tangents. This might interfere with the submodularity of the value function. In the next subsection, we show that, even though the PWLC approximation of belief entropy might interfere with the submodularity of the value function, the value function computed by greedy PBVI is still guaranteed to have bounded error.

Bounds given approximated belief entropy
While Theorem 6 bounds the error in $V^G_t(b)$, it does so only under the condition that $\rho(b) = -H_b(s)$. However, as discussed earlier, our definition of active perception POMDPs instead defines ρ using a set of vectors $\Gamma_\rho = \{\alpha^1_\rho, \dots, \alpha^m_\rho\}$, each of which is a tangent to $-H_b(s)$, as suggested by Araya-López et al (2010), in order to preserve the PWLC property. While this can interfere with the submodularity of $Q^\pi_t(b, a)$, here we show that the error generated by this approximation is still bounded.
Let $\tilde{\rho}(b)$ denote the PWLC approximation of the negative entropy and $\tilde{V}^*_t$ denote the optimal value function when $\tilde{\rho}$ is used as the belief-based reward, as in an active perception POMDP, i.e.,
$$\tilde{V}^*_t(b) = \max_{a} \Big[ \tilde{\rho}(b) + \sum_{z \in \Omega} \Pr(z|a, b)\, \tilde{V}^*_{t-1}(b^{a, z}) \Big].$$
Araya-López et al (2010) showed that, if $\rho(b)$ verifies the α-Hölder condition (Gilbarg and Trudinger, 2001), a generalization of the Lipschitz condition, then the difference between $V^*_t$ and $\tilde{V}^*_t$ is bounded by a quantity that depends on $C$ and $\delta$, where $V^*_t$ is the optimal value function with $\rho(b) = -H_b(s)$, $\delta$ is the density of the set of belief points at which tangents are drawn to the belief entropy, and $C$ is a constant.
Let $\tilde{V}^G_t(b)$ be the value function computed by greedy PBVI when the immediate belief-based reward is $\tilde{\rho}(b)$:
$$\tilde{V}^G_t(b) = \max^G_{a} \Big[ \tilde{\rho}(b) + \sum_{z \in \Omega} \Pr(z|a, b)\, \tilde{V}^G_{t-1}(b^{a, z}) \Big].$$
Then the error between $\tilde{V}^G_t(b)$ and $V^*_t(b)$ is bounded, as stated in the following theorem.

Theorem 7 For all beliefs $b$, the error between $\tilde{V}^G_t(b)$ and $V^*_t(b)$ is bounded if $\tilde{V}^\pi_t$ is convex in the belief space for all $\pi, t$, and if $z_{t:k}$ is conditionally independent given $s_k$.
Proof See Appendix.
In this subsection we showed that if the negative entropy is approximated using tangent vectors, greedy PBVI still computes a value function that has bounded error. In the next subsection we outline how greedy PBVI can be extended to general active perception tasks.

General Active Perception POMDPs
The results presented in this section apply to the active perception POMDP in which the evolution of the state over time is independent of the actions of the agent. Here, we outline how these results can be extended to general active perception POMDPs without many changes. The main application for such an extension is in tasks involving a mobile robot coordinating with sensors to intelligently take actions to perceive its environment. In such cases, the robot's actions, by causing it to move, can change the state of the world.
The algorithms we proposed can be extended to such settings by making small modifications to the greedy maximization operator. The greedy algorithm can be run for $K + 1$ iterations, and in each iteration the algorithm would choose to add either a sensor (only if fewer than $K$ sensors have been selected) or a movement action (if none has been selected so far). Formally, using the work of Fisher et al (1978), which extends that of Nemhauser et al (1978) on submodularity to combinatorial structures such as matroids, the action space of a POMDP involving a mobile robot can be modeled as a partition matroid, and greedy maximization subject to matroid constraints can be used to maximize the value function approximately.
The guarantees associated with greedy maximization subject to matroid constraints can then be used to bound the error of greedy PBVI. However, deriving exact theoretical guarantees for greedy PBVI for such tasks is beyond the scope of this article. Assuming that the reward function is still defined as the negative belief entropy, the submodularity of such POMDPs still holds under the conditions mentioned in Section 6.2.
In this section, we presented greedy PBVI, which uses greedy maximization to improve the scalability in the action space of an active perception POMDP. We also showed that, if the value function of an active perception POMDP is submodular, then greedy PBVI computes a value function that is guaranteed to have bounded error. We established that if the belief-based reward is defined as the negative belief entropy, then the value function of an active perception POMDP is guaranteed to be submodular. We showed that if the negative belief entropy is approximated by tangent vectors, as is required to solve active perception POMDPs efficiently, greedy PBVI still computes a value function that has bounded error. Finally, we outlined how greedy PBVI and the associated theoretical bounds can be extended to general active perception POMDPs.

Experiments
In this section, we present an analysis of the behavior and performance of belief-based rewards for active perception tasks, which are the main motivation of our work. We present the results of experiments designed to study the effect of the choice of prediction actions/tangents on performance, and compare the costs and benefits of myopic versus non-myopic planning. We consider the task of tracking people in a surveillance area with a multi-camera tracking system. The goal of the system is to select a subset of cameras so as to correctly predict the position of people in the surveillance area, based on the observations received from the selected cameras. In the following subsections, we present results on real data collected from a multi-camera system in a shopping mall, and experiments comparing the performance of greedy PBVI to that of PBVI.
We compare the performance of POMDP-IR with decomposed maximization to a naive POMDP-IR that does not decompose the maximization. Thanks to Theorems 1 and 2, these approaches have performance equivalent to their ρPOMDP counterparts. We also compare against two baselines. The first is a weak baseline we call the rotate policy, in which the agent simply keeps switching between cameras on a turn-by-turn basis. The second is a stronger baseline we call the coverage policy, which was developed in earlier work on active perception (Spaan, 2008; Spaan and Lima, 2009). The coverage policy is obtained by solving a POMDP that rewards the agent for observing the person, i.e., the agent is encouraged to select the cameras that are most likely to generate positive observations. Thanks to the decomposed maximization, the computational cost of solving for the coverage policy and for belief-based rewards is the same.

Fig. 5 We model this task as a POMDP with one state for each cell. Thus the person can move among $|S|$ cells. Each cell is adjacent to two other cells and each cell is monitored by a single camera. Thus, in this case there are $N = |S|$ cameras. At each time step, the person can stay in the same cell as in the previous time step with probability $p$ or move to one of the neighboring cells with equal probability. The agent must select $K$ out of $N$ cameras, and the task is to predict the state of the person correctly using noisy observations from the $K$ selected cameras. There is one prediction action for each state, and the agent gets a reward of +1 if it correctly predicts the state and 0 otherwise. An observation is a vector of $N$ observation features, each of which specifies the person's position as estimated by the given camera. If a camera is not selected, then the corresponding observation feature has a value of null.
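The simulated model just described can be written down directly. The sketch below (ours) builds the ring-shaped transition matrix with stay probability p and a simple per-camera observation model; the 0.75 detection accuracy is a placeholder parameter, not a value taken from the simulated setting:

```python
import numpy as np

def ring_transition_matrix(num_cells, p_stay):
    """Person stays in her cell with probability p_stay, otherwise moves to one
    of the two neighboring cells (ring topology) with equal probability."""
    T = np.zeros((num_cells, num_cells))
    for s in range(num_cells):
        T[s, s] = p_stay
        T[s, (s - 1) % num_cells] = (1.0 - p_stay) / 2.0
        T[s, (s + 1) % num_cells] = (1.0 - p_stay) / 2.0
    return T

def camera_observation(state, selected_cameras, num_cells=10, accuracy=0.75, rng=None):
    """Observation vector with one feature per camera; unselected cameras emit None.

    Each selected camera reports the person's true cell with probability `accuracy`
    (placeholder value) and a uniformly random cell otherwise."""
    rng = np.random.default_rng() if rng is None else rng
    obs = [None] * num_cells
    for cam in selected_cameras:
        obs[cam] = state if rng.random() < accuracy else int(rng.integers(num_cells))
    return obs
```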

Simulated Setting
We start with experiments conducted in a simulated setting, first considering the task of tracking a single person with a multi-camera system and then considering the more challenging task of tracking multiple people.

Single-Person Tracking
We start by considering the task of tracking one person walking in a grid-world composed of $|S|$ cells and $N$ cameras, as shown in Figure 5. At each timestep, the agent can select only $K$ cameras, where $K \leq N$. Each selected camera generates a noisy observation of the person's location. The agent's goal is to minimize its uncertainty about the person's state. In the experiments in this section, we fixed $K = 1$ and $N = 10$. The problem setup and the POMDP model are shown and described in Figure 5.
To compare the performance of POMDP-IR to the baselines, 100 trajectories were simulated from the POMDP. The agent was asked to guess the person's position at each time step. Figure 6(a) shows the cumulative reward collected by all four methods. POMDP-IR with decomposed maximization and naive POMDP-IR perform identically, as the lines indicating their respective performance lie on top of each other in Figure 6(a). However, Figure 6(b), which compares the runtimes of POMDP-IR with decomposed maximization and naive POMDP-IR, shows that decomposed maximization yields large computational savings. Figure 6(a) also shows that POMDP-IR greatly outperforms the rotate policy and modestly outperforms the coverage policy.
Figures 6(c) and 6(d) illustrate the qualitative difference between POMDP-IR and the coverage policy. The blue lines mark the points in the trajectory at which the agent selected the camera that observes the person's location; if the agent selected a camera that does not cover the person's location, there is no blue vertical line at that point in the trajectory. The agent has to select one of the $N$ cameras and does not have the option of not selecting any camera. The red line plots the maximum of the agent's belief. The main difference between the two policies is that, once POMDP-IR gets a good estimate of the state, it proactively observes neighboring cells to which the person might transition. This helps it to more quickly find the person when she moves. By contrast, the coverage policy always looks at the cell where it believes her to be. Hence, it takes longer to find her again when she moves. This is evidenced by the fluctuations in the maximum of the belief, which often drops below 0.5 for the coverage policy but rarely does so for POMDP-IR. The effect of false positives and false negatives is also visible in the figure: the maximum of the belief sometimes drops even though the agent selected the camera that observes the person's location, and in some cases it rises even though the agent did not select that camera.
Next, we examine the effect of approximating a true reward function like belief entropy with more and more tangents. Figure 3 illustrates how adding more tangents can better approximate negative belief entropy. To test the effects of this, we measured the cumulative reward when using between one and four tangents per state. Figure 7 shows the results and demonstrates that, as more tangents are added, the performance improves. However, performance also quickly saturates, as four tangents perform no better than three.
Next, we compare the performance of POMDP-IR to a myopic variant that seeks only to maximize immediate reward, i.e., $h = 1$. We perform this comparison in three variants of the task. In the highly static variant, the state changes very slowly: the probability of staying in the same state is 0.9. In the moderately dynamic variant, the state changes more frequently, with a same-state transition probability of 0.7. In the highly dynamic variant, the state changes rapidly, with a same-state transition probability of 0.5. Figure 8 (top) shows the results of these comparisons. In each setting, non-myopic POMDP-IR outperforms myopic POMDP-IR. In the highly static variant, the difference is marginal. However, as the task becomes more dynamic, the importance of look-ahead planning grows. Because the myopic planner focuses only on immediate reward, it ignores what might happen to its belief when the state changes, which happens more often in dynamic settings.
We also compare the performance of myopic and non-myopic planning in a budget-constrained environment. This corresponds to an energy-constrained setting in which the cameras can be employed only a few times over the entire trajectory. This is augmented with resource constraints, so that the agent has to plan not only when to use the cameras, but also which camera to select. Specifically, the agent can employ the multi-camera system a total of only 15 times across all 50 timesteps, and at each of those 15 instances it can select which camera (out of the multi-camera system) to employ. On the other timesteps, it must select an action that generates only a null observation. Figure 8 (bottom) shows that non-myopic planning is of critical importance in this setting. Whereas myopic planning greedily consumes the budget as quickly as possible, thus earning more reward in the beginning, non-myopic planning saves the budget for situations in which it is highly uncertain about the state.
Finally, we compare the performance of myopic and non-myopic planning when the multi-camera system can communicate with a mobile robot that also has sensors. This setting is typical of a networked robot system (Spaan et al, 2010) in which a robot coordinates with a multi-camera system to perform surveillance of a building, detect emergency situations such as fire, or help people navigate to their destination. Here, the task is to minimize uncertainty about the location of one person who is moving in the space monitored by the robot and the cameras. The robot's sensors are assumed to be more accurate than the stationary cameras. Specifically, the sensors attached to the robot can detect whether a person is in the robot's current cell with 90% accuracy, compared to the stationary cameras, each of which detects a person in the cell it observes with 75% accuracy. The robot's sensor can observe the presence or absence of a person only in the cell that the robot occupies. In addition to using its sensors to generate observations about its current cell, the robot can also move forward or backward to an adjacent cell or choose to stay in the current cell. To model this task, the action vector introduced earlier is augmented with another action feature that indicates the direction of the robot's motion, which can take three values: forward, backward or stay.

Fig. 9 Performance comparison of myopic vs. non-myopic policies when the camera system is assisting a mobile robot.
Performance is quantified as the total number of times the correct location of the person is predicted by the system. Figure 9, which shows the performance of myopic and non-myopic policies for this task, demonstrates that when planning non-myopically the agent is able to utilize the accurate sensors more effectively than when planning myopically.

Multi-Person Tracking
To extend our analysis to a more challenging problem, we consider a simulated setting in which multiple people must be tracked simultaneously. Since $|S|$ grows exponentially in the number of people, the resulting POMDP quickly becomes intractable. Therefore, we instead compute a factored value function
$$V_t(b) = \sum_i V^i_t(b^i),$$
where $V^i_t(b^i)$ is the value of the agent's current belief $b^i$ about the $i$-th person. Thus, $V^i_t(b^i)$ needs to be computed only once, by solving a POMDP of the same size as that in the single-person setting. During action selection, $V_t(b)$ is computed using the current $b^i$ for each person. This kind of factorization corresponds to the assumption that each person's movements and observations are independent of those of the other people. Although violated in practice, such an assumption can nonetheless yield good approximations (a small sketch of this factored action selection is given at the end of this subsection).

Finally, we compare POMDP-IR and the coverage policy in a setting in which the goal is only to reduce uncertainty about a set of "important cells" that form a subset of the whole state space. For POMDP-IR, we prune the set of prediction actions to allow predictions only about important cells. For the coverage policy, we reward the agent only for observing people in important cells. The results, shown in Figure 10 (bottom), demonstrate that the advantage of POMDP-IR over the coverage policy is even larger in this variant of the task. POMDP-IR makes use of information coming from cells that neighbor the important cells (which is of critical importance if the important cells do not have good observability), while the coverage policy does not. As before, the difference grows as the number of people increases.
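Returning to the factored value function above, the following sketch (ours) shows one way such factored action selection could be realized, assuming the single-person solution is available as one alpha-vector set per candidate camera-selection action:

```python
import numpy as np

def factored_action_value(beliefs, Gamma_per_action):
    """Q(b, a) = sum over persons i of max_{alpha in Gamma_a} b_i . alpha.

    beliefs: list of per-person belief vectors b_i;
    Gamma_per_action: dict mapping each candidate action a to its alpha-vector
    set from the single-person POMDP solution (an assumed representation)."""
    return {a: sum(max(b @ alpha for alpha in alphas) for b in beliefs)
            for a, alphas in Gamma_per_action.items()}

def select_action(beliefs, Gamma_per_action):
    """Pick the action with the highest factored value estimate."""
    values = factored_action_value(beliefs, Gamma_per_action)
    return max(values, key=values.get)
```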

Real Data
Finally, we extended our analysis to a real-life dataset collected in a shopping mall. This dataset was gathered over 4 hours using 13 CCTV cameras located in a shopping mall (Bouma et al, 2013). Each camera uses an FPDW (Dollar et al, 2010) pedestrian detector to detect people in each camera image and in-camera tracking (Bouma et al, 2013) to generate tracks of the detected people's movements over time.
The dataset consists of 9915 tracks, each specifying one person's x-y position over time. Figure 11 shows sample tracks from all of the cameras.
To learn a POMDP model from the dataset, we divided the continuous space into 20 cells (|S| = 21: 20 cells plus an external state indicating that the person has left the shopping mall). Using the data, we learned a maximum-likelihood tabular transition function. However, we did not have access to the ground truth for the observed tracks, so we constructed it using the overlapping regions of the cameras.
Because the cameras have many overlapping regions (see Figure 11), we were able to manually match tracks of the same person recorded individually by each camera. The "ground truth" was then constructed by taking a weighted mean of the matched tracks. Finally, this ground truth was used to estimate noise parameters for each cell (assuming zero-mean Gaussian noise), which were used to define the observation function. Figure 12 shows that, as before, POMDP-IR substantially outperforms the coverage policy for various numbers of cameras. In addition to the reasons mentioned before, the high overlap between the cameras contributes to POMDP-IR's superior performance. The coverage policy has difficulty ascertaining people's exact locations because it is rewarded only for observing them somewhere in a camera's large overlapping region, whereas POMDP-IR is rewarded for deducing their exact locations.
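The following sketch (ours; variable names, the smoothing constant, and the assumption that tracks are already discretized into cell indices are all illustrative) shows one way the tabular transition model and the per-cell noise estimates could be computed:

```python
import numpy as np

def learn_transition_model(tracks, n_states=21):
    """Maximum-likelihood tabular transition model from discretized tracks
    (cell indices 0..20, with 20 denoting 'person has left the mall')."""
    counts = np.full((n_states, n_states), 1e-6)   # tiny prior avoids zero rows
    for track in tracks:
        for s, s_next in zip(track[:-1], track[1:]):
            counts[s, s_next] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def estimate_observation_noise(ground_truth_xy, detections_xy):
    """Zero-mean Gaussian noise estimate from matched tracks: the standard
    deviation of the position error, fitted separately for x and y."""
    errors = np.asarray(detections_xy) - np.asarray(ground_truth_xy)
    return np.sqrt(np.mean(errors ** 2, axis=0))   # [sigma_x, sigma_y]
```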

Greedy PBVI
To empirically evaluate greedy PBVI, we tested it on the problem of tracking either one or multiple people using a multi-camera system. The reward function is described as a set of $|S|$ vectors, $\Gamma_\rho = \{\alpha^1, \dots, \alpha^{|S|}\}$, with $\alpha^i(s) = 1$ if $s = i$ and $\alpha^i(s) = 0$ otherwise. The initial belief is uniform across all states. We planned for horizon h = 10 with γ = 0.99.
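This reward set is simply the rows of an |S| × |S| identity matrix, so that ρ(b) is the largest component of the belief; a small sketch (ours) constructs it and evaluates ρ(b) for the uniform initial belief:

```python
import numpy as np

def belief_reward_vectors(n_states):
    """One alpha-vector per state, so that
    rho(b) = max_i sum_s b(s) * alpha^i(s) = max_s b(s)."""
    return [np.eye(n_states)[i] for i in range(n_states)]

gamma_rho = belief_reward_vectors(21)
b = np.full(21, 1 / 21)                       # uniform initial belief
print(max(alpha @ b for alpha in gamma_rho))  # 1/21 for the uniform belief
```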
As baselines, we tested against regular PBVI and myopic versions of both greedy and regular PBVI that compute a policy assuming h = 1 and use it at each timestep. Figure 13 shows runtimes under different values of N and K. Since multi-person tracking uses the value function obtained by solving a single-person POMDP, single- and multi-person tracking have the same runtimes. These results demonstrate that greedy PBVI requires only a fraction of the computational cost of regular PBVI. In addition, the difference in runtime grows quickly as the action space gets larger: for N = 5 and K = 2 greedy PBVI is twice as fast, while for N = 11, K = 3 it is approximately nine times as fast. Thus, greedy PBVI enables much better scalability in the action space. Figure 14, which shows the cumulative reward under different values of N and K for single-person (top) and multi-person (bottom) tracking, verifies that greedy PBVI's speedup does not come at the expense of performance, as greedy PBVI accumulates nearly as much reward as regular PBVI. These results also show that both PBVI and greedy PBVI benefit from non-myopic planning. While the performance advantage of non-myopic planning is relatively modest, it increases with the number of cameras and people, which suggests that non-myopic planning is important to making active perception scalable.
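The speedup comes from replacing the exhaustive maximization over all C(N, K) sensor subsets with greedy maximization; a minimal sketch (ours; `marginal_value` is a hypothetical stand-in for the Q-value improvement computed from the current alpha-vectors) is:

```python
def greedy_subset(sensors, k, marginal_value):
    """Greedily build a K-subset by repeatedly adding the sensor with the
    largest marginal value, instead of scoring every C(N, K) subset."""
    chosen = []
    remaining = list(sensors)
    for _ in range(k):
        best = max(remaining, key=lambda s: marginal_value(chosen, s))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Roughly N*K evaluations per belief point instead of C(N, K):
# about 30 vs. 165 for N = 11, K = 3.
```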
Furthermore, an analysis of the resulting policies showed that myopic and non-myopic policies differ qualitatively. A myopic policy, in order to minimize uncertainty in the next step, tends to look where it believes the person to be. By contrast, a non-myopic policy tends to proactively look where the person might go next, so as to more quickly detect her new location when she moves. Consequently, non-myopic policies exhibit less fluctuation in belief and accumulate more reward, as illustrated in Figure 15. The blue lines mark when the agent chooses the camera that can observe the cell occupied by the person. The red line plots the max of the agent's belief. The difference in fluctuation in belief is evident, as the max of the belief often drops below 0.5 for the myopic policy but rarely does so for the non-myopic policy.

Discussion & Conclusions
In this article, we addressed the problem of active perception, in which an agent must take actions to reduce uncertainty about a hidden variable while reasoning about various constraints. Specifically, we modeled the task of surveillance with multi-camera tracking systems in large urban spaces as an active perception task. Since the state of the environment is dynamic, we modeled this task as a POMDP to compute closed-loop non-myopic policies that can reason about the long-term consequences of selecting a subset of sensors.
Formulating uncertainty reduction as an end in itself is a challenging task, as it breaks the PWLC property of the value function, which is imperative for solving POMDPs efficiently. ρPOMDP and POMDP-IR are two frameworks that allow formulating uncertainty reduction as an end in itself without breaking the PWLC property.
We showed that ρPOMDP and POMDP-IR are two equivalent frameworks for modeling active perception tasks. Thus, results that apply to one framework are also applicable to the other. While ρPOMDP does not restrict ρ to be a PWLC function, in this work we restrict our attention to the case where ρ is approximated with a PWLC function, as it is not feasible to efficiently solve a ρPOMDP whose ρ is not PWLC.
We model the action space of the active perception POMDP as selecting K out of N sensors, where K is the maximum number of sensors allowed by the resource constraints. Recent POMDP solvers enable scalability in the state space; however, for active perception, as the number of sensors grows, the action space grows exponentially. We proposed greedy PBVI, a POMDP planning method that improves scalability in the action space of a POMDP. While we do not directly address scaling in the observation space, we believe recent ideas on factorization of the observation space (Veiga et al, 2014) can be combined with our approach to improve scalability in the state, action, and observation spaces for solving active perception POMDPs.
By leveraging the theory of submodularity, we showed that the value function computed by greedy PBVI is guaranteed to have bounded error. Specifically, we extended Nemhauser's result on greedy maximization of submodular functions to long-term planning. To apply these results to the active perception task, we showed that under certain conditions the value function of an active perception POMDP is submodular. One such condition requires that future observations be conditionally independent of each other given the state. While this is a strong condition, it is only a sufficient condition and may not be a necessary one. Thus, one line of future work is to attempt to relax this condition when proving the submodularity of the value function. Finally, we showed that, even with a PWLC approximation to the true value function, which is submodular, the error in the value function computed by greedy PBVI remains bounded, thus enabling us to efficiently compute value functions for active perception POMDPs.
Greedy PBVI is ideally suited for active perception POMDPs whose value function is submodular. However, in real-life situations the submodularity of the value function might not always hold. For example, in our setting, when there is occlusion, combinations of sensors may yield higher utility when selected together than the sum of their utilities when selected individually. A similar case can arise when a mobile robot is trying to find the best viewpoint from which to observe an occluded scene. In such cases, greedy PBVI might not return the best solution.
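A tiny numerical illustration (the utility values are made up purely for exposition) of how occlusion can violate the diminishing-returns property:

```python
# Each camera alone sees little of the occluded target, but together
# they disambiguate it, so the marginal gain of camera 2 is larger
# given camera 1 than given nothing: submodularity fails.
f = {frozenset(): 0.0,
     frozenset({1}): 0.1,
     frozenset({2}): 0.1,
     frozenset({1, 2}): 0.5}

gain_2_given_empty = f[frozenset({2})] - f[frozenset()]       # 0.1
gain_2_given_1 = f[frozenset({1, 2})] - f[frozenset({1})]     # 0.4
print(gain_2_given_1 > gain_2_given_empty)  # True => diminishing returns violated
```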
Our empirical analysis established the critical factors involved in the performance of active perception tasks. We showed that a belief-based formulation of uncertainty reduction beats a corresponding popular state-based reward baseline as well as other simple policies. While the non-myopic policy beats the myopic one, in certain cases the gain is marginal; however, in cases involving mobile sensors and budget constraints, non-myopic policies become critically important. Finally, experiments on a real-world dataset showed that the performance of greedy PBVI is similar to that of existing methods but requires only a fraction of the computational cost, leading to much better scalability for solving active perception tasks.

Results from Section 4
Theorem 1 Let $M_\rho$ be a ρPOMDP and $\pi_\rho$ an arbitrary policy for $M_\rho$. Furthermore, let $M_{IR}$ = reduce-pomdp-ρ-IR($M_\rho$) and $\pi_{IR}$ = reduce-policy-ρ-IR($\pi_\rho$). Then, for all $b$,
$$V_t^{IR}(b) = V_t^\rho(b),$$
where $V_t^{IR}$ is the t-step value function for $\pi_{IR}$ and $V_t^\rho$ is the t-step value function for $\pi_\rho$.
Proof By induction on t. For the base case, we observe that, from the definition of ρ(b),
$$\rho(b) = \max_{\alpha_\rho^{a_p} \in \Gamma_\rho} \sum_s b(s)\,\alpha_\rho^{a_p}(s).$$
Since $M_{IR}$ has a prediction action corresponding to each $\alpha_\rho^{a_p}$, the $a_p$ corresponding to $\alpha = \arg\max_{\alpha_\rho^{a_p} \in \Gamma_\rho} \sum_s b(s)\,\alpha_\rho^{a_p}(s)$ must also maximize $\sum_s b(s) R(s, a_p)$. Then,
$$\rho(b) = \max_{a_p} \sum_s b(s) R(s, a_p), \qquad (31)$$
and the base case follows: $V_0^{IR}(b) = \max_{a_p} \sum_s b(s) R(s, a_p) = \rho(b) = V_0^\rho(b)$.
For the inductive step, we assume that $V_{t-1}^{IR}(b) = V_{t-1}^\rho(b)$ and must show that $V_t^{IR}(b) = V_t^\rho(b)$, where
$$V_t^{IR}(b) = \max_{a_p} \sum_s b(s) R(s, a_p) + \sum_z \Pr(z \mid b, \pi_{IR}^n(b))\, V_{t-1}^{IR}(b^{\pi_{IR}^n(b), z}), \qquad (32)$$
$\pi_{IR}^n(b)$ denotes the normal action of the tuple specified by $\pi_{IR}(b)$, and
$$\Pr(z \mid b, \pi_{IR}^n(b)) = \sum_{s'} \sum_s O_{IR}(s', \pi_{IR}^n(b), z)\, T_{IR}(s, \pi_{IR}^n(b), s')\, b(s).$$
Using the reduction procedure, we can replace $T_{IR}$, $O_{IR}$, and $\pi_{IR}^n(b)$ with their ρPOMDP counterparts on the right-hand side of the above equation:
$$\Pr(z \mid b, \pi_{IR}^n(b)) = \sum_{s'} \sum_s O_\rho(s', \pi_\rho(b), z)\, T_\rho(s, \pi_\rho(b), s')\, b(s) = \Pr(z \mid b, \pi_\rho(b)).$$
Similarly, for the belief update equation,
$$b^{\pi_{IR}^n(b), z}(s') = \frac{O_{IR}(s', \pi_{IR}^n(b), z) \sum_s T_{IR}(s, \pi_{IR}^n(b), s')\, b(s)}{\Pr(z \mid b, \pi_{IR}^n(b))} = b^{\pi_\rho(b), z}(s').$$
Substituting the above results in (32) yields:
$$V_t^{IR}(b) = \max_{a_p} \sum_s b(s) R(s, a_p) + \sum_z \Pr(z \mid b, \pi_\rho(b))\, V_{t-1}^{IR}(b^{\pi_\rho(b), z}).$$
Since the inductive assumption tells us that $V_{t-1}^{IR}(b) = V_{t-1}^\rho(b)$ and (31) shows that $\rho(b) = \max_{a_p} \sum_s b(s) R(s, a_p)$:
$$V_t^{IR}(b) = \rho(b) + \sum_z \Pr(z \mid b, \pi_\rho(b))\, V_{t-1}^\rho(b^{\pi_\rho(b), z}) = V_t^\rho(b). \qquad \square$$

Theorem 2 Let $M_{IR}$ be a POMDP-IR and $\pi_{IR} = \langle a_n, a_p \rangle$ a policy for $M_{IR}$, such that $a_p = \arg\max_{a_p} \sum_s b(s) R(s, a_p)$. Furthermore, let $M_\rho$ = reduce-pomdp-IR-ρ($M_{IR}$) and $\pi_\rho$ = reduce-policy-IR-ρ($\pi_{IR}$). Then, for all $b$,
$$V_t^{IR}(b) = V_t^\rho(b),$$
where $V_t^{IR}$ is the value of following $\pi_{IR}$ in $M_{IR}$ and $V_t^\rho$ is the value of following $\pi_\rho$ in $M_\rho$.
Proof By induction on t. For the base case, we observe that, from the definition of ρ(b) and the assumption that $\pi_{IR}$ selects the maximizing prediction action,
$$\max_{a_p} \sum_s b(s) R(s, a_p) = \rho(b), \qquad (37)$$
so $V_0^\rho(b) = \rho(b) = \max_{a_p} \sum_s b(s) R(s, a_p) = V_0^{IR}(b)$. For the inductive step, we assume that $V_{t-1}^\rho(b) = V_{t-1}^{IR}(b)$ and must show that $V_t^\rho(b) = V_t^{IR}(b)$, where
$$V_t^\rho(b) = \rho(b) + \sum_z \Pr(z \mid b, \pi_\rho(b))\, V_{t-1}^\rho(b^{\pi_\rho(b), z}), \qquad (38)$$
$\pi_{IR}^n(b)$ denotes the normal action of the tuple specified by $\pi_{IR}(b)$, and
$$\Pr(z \mid b, \pi_\rho(b)) = \sum_{s'} \sum_s O_\rho(s', \pi_\rho(b), z)\, T_\rho(s, \pi_\rho(b), s')\, b(s).$$
Similarly, for the belief update equation,
$$b^{\pi_\rho(b), z}(s') = \frac{O_\rho(s', \pi_\rho(b), z) \sum_s T_\rho(s, \pi_\rho(b), s')\, b(s)}{\Pr(z \mid b, \pi_\rho(b))} = b^{\pi_{IR}^n(b), z}(s').$$
Substituting the above result in (38), and using the inductive assumption $V_{t-1}^\rho(b) = V_{t-1}^{IR}(b)$ together with (37), which shows that $\max_{a_p} \sum_s b(s) R(s, a_p) = \rho(b)$, yields:
$$V_t^\rho(b) = \max_{a_p} \sum_s b(s) R(s, a_p) + \sum_z \Pr(z \mid b, \pi_{IR}^n(b))\, V_{t-1}^{IR}(b^{\pi_{IR}^n(b), z}) = V_t^{IR}(b). \qquad \square$$

Results from Subsection 6.1
The following lemmas establish the properties needed to show that the error in the value function remains bounded after application of $\mathcal{B}^G$.
Lemma 2 If $\rho(b) = -H_b(s)$, then the expected reward at each time step equals the negative discounted conditional entropy of $b_k$ over $s_k$ given $z_{t:k}$:
$$G^\pi_k(b_t, a_t) = -\gamma^{t-k}\, H_{b_k}(s_k \mid z_{t:k}, a_t) = -\gamma^{t-k}\, H^{a_t}_{b_k}(s_k \mid z_{t:k}) \quad \forall\, \pi.$$
Proof To prove the above lemma, we introduce some additional notation. First, we elaborate on the definition of $b_k$:
$$b_k(s_k) \triangleq \Pr(s_k \mid b_t, a_t, \pi, z_{t:k}) = \frac{\Pr(z_{t:k}, s_k \mid b_t, a_t, \pi)}{\Pr(z_{t:k} \mid b_t, a_t, \pi)}. \qquad (52)$$
For notational convenience, we also write this as:
$$b_k(s_k) \triangleq \frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})}.$$
The entropy of $b_k$ is thus:
$$H_{b_k}(s_k) = -\sum_{s_k} \frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})} \log\!\left(\frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})}\right),$$
and the conditional entropy of $b_k$ over $s_k$ given $z_{t:k}$ is:
$$H^{a_t}_{b_k}(s_k \mid z_{t:k}) = -\sum_{s_k} \sum_{z_{t:k}} \Pr^\pi_{b_t, a_t}(z_{t:k}, s_k) \log\!\left(\frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})}\right).$$
Then, by definition of $G^\pi_k(b_t, a_t)$,
$$G^\pi_k(b_t, a_t) = \gamma^{t-k} \sum_{z_{t:k}} \Pr^\pi_{b_t, a_t}(z_{t:k})\, \rho(b_k).$$
By definition of entropy,
$$= \gamma^{t-k} \sum_{z_{t:k}} \Pr^\pi_{b_t, a_t}(z_{t:k}) \sum_{s_k} \frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})} \log\!\left(\frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})}\right)$$
$$= \gamma^{t-k} \sum_{z_{t:k}} \sum_{s_k} \Pr^\pi_{b_t, a_t}(z_{t:k}, s_k) \log\!\left(\frac{\Pr^\pi_{b_t, a_t}(z_{t:k}, s_k)}{\Pr^\pi_{b_t, a_t}(z_{t:k})}\right).$$
By definition of conditional entropy,
$$= \gamma^{t-k} \left(-H^{a_t}_{b_k}(s_k \mid z_{t:k})\right). \qquad \square$$
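The key identity used in this calculation, namely that the expectation over observations of the posterior entropy equals the conditional entropy, can also be checked numerically; the following small script (ours, purely a sanity check and not part of the original proof) does so for a random example:

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.dirichlet(np.ones(4))            # prior belief over 4 states
O = rng.dirichlet(np.ones(3), size=4)    # O[s, z] = Pr(z | s)

joint = b[:, None] * O                   # Pr(s, z)
pz = joint.sum(axis=0)                   # Pr(z)
post = joint / pz                        # Pr(s | z), one column per z

expected_posterior_entropy = sum(
    pz[z] * -(post[:, z] * np.log(post[:, z])).sum() for z in range(3))
conditional_entropy = -(joint * np.log(joint / pz)).sum()
print(np.isclose(expected_posterior_entropy, conditional_entropy))  # True
```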
Lemma 3 If $z$ is conditionally independent given $s$, then $-H(s \mid z)$ is submodular in $z$; i.e., for any two observations $z_M$ and $z_N$,