Adaptive spatial discretization using reinforcement learning

A well-known challenge for deformation monitoring is the spatial discretization, i.e. the choice of monitoring points at which measurements are to be taken. Well-chosen monitoring points employ prior knowledge to yield a significant amount of information about a certain aspect of the monitored object. However, the choice of such a set of points is typically made for practical expedience or left to the measurement instrument itself. We aim to derive adaptive discretization strategies that implicitly incorporate domain knowledge about the monitored object via a cycle of interaction and learning. In those strategies, previous measurements impact the locations of subsequent ones. We formulate the choice of monitoring points as a decision-theoretic problem and review the framework of reinforcement learning, which formalizes the problem of deriving optimal sequential decisions under uncertainty. Iterative algorithms produce solution schemes for this optimal control task. We benchmark the performance of reinforcement learning against random, pseudorandom, and numerically designed discretization strategies on several geodetically motivated examples. Advantages, disadvantages, and the practical feasibility of the approach are evaluated; the results reveal a significant boost in the efficiency of the data collection scheme compared to classical approaches.


Motivating spatial discretization
Monitoring of infrastructure and natural phenomena often aims to detect changes that indicate the onset of potentially hazardous situations whose impact has to be controlled. This requires measurement systems able to gather information relevant to the monitored object's risk assessment. The practical execution of geodetic measurement techniques like tacheometry or levelling requires a sequence of choices regarding the locations at which the monitored object M is to be sampled. These choices directly impact the capability of the observations to provide meaningful information. Clearly, the performance of discretization schemes depends on the shape and underlying dynamics of M, measurement uncertainties, and the type of information the observations are supposed to provide. Figure 1 illustrates the challenge of developing such discretization schemes using the example of a three-point bending test, a simple mechanical setup featuring a beam, a load, and displacements to be gauged. An analysis of this system can be found, e.g. in (Dietmar et al. 2011, p. 138). The deformation is known to be piecewise polynomial of degree 3. An effective approach for measuring the maximum deformation consists in successively sampling locations that allow for the determination of the unknown parameters of the polynomial. The inferred functional form of the deformation implies the likely location of the maximum deformation, which can subsequently be sampled.
In general, one may expect optimal discretization schemes to be adaptive in the sense that the best sampling location $x_k$ depends on the previous locations $x_1, \dots, x_{k-1}$ and corresponding observations $y_1, \dots, y_{k-1}$. We will call this the adaptive aspect of the discretization problem and distinguish from it the a priori aspect of the problem, under which we subsume the necessity to impart domain-specific knowledge into the solution.

Fig. 1 The deformation y of a simple beam setup as a function of the spatial variable x. If the location $x_F$ of the force F is random, distributing measurement locations to identify the maximum deformation is nontrivial

State of the art
Pinpointing an exact state of the art proves difficult, as variants of the discretization problem have been treated for different reasons in many disciplines. The optimization and design of geodetic networks as laid out in Grafarend and Sansò (2011) investigates the optimal choice of measurement locations with optimality defined in terms of spectral properties of covariance matrices. The approaches mostly rely on iteratively solving quadratic programming problems. Advances in convex optimization allow constrained minimization of traces, determinants, and eigenvalues of covariance matrices in the form of semidefinite or maxdet programs (see Vandenberghe et al. 1998), with performant open-source solvers becoming more accessible in recent years (Anirudha et al. 2020). This implies that for certain measures of optimality, a solution to the purely a priori aspect of the discretization problem can be found by employing reliable numerical techniques. These formulations minimize the spread of residuals in the presence of a stochastic model and the absence of any data. The situation is similar for discretization strategies reliant on techniques from numerical approximation, in which discretization locations are chosen to guarantee error-free reconstructability for a specific family of functions (see, e.g. Golub and Welsch 1969). The non-adaptive approaches mentioned above are prevalent in geodetic network design (Grafarend and Sansò 2011), the design of geostatistical sampling patterns (Angulo et al. 2005), pharmaceutical experiment design (Wolkowicz et al. 2012), as well as in approximation schemes employed in finite element modelling (Brenner and Scott 2007) and numerical integration (Kuipers and Niederreiter 2012). In contrast, primarily adaptive approaches that act in the presence of data but in the absence of a reliable model can be found in disciplines dealing with black-box optimization (see, e.g. Fu 2014, ch. 11, for an overview of adaptive stochastic search methods). Classical grid-based brute-forcing or (pseudo-)random sampling schemes are model-agnostic and do not change based on observed data. As such, they combine the inability to incorporate prior domain knowledge with the inability to adapt to observed data and may reasonably be estimated to be the weakest class of discretization strategies presented so far. Stochastic optimal control (Makiko 2014) avoids both of these inabilities. To the best of the authors' knowledge, it has not yet been used for adaptive optimal spatial discretization.

Open questions
It remains to investigate how adaptive optimal spatial discretization can be formulated as an optimal control problem and which framework is the right one to account for the implicit nature of the domain specific knowledge. Any algorithm designed to provide optimal discretization schemes would have to be critically evaluated with respect to its actual implementability and practical performance. If possible, we would like to support any claims of optimality with convergence guarantees and measures of uncertainty. In this paper, we investigate some of these questions and derive adaptive discretization schemes capable of incorporating domain knowledge via a training process known as reinforcement learning. The rest of the paper is organized as follows. "Spatial discretization as optimal control" focuses on formalizing the optimal discretization task. Solving this optimization problem requires techniques from reinforcement learning that are subsequently introduced in "RL-based solution". After outlining a policy gradient-based algorithm, we apply it to a collection of problems in "Application and results" and analyze the results. We conclude the paper with discussions and an outlook.

Abstract optimization formulation
Let $M \subset \mathbb{R}^{n_s}$ be a manifold and denote by $y \in \mathbb{R}^{n_o}$ the random observations of a stochastic process on a probability space indexed by elements $x \in M$. Denote by $p_{xy}$ the probability law of the stochastic process specifying the probability density of observing the sequence $y_1, \dots, y_k$ at locations $x_1, \dots, x_k$. Then, the goal of optimal spatial discretization is to derive an optimal policy function $\pi(\cdot, \cdot; p_{xy})$:

$$\pi(\cdot, \cdot; p_{xy}): \mathbb{R}^{k(n_s + n_o)} \ni (x_1, \dots, x_k, y_1, \dots, y_k) \mapsto x_{k+1} \in M \tag{2}$$

It maps a sequence of locations $x_j$ and associated observations $y_j$ to the location that is to be sampled next in order to minimize a cost function $J(s_1, \dots, s_k, \dots, s_n; p_{xy})$ measuring dissatisfaction with the sequence of states $s_1, \dots, s_k, \dots, s_n$, where $s_k = [x_k, y_k] \in \mathbb{R}^{n_s + n_o}$. Note that the cost function $J$ depends on the sequence $s_1, \dots, s_k$ of traversed states as well as on the future states $s_{k+1}, \dots, s_n$ yet to be observed. The situation is illustrated in Fig. 2. This formulation has obvious similarities with optimal control problems, whose goal consists in finding a control input that steers a system governed by a stochastic differential equation towards achieving the minimum cost as measured by a functional (Makiko 2014, pp. 32-33). In our formulation, the control input is the next sampling location $x_{k+1}$ and the system evolves discretely towards its next state $s_{k+1} = [x_{k+1}, y_{k+1}]$. A concise formulation of optimal spatial discretization is the following nonlinear constrained minimization problem in the functional optimization variable $\pi(\cdot, \cdot; p_{xy})$:

$$\min_{\pi(\cdot, \cdot; p_{xy})} \; J(s_1, \dots, s_n; p_{xy}) \tag{3}$$

This formulation includes the various special cases mentioned during the survey of the state of the art (see Fig. 3 for an illustration).

Fig. 3 Different classes of discretization policies π. More arguments of the function π imply more flexibility to model dependencies between measured locations $x_k$, measurements $y_k$, and the objective function J

Classical formulations of geodetic network design fall primarily into the category of "experiment design". Consider, e.g. the first-order network design problem of choosing measurement locations $x_1, \dots, x_n$ such that the parameter vector $\beta$ in the least squares model is optimally estimable as measured by the maximum eigenvalue $\lambda_{\max}$ of the covariance matrix $\Sigma_\beta = (A^T A)^{-1}$ of the estimated $\beta$'s (Grafarend and Sansò 2011, pp. 56-73). This can be written in the same format as (3) by defining

$$J(s_1, \dots, s_q; p_{xy}) = \lambda_{\max}(\Sigma_\beta) \tag{4}$$

arriving at

$$\min_{\pi(\cdot; p_{xy})} \; \lambda_{\max}(\Sigma_\beta) \tag{5}$$

While Eq. (5) can be reformulated (Boyd and Vandenberghe 2004, pp. 384-397) and potentially be solved as a mixed-integer semidefinite program (Gally et al. 2018), it is still non-adaptive and NP-hard. However, under certain assumptions even the more difficult problem of determining an adaptive discretization scheme incorporating prior domain knowledge becomes computationally tractable. To this end, the optimization problem (3) is reframed as a problem involving Markov decision processes (MDP) and solved approximately using techniques from reinforcement learning.
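To make the non-adaptive baseline of Eq. (5) concrete, the following minimal sketch (our construction, not the paper's code; function names and problem sizes are illustrative) numerically minimizes $\lambda_{\max}((A^T A)^{-1})$ over six measurement locations for a cubic regression model, using general-purpose optimization in place of the semidefinite reformulation:

```python
import numpy as np
from scipy.optimize import minimize

def design_matrix(x):
    # Rows phi(x_i) = [1, x_i, x_i^2, x_i^3] of the least squares model.
    return np.vander(x, N=4, increasing=True)

def lambda_max_cov(x):
    A = design_matrix(x)
    # Small ridge guards against singular designs during the search.
    ata = A.T @ A + 1e-9 * np.eye(4)
    return np.linalg.eigvalsh(np.linalg.inv(ata))[-1]

x0 = np.linspace(0.05, 0.95, 6)                 # initial grid guess in [0, 1]
res = minimize(lambda_max_cov, x0, bounds=[(0.0, 1.0)] * 6)
print(np.sort(res.x), res.fun)                  # optimized, non-adaptive design
```

Note that the resulting point set is fixed before any data is seen, which is exactly the limitation the adaptive formulation below removes.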

Markov decision processes
A (discrete, finite) MDP is defined as a quadruple $\{S, A(s), r(s, a), p(s'|s, a)\}$ where $s \in S$, $|S| = n_S$, are states and $A(s)$ is the set of actions available in state $s$. The term $r(s, a)$ is a (random) reward earned for performing action $a$ in state $s$, and $p(s'|s, a)$ are the conditional probabilities of transitioning to a state $s' \in S$ when performing action $a$ in state $s$ (Feinberg and Shwartz 2012, p. 21). The policy function $\pi: S \ni s \mapsto a \in A(s)$ determining actions from states is specified as the variable of an optimization problem featuring the expected total return $\eta$:

$$\max_{\pi} \; \eta(\pi), \qquad \eta(\pi) = \mathbb{E}\left[\sum_{k} r(s_k, \pi(s_k))\right] \tag{6}$$
Solving the above optimization problem means finding a policy $\pi$ that for each state $s$ produces decisions $\pi(s)$ maximizing the sum of expected rewards. As rewards and probabilities are typically assumed conditionally independent of previous states and actions given the most recent state-action pair, the state transitions are Markovian (Melsa and Sage 2013, p. 334), hence the name MDP. We can model the optimal spatial discretization problem (3) as solving the (non-finite) spatial discretization MDP

$$\max_{\pi} \; \mathbb{E}\left[\sum_{k} r(s_k, a_k)\right], \qquad s_k \in S = M^n \times y(M)^n, \quad a_k = x_{k+1}, \tag{7}$$

in which entries of $-\infty$ in $s_k$ indicate an absence of measurements, for the optimal policy $\pi: s_k \mapsto x_{k+1}$. Note that problem (7) exhibits design freedom in the choice of rewards whose total sum is to be maximized, with reasonable choices being, e.g. negative reconstruction RMSE or detection errors. It is also possible to define rewards and transition probabilities implicitly by making reference to an environment (Sutton and Barto 2018, p. 44) that reacts to the input of actions with the output of rewards and state transitions. This is a convenient formalism when a system's reaction to input can be simulated but the exact probability distributions are hard to compute. Before investigating the machinery of reinforcement learning that is designed to solve problem (7), we showcase simple instances of the spatial discretization problem.
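The following toy instance (our illustration, not taken from the paper; the triangular deformation shape and all names are assumptions) makes the implicit-environment view of problem (7) tangible: the latent deformation is $y(x) = -|x - c|$ with a hidden random peak $c$, states are padded with $-\infty$, and the expected return $\eta$ of a policy is estimated by Monte Carlo rollouts:

```python
import numpy as np

rng = np.random.default_rng(0)

def episode(policy, n_meas=4):
    c = rng.uniform(0.0, 1.0)                 # hidden ground truth, max y = 0 at c
    state = np.full(2 * n_meas, -np.inf)      # locations, then observations
    total = 0.0
    for k in range(n_meas):
        x_next = policy(state)                # action: next sampling location
        state[k], state[n_meas + k] = x_next, -abs(x_next - c)
        # r_k = -(true max - best measured so far) <= 0, zero iff peak found.
        total += state[n_meas:n_meas + k + 1].max()
    return total

# Monte Carlo estimate of the expected total return eta for a random policy.
random_policy = lambda s: rng.uniform(0.0, 1.0)
eta = np.mean([episode(random_policy) for _ in range(2000)])
print(f"estimated eta: {eta:.3f}")
```

Reinforcement learning replaces `random_policy` with a trained function of the state, which is exactly where the adaptivity enters.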

Example processes related to discretization
The following four examples deal with the sequential choice of optimal spatial discretization. They are chosen for simplicity, with the task being the identification of maximum deformations in 1D, 2D, or spatiotemporal settings; they serve as benchmark problems on which we will test different approaches to spatial discretization in "Results for toy examples". States are vectors containing previously sampled locations and observations, and actions are decisions about the next measurement location. The reward functions typically penalize the discrepancy between the measured maximum and the true maximum deformation. As the reward signal requires ground truth, the algorithm is trained on simulated datasets before it is deployed to scenarios for which no ground truth is accessible.
Bending of a beam: Suppose a beam is subjected to a force with random magnitude and point of attack; see Fig. 1. Within 4 measurements, the greatest deformation $\max_x y(x)$ needs to be found. The deformation response is a piecewise cubic polynomial whose exact form is taken from (Dietmar et al. 2011, p. 138).
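For reference, a standard textbook form of this deflection curve (our transcription of the classical simply supported beam solution; Dietmar et al. may use a different parametrization) for a load $F$ at distance $a$ from the left support, with $b = L - a$ and bending stiffness $EI$, is

$$y(x) = \frac{F b x}{6 L E I}\left(L^2 - b^2 - x^2\right), \qquad 0 \le x \le a,$$

with the mirrored cubic expression $y(x) = \frac{F a (L - x)}{6 L E I}\left(L^2 - a^2 - (L - x)^2\right)$ on $a \le x \le L$.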
The beam can be encoded as $M = [0, L] \cup \{-\infty\}$, where $[0, L]$ is the extent of the beam. Similarly, the space of observable values is $y(M) = \mathbb{R} \cup \{-\infty\}$, with an entry of $-\infty$ indicating an observation's absence. The set of states $S$ is given by $S = M^4 \times y(M)^4$. As the only actions available for influencing the system are choices of measurement locations to be evaluated, the set of actions $A$ is equal to $M$, with action $a = x_{\text{next}}$. Rewards are dispensed according to the severity with which the maximum deformation is underestimated, i.e. $r(s_k, a_k) = -|\max_{j=1,\dots,k} y_j - \max_{x \in [0, L]} y(x)|$. The construction of the space of observations and the space of actions is the same for the subsequent examples.
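A possible encoding of this environment in the Gym interface later consumed by the learning library might look as follows. This is a sketch under stated assumptions: the class name `BeamEnv` is ours, a finite surrogate stands in for the $-\infty$ entries, and the normalized deflection shape from the formula above (constants dropped) stands in for the exact solution:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class BeamEnv(gym.Env):
    """Sketch of the beam bending environment (our reconstruction)."""
    L, N_MEAS = 1.0, 4

    def __init__(self):
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(8,), dtype=np.float32)
        self.action_space = spaces.Box(0.0, self.L, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.x_f = float(self.np_random.uniform(0.05, 0.95)) * self.L  # load position
        self.k = 0
        # Finite surrogate for the -inf entries marking absent measurements.
        self.state = np.full(8, -1.0e3, dtype=np.float32)
        return self.state.copy(), {}

    def _deflection(self, x):
        # Normalized simply supported beam shape (factor F/(6*L*E*I) dropped).
        a, b = self.x_f, self.L - self.x_f
        return np.where(x <= a,
                        b * x * (self.L**2 - b**2 - x**2),
                        a * (self.L - x) * (self.L**2 - a**2 - (self.L - x)**2))

    def step(self, action):
        x = float(np.clip(action[0], 0.0, self.L))
        self.state[self.k] = x
        self.state[4 + self.k] = self._deflection(x)
        self.k += 1
        # r = -|best measured maximum - true maximum| (grid approximation).
        y_max = self._deflection(np.linspace(0.0, self.L, 501)).max()
        reward = -abs(self.state[4:4 + self.k].max() - y_max)
        return self.state.copy(), float(reward), self.k >= self.N_MEAS, False, {}
```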
Random deformations in 1D: Suppose $y(x)$ is a 1D Gaussian process of deformations with a Brownian-bridge-type correlation structure. Within 4 measurements, find the maximum deformation $\max_x y(x)$. While the irregularity and randomness of the involved functions are higher than in example 1, the states, actions, and rewards are the same.
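Such training fields are cheap to simulate; a minimal sketch using the standard Brownian bridge covariance $k(s, t) = \min(s, t) - st/L$ on $[0, L]$ (the paper does not state its exact parametrization, so this is an assumption):

```python
import numpy as np

# Draw one deformation profile with Brownian-bridge covariance.
rng = np.random.default_rng(1)
L, n = 1.0, 200
x = np.linspace(0.0, L, n)
K = np.minimum.outer(x, x) - np.outer(x, x) / L       # k(s,t) = min(s,t) - s*t/L
y = rng.multivariate_normal(np.zeros(n), K + 1e-10 * np.eye(n))  # jitter for PSD
print(x[np.argmax(y)], y.max())   # ground-truth maximum used for the reward
```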
Deformation tracking: Suppose a wavelike deformation $y(x, t)$ is initiated at a random time and subsequently propagates along the interval $[0, L]$. At each time $t_k$, measure the deformation $y_k$ at a location $x_k$ and keep the last 10 measurements in a state vector. Upon occurrence of the deformation, track its crest.
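The rolling state vector can be realized as a simple shift buffer; a minimal sketch (our construction, with $-\infty$ marking not-yet-available entries as in the other examples):

```python
import numpy as np

# Rolling state for the tracking task: the last 10 (location, observation) pairs.
state = np.full((10, 2), -np.inf)

def push(state, x_k, y_k):
    state = np.roll(state, -1, axis=0)   # drop the oldest measurement
    state[-1] = (x_k, y_k)
    return state

state = push(state, 0.3, 0.01)           # one measurement at x = 0.3
```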

Random deformations in 2D: Suppose $y(x)$ is a 2D Gaussian process. Within 10 measurements, find the maximum deformation $\max_x y(x)$.
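A 2D training field can be drawn on a grid analogously to the 1D case; since the paper does not specify the kernel, the squared-exponential choice below is our assumption:

```python
import numpy as np

# One 2D Gaussian process realization on a 25 x 25 grid (RBF kernel assumed).
rng = np.random.default_rng(2)
g = np.linspace(0.0, 1.0, 25)
X = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)       # 625 grid points
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)           # squared distances
K = np.exp(-d2 / (2 * 0.2 ** 2))                              # length scale 0.2
y = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))
print(X[np.argmax(y)], y.max())   # location and size of the true maximum
```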

Concepts of reinforcement learning
Reinforcement learning is a subdiscipline of machine learning that strictly belongs to neither supervised nor unsupervised learning. It aims to solve optimal control tasks in which the agent exerting the control signal has no access to a pre-made model of the environment (Sutton and Barto 2018, p. 44). Instead, it has to rely on interacting with this environment, which communicates a state change and a reward signal to the agent upon being submitted an action; in other words, its goal is to solve the implicit version of problem (7).
To this end, the policy $\pi(\cdot)$ (briefly assumed to be stochastic), mapping from $S \times A$ to $\mathbb{R}$, is modelled as a parametric function $\pi(\cdot) = \pi_\theta(\cdot)$ with parameters $\theta \in \Theta$. The assumption of stochastic policies implies that for each state, actions are chosen probabilistically. It leads to convenient formulations of policy gradients but is otherwise cumbersome; we will not refer to it outside of Eqs. (7) and (8). Data gathered from interactions with the environment is used to form policy gradients $\nabla_\theta \eta$ of the expected return $\eta$ with respect to the policy parameters $\theta$. This enables adjusting the parameters based on observations. Denote by $\xi = [(s_0, a_0), \dots, (s_n, a_n)] \in (S \times A)^{n+1}$ the random sample path, by $p_\theta(\xi)$ its probability density, and by $\xi^j = [(s_0^j, a_0^j), \dots, (s_n^j, a_n^j)]$, $j = 1, \dots, m$, the sample paths gathered from $m$ interactions with the environment. Policy gradient estimates $\hat{\nabla}_\theta \eta$ can be formulated using, e.g. classical Monte Carlo techniques. They take the form (Wiering and van Otterlo 2012, p. 367)

$$\hat{\nabla}_\theta \eta = \frac{1}{m} \sum_{j=1}^{m} \left[\sum_{k=0}^{n} \nabla_\theta \log \pi_\theta(s_k^j, a_k^j)\right] \left[\sum_{k=0}^{n} r(s_k^j, a_k^j)\right] \tag{8}$$

Equation (8) presents the gradients of expected returns making use of the derivatives of the probability distribution $\pi_\theta(s, a)$ w.r.t. the parameters $\theta$. The equation furthermore features the observable sample paths and rewards but nothing else, thereby making it a practical ingredient for gradient-based adjustments to policies. Adaptations of (8) for deterministic policies $\pi_\theta$ exist (Silver et al. 2014). In addition to the policy suggesting next actions, the quality of the proposed state-action pairs is often estimated by a critic function, yielding the so-called actor-critic methods (Sutton et al. 1999). The role of the critic is taken over by the state-action value function $Q^\pi: S \times A \to \mathbb{R}$ (Wiering and van Otterlo 2012, p. 16), which predicts the expected return of being in state $s_0$, taking action $a_0$, and following policy $\pi$ afterwards. The optimal policy $\pi^*$ implies a recursive relationship on $Q^{\pi^*}$ known as the Bellman optimality equation (Wiering and van Otterlo 2012, p. 17):

$$Q^{\pi^*}(s, a) = \mathbb{E}_{s' \sim p(\cdot|s, a)}\left[r(s, a) + \max_{a' \in A(s')} Q^{\pi^*}(s', a')\right] \tag{9}$$

The recursive behaviour of $Q^{\pi^*}$ enables its iterative determination via the Banach fixed point theorem and provides the basis for practical implementations, which model both the state-action value function and the policy as neural networks. More detailed descriptions are available in the subsequent section.
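As a concrete illustration of Eq. (8), the following sketch (our construction; the paper itself uses TD3 rather than plain REINFORCE, and the network sizes are illustrative) updates a Gaussian policy over sampling locations along a Monte Carlo policy gradient:

```python
import torch

# Gaussian policy: a network maps the 8-dimensional state (4 locations +
# 4 observations) to the mean of the next sampling location.
policy_net = torch.nn.Sequential(
    torch.nn.Linear(8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
log_std = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam(list(policy_net.parameters()) + [log_std], lr=1e-3)

def update(states, actions, total_return):
    # Gradient estimate: sum_k grad log pi_theta(a_k | s_k) * R(xi), cf. Eq. (8).
    dist = torch.distributions.Normal(policy_net(states), log_std.exp())
    loss = -dist.log_prob(actions).sum() * total_return   # ascent on eta
    opt.zero_grad(); loss.backward(); opt.step()

# Usage: after each simulated episode with states S (n x 8), actions A (n x 1),
# and total reward R, call
# update(torch.tensor(S, dtype=torch.float32),
#        torch.tensor(A, dtype=torch.float32), R)
```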

Implementation
Modern approaches combine state-action value functions and deterministic policy gradients in the form of actor-critic methods (Silver et al. 2014). In subsequent applications, we use the algorithm TD3, an abbreviation of twin delayed deep deterministic policy gradients. It employs two deep neural networks to formulate guesses for the state-action value function Q(s, a), acting as the critic, and one neural network π that performs as an actor by suggesting actions. While the algorithm runs, it interacts with the environment by selecting actions based on the actor or on random exploration. At regular intervals, previously observed state transitions and rewards are used to update the critic networks and the actor network along the policy gradients in a way that controls the overestimation incurred from applying the max operator to noisy data (Scott et al. 2018). Figure 4 outlines the interactions between actions, state transitions, rewards, and policies in the context of spatial discretization.
Fig. 4 The RL algorithm devises a policy mapping states to actions. The environment changes states upon being presented with an action and provides a reward signal used by the RL algorithm to improve the policy

Note the difference in required information during training and deployment of the model. Training requires the reward function and thereby access to the ground truth, whereas after the training phase and during deployment the state observations alone are enough to drive the selection of further measurement locations. This allows training to be done on a computational model, with deployment in the real world independent of an inaccessible reward signal. We use the open-source implementation provided by the Python library Stable-Baselines3 (Raffin et al. 2019), based on the open-source machine learning framework PyTorch (Adam et al. 2019). The environments described in "Example processes related to discretization" have been formulated in terms of Python classes serving as input to the TD3 algorithm. They are available for download on our GitHub page, where the full scripts for benchmarking different approaches to optimal spatial discretization can also be accessed.
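With the environment expressed as a Python class, training and deployment reduce to a few lines of Stable-Baselines3. A minimal sketch, assuming the illustrative `BeamEnv` from above and mirroring the 64-neuron architecture reported in the next section:

```python
from stable_baselines3 import TD3

# Training: the simulated environment supplies rewards (ground truth needed).
env = BeamEnv()  # illustrative environment sketched earlier
model = TD3("MlpPolicy", env, policy_kwargs=dict(net_arch=[64, 64]), verbose=1)
model.learn(total_timesteps=100_000)

# Deployment: observations alone drive the choice of measurement locations.
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, _, done, _, _ = env.step(action)
print(obs)  # sampled locations and observed deformations
```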

Results for toy examples
The four simple examples outlined in "Example processes related to discretization" have been tackled with the TD3 algorithm. The algorithm was used without altering the parameters of its default initialization. Both critic networks and the actor network are neural networks with a normalization layer and two hidden layers of 64 neurons each. The agent was then trained for 100,000 timesteps, with each timestep representing one decision on where to measure next and the corresponding reward conveyed back to the agent from the environment. This takes between 40 and 60 min on a standard office laptop. We compare the performance of the fully trained agents to the performance achieved by the following alternative strategies, each of which uses the same number of sampling locations: grid-based (quadrature) sampling, uniformly random sampling, pseudorandom (low-discrepancy) sampling, and non-adaptive experiment design optimized numerically via basin hopping; see the sketch below. The methods were implemented in Python, mainly using numpy (Harris et al. 2020). Table 1 summarizes our experiments; it records the average reward and its spread via the empirical standard deviation based on 1000 trial runs, with bold font indicating the best result for each task. For the deformation tracking task, we decided not to report the results of quadrature sampling, which is evidently ill-suited to dynamic, time-dependent data (sampling the same n grid points repeatedly cannot track a deformation).
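The exact baseline implementations live in the benchmarking scripts on the authors' GitHub page; the following is only a sketch of how such strategies can be generated, with a stand-in design objective of our own choosing for the numerically designed points:

```python
import numpy as np
from scipy.optimize import basinhopping
from scipy.stats import qmc

rng = np.random.default_rng(3)
n = 4  # number of sampling locations per episode

grid = np.linspace(0.0, 1.0, n)                      # grid/quadrature-style
uniform = rng.uniform(0.0, 1.0, n)                   # random sampling
halton = qmc.Halton(d=1, seed=3).random(n).ravel()   # pseudorandom sequence

# Numerically designed (non-adaptive) points: maximize the average best-found
# value over simulated peak locations; a stand-in objective, the paper's
# design criterion may differ.
peaks = rng.uniform(0.0, 1.0, 256)
def neg_expected_best(x):
    return -np.mean([np.max(-np.abs(x - c)) for c in peaks])

designed = basinhopping(neg_expected_best, grid, niter=50).x
print(grid, uniform, halton, np.sort(designed), sep="\n")
```

All four strategies fix their locations independently of incoming observations, which is the decisive difference from the trained policy.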
The table highlights that reinforcement learning is able to outperform the alternative strategies by a significant margin, typically by a factor of 2 to 4 in our examples. Its unique advantage is its ability to adapt to arbitrary environments by integrating domain-specific knowledge gained from interacting with the system. This comes at the price of a high computational burden during the training stage. Whereas no training is needed for the task-agnostic methods and the optimization performed during basin hopping typically converges within a few seconds, training the reinforcement learning agent takes up to a full hour. Deployment of the algorithm after training is a matter of split seconds, though.

Analysis and interpretation
While the table indicates that RL is feasible for the task of spatial discretization, it is a very aggregated summary of performance. We now analyze and interpret the behavior of RL in specific problem instances and formulate expectations about its advantages and disadvantages w.r.t. other methods. One observation immediately derivable from the table is that RL outperforms other methods particularly strongly on the beam bending and deformation tracking tasks, where its reward is 3 to 4 times better. As the other methods do not permit their decisions on the sequences of measurement locations to be influenced by the current observations, they are ill-suited to spatiotemporal tasks, which explains their performance on the deformation tracking task. RL's near-perfect score on the beam bending task may be traced back to the neural network being able to learn the deterministic relationship between the size of deformations at some locations and the location of the maximum deformation. Figure 5 illustrates the results of applying the trained policies to the beam bending and 2D deformation tasks.
Note that the sequence of measurement locations is always the same for the experiment-design approach. The policy trained by reinforcement learning, however, adapts and samples different locations depending on the observed deformations. This adaptivity is leveraged into a comparative advantage. Parsing through several hundred examples visually, we found no obvious failure cases. Nonetheless, the impact of differences between the computational model used for training and the real-world behavior of the actual physical system to be discretized is hard to quantify. While model-mismatch-induced errors affect all methods, they are especially serious for RL, for which, as of yet, no reliable means of uncertainty quantification is available.

Fig. 5 Sampling sequences and corresponding rewards for policies trained with reinforcement learning and experiment design on the beam bending and 2D deformation tasks

Conclusion
We have presented the problem of optimal spatial discretization of objects or dynamical systems as one of the recurring tasks in the design of geodetic measurement schemes employed for gathering data during measurement campaigns or monitoring. We reformulated it as an optimal control problem and established a hierarchy of approaches based on the flexibility of the policy function π(·) mapping previous observations to the next measurement location. This contextualized classical approaches and led to the framework of Markov decision processes, into which we embedded the problem, thereby making it amenable to the machinery of reinforcement learning.
We showed with the help of four simple examples that reinforcement learning is both flexible and powerful. It outperforms methods based on gridlike and (pseudo)random discretization strategies as well as on the non-adaptive optimization of point distributions. At the same time, reinforcement learning only needs an environment, an interactive black box, to achieve this, obviating the need to explicitly prescribe probabilities. It opens up the possibility of optimizing measurement strategies based on a numerical model of a system's behavior and a performance index measuring the desired information. The resulting measurement strategy is adaptive and makes use of domain-specific knowledge. In this way, a solution is provided for high-dimensional, sequential, and stochastic problems to which conventional optimization algorithms like least squares or exhaustive search are not applicable.
Practically, effort normally invested by the practitioner into stochastic modelling is instead outsourced to the algorithm. It can handle high-dimensional, continuous state and action spaces and atypical, implicit probability distributions that would be hard to model manually. This makes prototyping fast and easy. The excellent performance of the method is overshadowed only by the data inefficiency of current implementations, which discourages its use in situations where data would have to be gathered painstakingly in a real-world setting rather than from simulations. Considering the above, we see practical use cases for reinforcement learning in geodetic monitoring wherever computational models are available: in structural health monitoring and in geomechanical and geophysical settings.