Optimal sequential decision making with probabilistic digital twins

Digital twins are emerging in many industries, typically consisting of simulation models and data associated with a specific physical system. One of the main reasons for developing a digital twin is to enable the simulation of possible consequences of a given action, without the need to interfere with the physical system itself. Physical systems of interest, and the environments they operate in, do not always behave deterministically. Moreover, information about the system and its environment is typically incomplete or imperfect. Probabilistic representations of systems and environments may therefore be called for, especially to support decisions in application areas where actions may have severe consequences. In this paper we introduce the probabilistic digital twin (PDT). We will start by discussing how epistemic uncertainty can be treated using measure theory, by modelling epistemic information via $\sigma$-algebras. Based on this, we give a formal definition of how epistemic uncertainty can be updated in a PDT. We then study the problem of optimal sequential decision making. That is, we consider the case where the outcome of each decision may inform the next. Within the PDT framework, we formulate this optimization problem. We discuss how this problem may be solved (at least in theory) via the maximum principle method or the dynamic programming principle. However, due to the curse of dimensionality, these methods are often not tractable in practice. To mend this, we propose a generic approximate solution using deep reinforcement learning together with neural networks defined on sets. We illustrate the method on a practical problem, considering optimal information gathering for the estimation of a failure probability.


Introduction
Pre-print: This is a pre-print version of this article.
The use of digital twins has emerged as one of the major technology trends of the last couple of years. In essence, a digital twin (DT) is a digital representation of some physical system, including data from observations of the physical system, which can be used to perform forecasts, evaluate the consequences of potential actions, simulate possible future scenarios, and in general inform decision making without requiring interference with the physical system. From a theoretical perspective, a digital twin may be regarded as consisting of the following two components:
• A set of assumptions regarding the physical system (e.g. about the behaviour or relationships among system components and between the system and its environment), often given in the form of a physics-based numerical simulation model.
• A set of information, usually in the form of a set of observations, or records of the relevant actions taken within the system.
In some cases, a digital twin may be desired for a system whose attributes and behaviours are not deterministic, but stochastic. For example, the degradation and failure of physical structures or machinery are typically described as stochastic processes. A system's performance may be impacted by weather or financial conditions, which may also be most appropriately modelled as stochastic. Sometimes the functioning of the system itself is stochastic, such as supply chains or production chains involving stochastic variation in demand and performance of various system components.
arXiv:2103.07405v1 [stat.ML] 12 Mar 2021
Even for systems or phenomena that are deterministic in principle, a model will never give a perfect rendering of reality. There will typically be uncertainty about the model's structure and parameters (i.e. epistemic uncertainty), and if consequences of actions can be critical, such uncertainties need to be captured and handled appropriately by the digital twin. In general, the system of interest will have both stochastic elements (aleatory uncertainty) and epistemic uncertainty.
If we want to apply digital twins to inform decisions in systems where the analysis of uncertainty and risk is important, certain properties are required:
1. The digital twin must capture uncertainties: This could be done by using a probabilistic representation for uncertain system attributes.
2. It should be possible to update the digital twin as new information becomes available: This could be from new evidence in the form of data, or underlying assumptions about the system that have changed.
3. For the digital twin to be informative in decision making, it should be possible to query the model sufficiently fast: This could mean making use of surrogate models or emulators, which introduces additional uncertainties.
These properties are paraphrased from Hafver et al. [1], which provides a detailed discussion on the use of digital twins for on-line risk assessment. In this paper we propose a mathematical framework for defining digital twins that comply with these properties. Following Hafver et al. [1], we will refer to these as probabilistic digital twins (PDTs), and we will build on the Bayesian probabilistic framework, which is a natural choice to satisfy (1)-(2).
A numerical model of a complex physical system can often be computationally expensive, for instance if it involves the numerical solution of nontrivial partial differential equations. In a probabilistic setting this is prohibitive, as a large number of evaluations (e.g. PDE solves) is needed for tasks involving uncertainty propagation, such as prediction and inference. Applications towards real-time decision making also set natural restrictions with respect to the runtime of such queries. This is why property (3) is important, and why probabilistic models of complex physical phenomena often involve the use of approximate alternatives, usually obtained by "fitting" a computationally cheap model to the output of a few expensive model runs. These computationally cheap approximations are often referred to as response surface models, surrogate models or emulators in the literature.
Introducing this kind of approximation for computational efficiency also means that we introduce additional epistemic uncertainty into our modelling framework. By epistemic uncertainty we mean, in short, any form of uncertainty that can be reduced by gathering more information (to be discussed further later on). In our context, uncertainty may in principle be reduced by running the expensive numerical models instead of the cheaper approximations.
Many interesting sequential decision making problems arise from the property that our knowledge about the system we operate changes as we learn about the outcomes. That is, each decision may affect the epistemic uncertainty which the next decision will be based upon. We are motivated by this type of scenario, in combination with the challenge of finding robust decisions in safety-critical systems, where a decision should be robust with respect to what we do not know, i.e. with respect to epistemic uncertainty. Although we will not restrict the framework presented in this paper to any specific type of sequential decision making objectives, we will mainly focus on problems related to optimal information gathering. That is, where the decisions we consider are related to acquiring information (e.g., by running an experiment) in order to reduce the epistemic uncertainty with respect to some specified objective (e.g., estimating some quantity of interest).
A very relevant example of such a task is the problem of optimal experimental design for structural reliability analysis. This involves deciding which experiments to run in order to build a surrogate model that can be used to estimate a failure probability with a sufficient level of confidence. This is a problem that has received considerable attention (see e.g. [2,3,4,5,6,7,8]). These methods all make use of a myopic (one-step lookahead) criterion to determine the "optimal" experiment, as a multistep or full dynamic programming formulation of the optimization problem becomes numerically infeasible. Agrell and Dahl [2] consider the case where there are different types of experiments to choose from. Here, the myopic (one-step lookahead) assumption can still be justified, but if the different types of experiments are associated with different costs, then it can be difficult to apply in practice (e.g., if a feasible solution requires expensive experiments with delayed reward).
We will review the mathematical framework of sequential decision making, and connect this to the definition of a PDT. Traditionally, there are two main solution strategies for solving discrete-time sequential decision making problems: maximum principles and dynamic programming. We review these two solution methods, and conclude that the PDT framework is well suited for a dynamic programming approach. However, dynamic programming suffers from the curse of dimensionality, i.e. the number of possible sequences of decisions and state realizations grows exponentially with the size of the state space. Hence, we are typically not able to solve a PDT sequential decision making problem in practice directly via dynamic programming.
As a generic solution to the problem of optimal sequential decision making we instead propose an alternative based on reinforcement learning. This means that when we consider the problem of finding an optimal decision policy, instead of truncating the theoretical optimal solution (from the Bellman equation) by e.g. looking only one step ahead, we try to approximate the optimal policy. This approximation can be done by using e.g. a neural network. Here we will frame the sequential decision making setup as a Markov decision process (MDP), in general as a partially observed MDP (POMDP), where a state is represented by the information available at any given time. This kind of state specification is often referred to as the information state-space. As a generic approach to deep reinforcement learning using PDTs, we propose an approach using neural networks that operate on the information state-space directly.
Main contributions. In this paper we will:
(i) Propose a mathematical framework for modelling epistemic uncertainty based on measure theory, and define epistemic conditioning.
(ii) Present a mathematical definition of the probabilistic digital twin (PDT). This is a mathematical framework for modelling physical systems with aleatory and epistemic uncertainty.
(iii) Introduce the problem of sequential decision making in the PDT, and illustrate how this problem can be solved (at least in theory) via maximum principle methods or the dynamic programming principle.
(iv) Discuss the curse of dimensionality for these solution methods, and illustrate how the sequential decision making problem in the PDT can be viewed as a partially observable Markov decision process.
(v) Explain how reinforcement learning (RL) can be applied to find approximate optimal strategies for sequential decision making in the PDT, and propose a generic approach using a deep sets architecture that enables RL directly on the information state-space. We end with a numerical example to illustrate this approach.
The paper is structured as follows: In Section 2 we introduce epistemic uncertainty and suggest modelling this via σ-algebras. We also define epistemic conditioning. In Section 3, we present the mathematical framework, as well as a formal definition, of a probabilistic digital twin (PDT), and discuss how such PDTs are used in practice.
Then, in Section 4, we introduce the problem of stochastic sequential decision making. We discuss the traditional solution approaches, in particular dynamic programming, which is theoretically a suitable approach for decision problems that can be modelled using a PDT. However, due to the curse of dimensionality, using dynamic programming directly is typically not tractable. We therefore turn to reinforcement learning using function approximation as a practical alternative. In Section 5, we show how an approximate optimal strategy can be achieved using deep reinforcement learning, and we illustrate the approach with a numerical example. Finally, in Section 6 we conclude and sketch some future work in this direction.

A measure-theoretic treatment of epistemic uncertainty
In this section, we review the concepts of epistemic and aleatory uncertainty, and introduce a measure-theoretic framework for modelling epistemic uncertainty. We will also define epistemic conditioning.

Motivation
In uncertainty quantification (UQ), it is common to consider two different kinds of uncertainty: aleatory (stochastic) and epistemic (knowledge-based) uncertainty. We say that uncertainty is epistemic if we foresee the possibility of reducing it through gathering more or better information. For instance, uncertainty related to a parameter that has a fixed but unknown value is considered epistemic. Aleatory uncertainty, on the other hand, is the uncertainty which cannot (from the modeller's perspective) be affected by gathering information alone. Note that the characterization of aleatory and epistemic uncertainty has to depend on the modelling context. For instance, the result of a coin flip may be viewed as epistemic, if we imagine a physics-based model that could predict the outcome exactly (given all initial conditions etc.). However, under most circumstances it is most natural to view a coin flip as aleatory, or as containing both aleatory and epistemic uncertainty (e.g. if the bias of the coin is unknown). Der Kiureghian et al. [9] provide a detailed discussion of the differences between aleatory and epistemic uncertainty.
In this paper, we have two main reasons for distinguishing between epistemic and aleatory uncertainty. First, we would like to make decisions that are robust with respect to epistemic uncertainty. Secondly, we are interested in studying the effect of gathering information. Modelling epistemic uncertainty is a natural way of doing this.
In the UQ literature, aleatory uncertainty is typically modelled via probability theory. However, epistemic uncertainty is represented in many different ways. For instance, Helton [10] considers four different ways of modelling epistemic uncertainty: interval analysis, possibility theory, evidence theory (Dempster-Shafer theory) and probability theory.
In this paper we take a measure-theoretic approach. This provides a framework that is relatively flexible with respect to the types of assumptions that underlie the epistemic uncertainty. As a motivating example, consider the following typical setup used in statistics:
Example 2.1. (A parametric model) Let X = (Y, θ), where Y is a random variable representing some stochastic phenomenon, and assume Y is modelled using a given probability distribution, P(Y|θ), that depends on a parameter θ (e.g. Y ∼ N(µ, σ) with θ = (µ, σ)). Assume that we do not know the value of θ, and we therefore consider θ as a (purely) epistemic parameter. For some fixed value of θ, the random variable Y is (purely) aleatory, but in general, as the true value of θ is not known, Y is associated with both epistemic and aleatory uncertainty.
The model X in Example 2.1 can be decoupled into an aleatory component Y|θ and an epistemic component θ. Any property of the aleatory uncertainty in X is determined by P(Y|θ), and is therefore a function of θ. For instance, the probability P(Y ∈ A|θ) and the expectation E[f(Y)|θ] are both functions of θ. There are different ways in which we can choose to address the epistemic uncertainty in θ. We could consider intervals, for instance the minimum and maximum of P(Y ∈ A|θ) over all plausible values of θ, or assign probabilities, or some other measure of belief, to the possible values θ may take. However, in order for this to be well-defined mathematically, we need to put some requirements on A, f and θ. By using probability theory to represent the aleatory uncertainty, we implicitly assume that the set A and function f are measurable, and we will assume that the same holds for θ. We will describe in detail what is meant by measurable in Section 2.2 below. Essentially, this is just a necessity for defining properties such as distance, volume or probability in the space where θ resides.
In this paper we will rely on probability theory for handling both aleatory and epistemic uncertainty. This means that, along with the measurability requirement on θ, we have the familiar setup for Bayesian inference:
Example 2.2. (A parametric model - Inference and prediction) If θ from Example 2.1 is a random variable with distribution P(θ), then X = (Y, θ) denotes a complete probabilistic model (capturing both aleatory and epistemic uncertainty). X is a random variable with distribution P(X) = P(Y|θ)P(θ).
Let I be some piece of information from which Bayesian inference is possible, i.e. P(X|I) is well defined. We may then define the updated joint distribution
$$P_{\text{new}}(X) = P(Y \mid \theta)\, P(\theta \mid I).$$
Note that the distribution $P_{\text{new}}(X)$ in Example 2.2 is obtained by only updating the belief with respect to epistemic uncertainty, and that in general $P_{\text{new}}(X) \neq P(X \mid I) = P(Y \mid I, \theta)\, P(\theta \mid I)$.
For instance, if I corresponds to an observation of Y, e.g. I = {Y = y}, then P(Y|I) = δ(y), the Dirac delta at y, whereas P(θ|I) is the updated distribution for θ having observed one realization of Y. In the following, we will refer to the kind of Bayesian updating in Example 2.2 as epistemic updating.
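The distinction between epistemic updating and ordinary conditioning can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes Y | θ ∼ N(θ, 1), restricts θ to a hypothetical three-point grid with a uniform prior, and takes I to be a single observed value of Y. The function names `normal_pdf`, `epistemic_update` and `predictive_density` are introduced here for illustration only.

```python
import math

def normal_pdf(y, mean, sd=1.0):
    """Density of N(mean, sd^2) evaluated at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def epistemic_update(prior, observations):
    """Return P(theta | I) on a discrete grid, where I is a list of observed values of Y.

    prior: dict mapping theta -> P(theta) (an assumed discrete prior).
    """
    unnormalised = {
        theta: p * math.prod(normal_pdf(y, theta) for y in observations)
        for theta, p in prior.items()
    }
    total = sum(unnormalised.values())
    return {theta: w / total for theta, w in unnormalised.items()}

def predictive_density(belief, y):
    """Density of Y at y: the aleatory model P(Y | theta) averaged over the belief in theta."""
    return sum(p * normal_pdf(y, theta) for theta, p in belief.items())

prior = {-1.0: 1 / 3, 0.0: 1 / 3, 1.0: 1 / 3}   # assumed prior over theta
posterior = epistemic_update(prior, observations=[0.9])

# Only the belief about theta has changed; P(Y | theta) is left untouched, so the
# updated distribution of Y is still a proper density, not a point mass at y = 0.9.
print(posterior)
print(predictive_density(prior, 0.0), predictive_density(posterior, 0.0))
```

In this sketch the observation shifts belief towards θ = 1, while a future value of Y remains uncertain through P(Y|θ). Conditioning Y itself on I = {Y = 0.9} would instead collapse it to the observed value.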
But what can we say in a more general setting? It is common that epistemic uncertainty comes from lack of knowledge related to functions. This is the case with probabilistic emulators and surrogate models. The input to these functions may contain epistemic and/or aleatory uncertainty as well. Can we talk about isolating and modifying the epistemic uncertainty in such a model, without making reference to the specific details of how the model has been created? In the following we will show that with the measure-theoretic framework, we can still make use of a simple formulation like the one in Example 2.2.

The probability space
Let X be a random variable containing both aleatory and epistemic uncertainty. In order to describe how X can be treated like in Example 2.1 and Example 2.2, but for the general setting, we will first recall some of the basic definitions from measure theory and measure-theoretic probability.
To say that X is a random variable means that X is defined on some measurable space (Ω, F). Here, Ω is a set, and if X takes values in Rⁿ (or some other measurable space), then X is a so-called measurable function, X(ω) : Ω → Rⁿ (to be defined precisely later). Any randomness or uncertainty about X is just a consequence of uncertainty regarding ω ∈ Ω. As an example, X could relate to some 1-year extreme value, whose uncertainty comes from day-to-day fluctuations, or some fundamental stochastic phenomenon represented by ω ∈ Ω. Examples of natural sources of uncertainty are weather or human actions at large scale. Therefore, whether modelling weather, option prices, structural safety at sea or traffic networks, stochastic models should be used.
The probability of the event {X ∈ E}, for some subset E ⊂ Rⁿ, is really the probability of {ω ∈ X⁻¹(E)}. Technically, we need to ensure that {ω ∈ X⁻¹(E)} is something that we can compute the probability of, and for this we need F. F is a collection of subsets of Ω, and represents all possible events (in the "Ω-world"). When F is a σ-algebra¹, the pair (Ω, F) becomes a measurable space.
So, when we define X as a random variable taking values in Rⁿ, this means that there exists some measurable space (Ω, F), such that any event {X ∈ E} in the "Rⁿ-world" (which has its own σ-algebra) has a corresponding event {ω ∈ X⁻¹(E)} ∈ F in the "Ω-world". It also means that we can define a probability measure on (Ω, F) that gives us the probability of each event, but before we introduce any specific probability measure, X will just be a measurable function².
- We start by assuming that there exists some measurable space (Ω, F) where X is a measurable function.
The natural way to make X into a random variable is then to introduce some probability measure³ P on F, giving us the probability space (Ω, F, P).
- Given a probability measure P on (Ω, F) we obtain the probability space (Ω, F, P) on which X is defined as a random variable.
We have considered here, for familiarity, that X takes values in Rⁿ. When no measure and σ-algebra are stated explicitly, one can assume that Rⁿ is endowed with the Lebesgue measure (which underlies the standard notion of length, area and volume etc.) and the Borel σ-algebra (the smallest σ-algebra containing all open sets). Generally, X can take values in any measurable space. For example, X can map from Ω to a space of functions. This is important in the study of stochastic processes.

The epistemic sub σ-algebra E
In the probability space (Ω, F, P), recall that the σ-algebra F contains all possible events. For any random variable X defined on (Ω, F, P), the knowledge that some event has occurred provides information about X. This information may relate to X in a way that it only affects epistemic uncertainty, only aleatory uncertainty, or both. We are interested in specifying the events e ∈ F that are associated with epistemic information alone. It is the probability of these events we want to update as new information is obtained. The collection E of such sets is itself a σ-algebra, and we say that E is the sub σ-algebra of F representing epistemic information.
We illustrate this in the following examples. In Example 2.3, we consider the simplest possible scenario represented by the flip of a biased coin, and in Example 2.4 a familiar scenario from uncertainty quantification involving uncertainty with respect to functions.
Example 2.3. (Coin flip) Define X = (Y, θ) as in Example 2.1, and let Y ∈ {0, 1} denote the outcome of a coin flip where "heads" is represented by Y = 0 and "tails" by Y = 1. Assume that P(Y = 0) = θ for some fixed but unknown θ ∈ [0, 1]. For simplicity we assume that θ can only take two values, θ ∈ {θ₁, θ₂} (e.g. there are two coins but we do not know which one is being used). Then we can take $\Omega = \{0, 1\} \times \{\theta_1, \theta_2\}$, $\mathcal{F} = 2^{\Omega}$ and $\mathcal{E} = \{\emptyset, \Omega, \{(0, \theta_1), (1, \theta_1)\}, \{(0, \theta_2), (1, \theta_2)\}\}$.
Example 2.4. (UQ) Let X = (x, y) where x is an aleatory random variable, and y is the result of a fixed but unknown function applied to x. We let y = f(x) where f is a function-valued epistemic random variable.
If x is defined on a probability space $(\Omega_x, \mathcal{F}_x, P_x)$ and f is a stochastic process defined on $(\Omega_f, \mathcal{F}_f, P_f)$, then (Ω, F, P) can be defined as the product of the two spaces and E as the projection
$$\mathcal{E} = \{\Omega_x \times A \mid A \in \mathcal{F}_f\}.$$
In the following, we assume that the epistemic sub σ-algebra E has been identified.
Given a random variable X, we say that X is E-measurable if X is measurable as a function defined on (Ω, E). We say that X is independent of E if the conditional probability P(X|e) is equal to P(X) for any event e ∈ E. With our definition of E, we then have for any random variable X on (Ω, F, P) that
- X is purely epistemic if and only if X is E-measurable,
- X is purely aleatory if and only if X is independent of E.
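For a finite sample space these notions can be checked by direct enumeration. The sketch below is illustrative only and rests on assumptions that are not part of the text: the coin flip of Example 2.3 is modelled on Ω = {0, 1} × {θ₁, θ₂}, the two coins are given hypothetical heads probabilities 0.5 and 0.99 with equal prior belief, and E is taken to be the σ-algebra generated by θ. All function names are hypothetical.

```python
from itertools import combinations, product

THETAS = ("theta1", "theta2")                     # labels for the two candidate coins
HEADS_PROB = {"theta1": 0.5, "theta2": 0.99}      # assumed values of P(Y = 0 | theta)
PRIOR = {"theta1": 0.5, "theta2": 0.5}            # assumed belief P(theta)
OMEGA = frozenset(product((0, 1), THETAS))        # outcomes omega = (y, theta)

def Y(omega):
    return omega[0]

def theta(omega):
    return omega[1]

def generated_sigma_algebra(variable, omega):
    """Sigma-algebra on a finite omega generated by `variable`: all unions of its level sets."""
    values = sorted({variable(w) for w in omega})
    atoms = [frozenset(w for w in omega if variable(w) == v) for v in values]
    return {
        frozenset().union(*chosen)
        for r in range(len(atoms) + 1)
        for chosen in combinations(atoms, r)
    }

def is_measurable(variable, events, omega):
    """A finite-valued variable is measurable iff every level set {variable = v} is an event."""
    values = {variable(w) for w in omega}
    return all(frozenset(w for w in omega if variable(w) == v) in events for v in values)

def prob(event):
    """P(event) under P(y, theta) = P(theta) * P(y | theta)."""
    return sum(
        PRIOR[t] * (HEADS_PROB[t] if y == 0 else 1.0 - HEADS_PROB[t]) for (y, t) in event
    )

E = generated_sigma_algebra(theta, OMEGA)         # epistemic sub sigma-algebra
heads = frozenset(w for w in OMEGA if Y(w) == 0)
coin1 = frozenset(w for w in OMEGA if theta(w) == "theta1")

print(len(E))                                     # number of epistemic events
print(is_measurable(theta, E, OMEGA))             # theta is E-measurable (purely epistemic)
print(is_measurable(Y, E, OMEGA))                 # Y is not E-measurable
print(prob(heads), prob(heads & coin1) / prob(coin1))   # P(Y = 0) versus P(Y = 0 | theta1)
```

Under these assumptions θ is E-measurable, whereas Y is neither E-measurable nor independent of E, since P(Y = 0 | θ = θ₁) differs from P(Y = 0). This matches the reading that Y carries both epistemic and aleatory uncertainty.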

Epistemic conditioning
Let X be a random variable on (Ω, F, P) that may contain both epistemic and aleatory uncertainty, and assume that the epistemic sub σ-algebra E is given. By epistemic conditioning, we want to update the epistemic part of the uncertainty in X using some set of information I. In Example 2.3 this means updating the probabilities P(θ = θ₁) and P(θ = θ₂), and in Example 2.4 this means updating $P_f$. In order to achieve this in the general setting, we first need a way to decouple epistemic and aleatory uncertainty. This can actually be made fairly intuitive, if we rely on the following assumption:
Assumption 2.5. There exists a random variable θ : Ω → Θ that generates⁴ E.
If this generator θ exists, then for any fixed value θ ∈ Θ, we have that X|θ is independent of E. Hence X|θ is purely aleatory and θ is purely epistemic.
We will call θ the epistemic generator, and we can interpret θ as a signal that reveals all epistemic information when known. That is, if θ could be observed, then knowing the value of θ would remove all epistemic uncertainty from our model. As it turns out, under fairly mild conditions one can always assume existence of this generator. One sufficient condition is that (Ω, F, P) is a standard probability space, and then the statement holds up to sets of measure zero. This is a technical requirement to avoid pathological cases, and does not provide any new intuition that we see as immediately useful, so we postpone further explanation to Appendix A.
Example 2.7. (UQ - epistemic generator) In the UQ example (Example 2.4), when (Ω, F, P) is the product of an aleatory space $(\Omega_x, \mathcal{F}_x, P_x)$ and an epistemic space $(\Omega_f, \mathcal{F}_f, P_f)$, we can let θ be the projection $\theta(\omega_x, \omega_f) = \omega_f$. Alternatively, given only the space (Ω, F, P) where both x and f are defined, assume that f is a Gaussian process (or some other stochastic process for which the Karhunen-Loève theorem holds). Then there exists a sequence of deterministic functions φ₁, φ₂, ... and an infinite-dimensional variable $\theta = (\theta_1, \theta_2, \ldots)$ such that
$$f(x) = \sum_{i=1}^{\infty} \theta_i \varphi_i(x),$$
and we can let E be generated by θ.
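As a concrete instance of such an expansion, the sketch below uses the Karhunen-Loève expansion of standard Brownian motion on [0, 1], for which the coefficients are independent standard normal variables. The choice of process, the truncation at a finite number of terms, and the names `phi`, `sample_theta` and `f` are assumptions made here for illustration and are not taken from the paper.

```python
import math
import random

def phi(i, x):
    """i-th deterministic basis function (i = 1, 2, ...) in the Karhunen-Loeve
    expansion of standard Brownian motion on [0, 1]."""
    freq = (i - 0.5) * math.pi
    return math.sqrt(2.0) * math.sin(freq * x) / freq

def sample_theta(n_terms, rng):
    """Draw a truncated epistemic generator theta = (theta_1, ..., theta_n), i.i.d. N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_terms)]

def f(x, theta):
    """Evaluate f(x) = sum_i theta_i * phi_i(x); deterministic once theta is fixed."""
    return sum(t * phi(i, x) for i, t in enumerate(theta, start=1))

rng = random.Random(0)
theta = sample_theta(n_terms=50, rng=rng)

# With theta fixed, the whole path of f is known: no epistemic uncertainty remains.
xs = [k / 10 for k in range(11)]
path = [f(x, theta) for x in xs]
print(path)

# Sanity check on the truncation: the variance of f(1) approaches 1 as terms are added.
print(sum(phi(i, 1.0) ** 2 for i in range(1, 51)))
```

The point of the sketch is that all uncertainty about the function f sits in the coefficient vector θ, so observing θ would determine f completely, which is the role the text assigns to the epistemic generator.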
The decoupling of epistemic and aleatory uncertainty is then obtained by considering the joint variable (X, θ) instead of X alone, because
$$P(X, \theta) = P(X \mid \theta)\, P(\theta). \tag{2}$$
From (2) we see how the probability measure P becomes the product of the epistemic probability P(θ) and the aleatory probability P(X|θ) when applied to (X, θ). Given new information, I, we will update our beliefs about θ, P(θ) → P(θ|I), and we define the epistemic conditioning as follows:
$$P_{\text{new}}(X, \theta) = P(X \mid \theta)\, P(\theta \mid I). \tag{3}$$

Two types of assumptions
Consider the probability space (Ω, F, P), with epistemic sub σ-algebra E. Here E represents epistemic information, which is the information associated with assumptions. In other words, an epistemic event e ∈ E represents an assumption. In fact, given a class of assumptions, Remark 2.8 below shows why σ-algebras are appropriate structures.
Remark 2.8. Let E be a collection of assumptions. If e ∈ E, this means that it is possible to assume that e is true. If it is also possible to assume that e is false, then ē ∈ E as well. It may then also be natural to require that e₁, e₂ ∈ E ⇒ e₁ ∩ e₂ ∈ E, and so on. These are the defining properties of a σ-algebra.
For any random variable X defined on (Ω, F, P), when E is a sub σ-algebra of F, X|e for e ∈ E is well defined, and represents the random variable under the assumption e. In particular, given any fixed epistemic event e ∈ E we have a corresponding aleatory distribution P(X|e) over X, and the conditional P(X|E) is the random measure corresponding to P(X|e) when e is a random epistemic event in E. Here, the global probability measure P when applied to e, P(e), is the belief that e is true. In Section 2.4 we discussed updating the part of P associated with epistemic uncertainty. We also introduced the epistemic generator θ in order to associate the event e with an outcome θ(e), and make use of P(X|θ) in place of P(X|E). This provides a more intuitive interpretation of the assumptions that are measurable, i.e. those whose belief we may specify through P.
Of course, the measure P is also based on assumptions, for instance if we in Example 2.1 assume that Y follows a normal distribution. One could in principle specify a (measurable) space of probability distributions, of which the normal distribution is one example. Otherwise, we view the normality assumption as a structural assumption related to the probabilistic model for X, i.e. the measure P. These kinds of assumptions cannot be treated the same way as assumptions related to measurable events. For instance, the consequence of the complement assumption "Y does not follow a normal distribution" is not well defined.
In order to avoid any confusion, we split the assumptions into two types:
1. The measurable assumptions represented by the σ-algebra E, and
2. the set M of structural assumptions underlying the probability measure P.
This motivates the following definition.
Definition 2.9 (Structural assumptions). We let M denote the set of structural assumptions that defines a probability measure on (Ω, F), which we may write $P_M(\cdot)$ or $P(\cdot \mid M)$.
We may also refer to M as the non-measurable assumptions, to emphasize that M contains all the assumptions not covered by E. When there is no risk of confusion we will also suppress the dependency on M and just write P(·). Stating the set M explicitly is typically only relevant for scenarios where we consider changes being made to the actual system that is being modelled, or for evaluating different candidate models, e.g. through the marginal likelihood P(I|M). In practice one would also state M so that decision makers can determine their level of trust in the probabilistic model, and the appropriate level of caution when applying the model.
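For reference, the marginal likelihood mentioned above can be written out in terms of the epistemic generator θ from Section 2.4. The display below is the standard Bayesian identity, stated under the assumption that the likelihood P(I | θ, M) is well defined; the alternative set of structural assumptions M′ is hypothetical and appears only to illustrate how two candidate models might be compared.

```latex
P(I \mid M) = \int_{\Theta} P(I \mid \theta, M)\, \mathrm{d}P(\theta \mid M),
\qquad
\text{compare } M \text{ and } M' \text{ via the ratio } \frac{P(I \mid M)}{P(I \mid M')}.
```

Note that this comparison does not require assigning probabilities to M and M′ themselves, in keeping with their role as non-measurable assumptions.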
As we will see in the upcoming section, making changes to M and making changes to how $P_M$ acts on events in E are the two main ways in which we update a probabilistic digital twin.

The Probabilistic Digital Twin
The object that we will call a probabilistic digital twin, PDT for short, is a probabilistic model of a physical system. It is essentially a (possibly degenerate) probability distribution of a vector X, representing the relevant attributes of the system, but where we in addition require the specification of epistemic uncertainty (assumptions) and how this uncertainty may be updated given new information.
Before presenting the formal definition of a probabilistic digital twin, we start with an example showing why the identification of epistemic uncertainty is important.

Why distinguish between aleatory and epistemic uncertainty?
The decoupling of epistemic and aleatory uncertainty (as described in Section 2.4) is central in the PDT framework. There are two good reasons for doing this:
1. We want to make decisions that are robust with respect to epistemic uncertainty.
2. We want to study the effect of gathering information.
Item 1 relates to the observation that decision-theoretic approaches based on expectation may not be robust, that is, if we marginalize out the epistemic uncertainty and consider only $E_\theta[P(X \mid \theta)] = \int P(X \mid \theta)\, \mathrm{d}P_\theta$. We give two examples of this below, see Example 3.1 and Example 3.2.
Item 2 means that by considering the effect of information on epistemic uncertainty, we can evaluate the value of gathering information. This is discussed in further detail in Section 4.7.
Example 3.1. Continuing from the coin flip example (see Example 2.3), we let θ₁ = 0.5, θ₂ = 0.99. Assume that you are given the option to guess the outcome of X. If you guess correctly, you collect a reward of R = 10⁶ $, otherwise you have to pay L = 10⁶ $. A priori your belief about the bias of the coin is that P(θ = 0.5) = P(θ = 0.99) = 0.5. If you consider betting on X = 0, then the expected return, obtained by marginalizing over θ, becomes P(θ = 0.5)(0.5R − 0.5L) + P(θ = 0.99)(0.99R − 0.01L) = 490,000 $.
This is a scenario where decisions supported by taking the expectation with respect to epistemic uncertainty are not robust, as we believe that θ = 0.5 and θ = 0.99 are equally likely, and if θ = 0.5 we will lose 10⁶ $ 50% of the time by betting on X = 0.
Example 3.2. In structural reliability analysis, we are dealing with an unknown function g with the property that the event {y = g(x) < 0} corresponds to failure. When g is represented by a random function ĝ with epistemic uncertainty, the failure probability is also uncertain. In other words, if ĝ is epistemic then ĝ is a function of the generator θ. Hence, the failure probability is a function of θ. We want to make use of a conservative estimate of the failure probability, i.e., use a conservative value of θ. P(θ) tells us how conservative a given value of θ is.
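The arithmetic in the coin-flip bet above can be checked with a few lines of code. The sketch below reproduces the marginal expected return and contrasts it with the expected return conditional on each value of θ. The worst-case summary at the end is an illustrative robustness measure introduced here, not a criterion proposed in the text, and the function name `expected_return` is hypothetical.

```python
def expected_return(p_heads, reward, loss):
    """Expected return of betting on heads (X = 0) when P(X = 0) = p_heads."""
    return p_heads * reward - (1.0 - p_heads) * loss

prior = {0.5: 0.5, 0.99: 0.5}       # P(theta) from the coin-flip bet
R = L = 1_000_000                   # reward and loss in dollars

# Expected return for each fixed theta: the purely aleatory part of the problem.
conditional = {theta: expected_return(theta, R, L) for theta in prior}

# Marginalising over theta hides how different the two cases are.
marginal = sum(p * conditional[theta] for theta, p in prior.items())

# Illustrative robust summary (an assumption of this sketch): the least favourable theta.
worst_case = min(conditional.values())

print(conditional)   # roughly 0 for theta = 0.5 and 980,000 for theta = 0.99
print(marginal)      # roughly 490,000
print(worst_case)
```

Seen conditionally on θ, the bet is either a fair gamble with a possible loss of 10⁶ $ or a very favourable one, which the single marginal figure does not reveal.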

The attributes X
To define a PDT, we start by considering a vector X consisting of the attributes of some system. This means that X is a representation of the physical object or asset that we are interested in. In general, X describes the physical system. In addition, X must contain attributes related to any type of information that we want to make use of. For instance, if the information consists of observations, the relevant observable quantities, as well as attributes related to measurement errors or noise, may be included in X. In general, we will think of a model of a system as a set of assumptions that describes how the components of X are related or behave.
The canonical example here is where some physical quantity is inferred from observations including errors and noise, in which case a model of the physical quantity (physical system) is connected with a model of the data generating process (observational system). We are interested in modelling dependencies with associated uncertainty related to the components of X, and treat X as a random variable.
The attributes X characterise the state of the system and the processes that the PDT represents. X may for instance include:
• System parameters representing quantities that have a fixed, but possibly uncertain, value. For instance, these parameters may be related to the system configuration.
• System variables that may vary in time, and whose value may be uncertain.
• System events, i.e., the occurrence of defined state transitions.
In risk analysis, one is often concerned with risk measures given as quantified properties of X, usually in terms of expectations. For instance, if X contains some extreme value (e.g. the 100-year wave) or some specified event of failure (using a binary variable), the expectations of these may be compared against risk acceptance criteria to determine compliance.

The PDT definition
Based on the concepts introduced so far, we define the PDT as follows:

Definition 3.3. A probabilistic digital twin (PDT) is a triplet (X, A, I), where X is a vector of attributes of a system, A contains the assumptions needed to specify a probabilistic model, and I contains information regarding actions and observations:

A = ((Ω, F), E, M), where (Ω, F) is a measurable space on which X is measurable, and E is the sub-σ-algebra representing epistemic information. M contains the structural assumptions that define a probability measure $P_M$ on (Ω, F).
I is a set consisting of events of the form (d, o), where d encodes a description of the conditions under which the observation o was made, and where the likelihood P(o|X, d) is well defined. For brevity, we will write this likelihood as P(I|X) when I contains multiple events of this sort.
When M is understood, and there is no risk of confusion, we will drop stating the dependency on M explicitly and just refer to the probability space (Ω, F, P).
It is important to note that consistency between I and P(X) is required. That is, when using the probabilistic model for X, it should be possible to simulate the type of observations given by I. In this case the likelihood P(I|X) is well defined, and the epistemic updating of X can be obtained from Bayes' theorem.
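As a small illustration of epistemic updating via Bayes' theorem, the sketch below conditions a discrete belief about the coin bias θ on a set of observed outcomes. It is a minimal example under stated assumptions: θ takes only the two values used in the coin flip example, observations are independent given θ, and P(X = 0 | θ) = θ. The names `bayes_update` and `coin_likelihood` are hypothetical and not part of the PDT definition.

```python
def bayes_update(prior, likelihood, observations):
    """Posterior P(theta | observations) over a finite set of theta values.

    prior: dict mapping theta -> P(theta)
    likelihood: function (o, theta) -> P(o | theta)
    observations: iterable of outcomes, assumed independent given theta
    """
    unnormalised = {}
    for theta, p in prior.items():
        weight = p
        for o in observations:
            weight *= likelihood(o, theta)
        unnormalised[theta] = weight
    z = sum(unnormalised.values())  # normalising constant
    return {theta: w / z for theta, w in unnormalised.items()}


def coin_likelihood(o, theta):
    # Assumption: P(X = 0 | theta) = theta
    return theta if o == 0 else 1.0 - theta


prior = {0.5: 0.5, 0.99: 0.5}
posterior = bayes_update(prior, coin_likelihood, observations=[0, 0, 0])
print(posterior)
```

Here the information set I corresponds to three passive observations of the coin, and the posterior shifts belief towards θ = 0.99.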
Finally, we note that with this setup the information I may contain observations made under different conditions than what is currently specified through M. The information I is generally defined as a set of events, given as pairs (d, o), where d encodes the relevant action leading to observing o, as well as a description of the conditions under which o was observed. Here d may relate to modifications of the structural assumptions M, for instance if the causal relationships that describe the model of X under observation of o are not the same as what is currently represented by M. This is the scenario when we perform controlled experiments. Alternatively, (d, o) may represent a passive observation, e.g. d = "measurement taken from sensor 1 at time 01:02:03", o = 1.7 mm. We illustrate this in the following example.
[Figure 1: graphical model with nodes $x_1$, $x_2$, y and ε.]

We define a model M corresponding to where $p_{x_1}$ is a probability density depending on the parameter $\theta_1$ and $f(\cdot, \theta_2)$ is a deterministic function depending on the parameter $\theta_2$.
$\theta_1$ and $\theta_2$ are epistemic parameters for which we define a joint density $p_\theta$.
Assume that $I = \{(d^{(1)}, o^{(1)}), \ldots, (d^{(n)}, o^{(n)})\}$ is a set of controlled experiments, where and unknown i.i.d. $\varepsilon^{(i)} \sim p_\varepsilon$. In this scenario, regression is performed by updating the distribution $p_\theta$ to agree with the observations: where Z is a constant ensuring that the updated density integrates to one.
Note that the scenario with controlled experiments in Example 3.4 corresponds to a different model than the one in Figure 1. This is a familiar scenario in the study of causal inference, where actively setting the value of $x_1$ corresponds to the do-operator (see Pearl [11]), which breaks the link between $x_1$ and $x_2$.

Corroded pipeline example
To give a concrete example of a system where the PDT framework is relevant, we consider the following model from Agrell and Dahl [2]. This is based on a probabilistic structural reliability model which is recommended for engineering assessment of offshore pipelines with corrosion (DNV GL RP-F101 [12]). It is a model of a physical failure mechanism called pipeline burst, which may occur when the pipeline's ability to withstand the high internal pressure has been reduced as a consequence of corrosion. We give only a general overview of this model, and refer to [2, Example 4] for specific details regarding probability distributions etc. Later, in Section 5.6, we will revisit this example and make use of reinforcement learning to search for an optimal way of updating the PDT.
Figure 2 shows a graphical representation of the structural reliability model. Here, a steel pipeline is characterised by the outer diameter D, the wall thickness t and the ultimate tensile strength s. The pipeline contains a rectangular shaped defect with a given depth d and length l. Given a pipeline (D, t, s) with a defect (d, l), we can determine the pipeline's pressure resistance capacity (the maximum differential pressure the pipeline can withstand before bursting). We let $p_{FE}$ denote the capacity coming from a Finite Element simulation of the physical phenomenon. From the theoretical capacity $p_{FE}$, we model the true pipeline capacity $p_c$ as a function of $p_{FE}$ and $X_m$, where $X_m$ is the model discrepancy. For simplicity we have assumed that $X_m$ does not depend on the type of pipeline and defect, and we will also assume that $\sigma_m$ is fixed, and only the mean $\mu_m$ can be inferred from observations. Finally, given the pressure load $p_d$, the limit state representing the transition to failure is given as $g = p_c - p_d$, and the probability of failure is defined as P(g ≤ 0).

If we let X be the random vector containing all of the nodes in Figure 2, then X represents a probabilistic model of the physical system. In this example, we want to model some of the uncertainty related to the defect size, the model uncertainty, and the capacity as epistemic. We assume that the defect depth d has a fixed but unknown value that can be inferred through observations that include noise. Similarly, the model uncertainty $X_m$ can be determined from experiments. Uncertainty with respect to $p_{FE}$ comes from the fact that evaluating the true value of $p_{FE}|(D, t, s, d, l)$ involves a time-consuming numerical computation. Hence, $p_{FE}$ can only be known for a finite, and relatively small, set of input combinations. We can let $\hat{p}_{FE}$ denote a stochastic process that models our uncertainty about $p_{FE}$. To construct a PDT from X we will let $\hat{p}_{FE}$ take the place of $p_{FE}$, and specify that d, $\mu_m$ and $\hat{p}_{FE}$ are epistemic, i.e. $E = \sigma(d, \mu_m, \hat{p}_{FE})$.
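To indicate how a failure probability of the form P(g ≤ 0) might be estimated by simulation, the following Python sketch uses crude Monte Carlo sampling of the limit state g = p_c − p_d. Everything numerical here is a hypothetical placeholder and is not taken from [2] or DNV GL RP-F101: the distributions, the units, and in particular the multiplicative form p_c = X_m · p_FE are assumptions made only for illustration. The expensive Finite Element model is replaced by a random draw, and the epistemic quantities are held at fixed values.

```python
import random


def sample_limit_state(rng):
    """Draw one sample of g = p_c - p_d under hypothetical distributions."""
    p_fe = rng.gauss(20.0, 2.0)  # hypothetical stand-in for the FE capacity
    x_m = rng.gauss(1.0, 0.05)   # hypothetical model discrepancy, mean mu_m = 1.0
    p_c = x_m * p_fe             # assumed functional form of the true capacity
    p_d = rng.gauss(15.0, 1.5)   # hypothetical pressure load
    return p_c - p_d


def failure_probability(n_samples=50_000, seed=0):
    """Crude Monte Carlo estimate of P(g <= 0)."""
    rng = random.Random(seed)
    failures = sum(sample_limit_state(rng) <= 0.0 for _ in range(n_samples))
    return failures / n_samples


pf = failure_probability()
print(f"estimated failure probability: {pf:.4f}")
```

In a PDT, such an estimate would be conditional on the epistemic variables, so repeating the calculation for different values of d, µ_m or different realisations of the surrogate would give a distribution over failure probabilities rather than a single number.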
If we want a way to update the epistemic uncertainty based on observations, we also need to specify the relevant data generating process. In this example, we assume that there are three different ways of collecting data: As the defect measurements require specification of an additional random variable, we have to include ε or $(d/t)_{obs} = d/t + \varepsilon$ in X as part of the complete probabilistic model. This would then define a PDT where epistemic updating is possible.
The physical system that the PDT represents in this example is rarely viewed in isolation. For instance, the random variables representing the pipeline geometry and material are the result of uncertainty or variations in how the pipeline has been manufactured, installed and operated. And the size of the defect is the result of a chemical process, where scientific models are available. It could therefore be natural to view the PDT from this example as a component of a bigger PDT, where probabilistic models of the manufacturing, operating conditions and corrosion process etc. are connected. This form of modularity is often emphasized in the discussion of digital twins, and likewise for the kind of Bayesian network type of models considered in this example.

Sequential decision making
We now consider how the PDT framework may be adopted in real-world applications. As with any statistical model of this form, the relevant type of applications are related to prediction and inference. Since the PDT is supposed to provide a one-to-one correspondence (including uncertainty) with a real physical system, we are interested in using the PDT to understand the consequences of actions that we have the option to make. In particular, we will consider the discrete sequential decision making scenario, where we get the opportunity to make a decision, receive information related to the consequences of this decision, and use this to inform the next decision, and so on.
In this kind of scenario, we want to use the PDT to determine an action or policy for how to act optimally (with respect to some case-specific criterion). By a policy here we mean the instructions for how to select among multiple actions given the information available at each discrete time step. We describe this in more detail in Section 4.4 where we discuss how the PDT is used for planning. When we make use of the PDT in this way, we consider the PDT as a "mental model" of the real physical system, which an agent uses to evaluate the potential consequences of actions. The agent then decides on some action to make, observes the outcome, and updates her beliefs about the true system, as illustrated in Figure 3.

Figure 3: A PDT as a mental model of an agent taking actions in the real world. As new experience is gained, the PDT may be updated by changing the structural assumptions M that define the probability measure P, or by updating belief with respect to epistemic events through conditioning on the new set of information I. The changes in structural assumptions and epistemic information are represented by ∆M and ∆I respectively. As part of the planning process, the PDT may simulate possible scenarios, as indicated by the inner circle (labelled "Simulate and plan").
Whether the agent applies a policy or just an action (the first in the policy) before collecting information and updating the probabilistic model depends on the type of application at hand. In general it is better to update the model as often as possible, preferably between each action, but the actual computational time needed to perform this updating might make it impossible to achieve in practice.

Mathematical framework of sequential decision making
In this section, we briefly recap the mathematical framework of stochastic, sequential decision making in discrete time. We first recall the general framework, and in the following Section 4.2, we show how this relates to our definition of a PDT.
Let t = 0, 1, 2, …, N − 1 and consider a discrete time system where the state of the system, $\{x_t\}_{t \geq 1}$, is given by

$$x_{t+1} = f_t(x_t, u_t, w_t), \quad t = 0, 1, 2, \ldots, N-1. \tag{5}$$

Here, $x_t$ is the state of the system at time t, $u_t$ is a control and $w_t$ is a noise, or random parameter, at time t. Note that the control, $u_t$, is a decision which can be made by an agent (the controller) at time t. This control is to be chosen from a set of admissible controls $A_t$ (possibly, but not necessarily, depending on time). Also, $f_t$, t = 0, 1, 2, …, N − 1, are functions mapping from the space of state variables (state space), controls and noise into the set of possible states of $\{x_t\}_{t \geq 0}$. The precise structure of the state space, set of admissible controls and the random parameter space depends on the particular problem under consideration. Note that due to the randomness in $w_t$, t = 0, 1, 2, …, N − 1, the system state $x_t$ and control $u_t$, t = 1, 2, …, N − 1, also become random variables.
We remark that because of this, the state equation is sometimes written in the following form,

$$x_{t+1}(\omega) = f_t(x_t(\omega), u_t(\omega), \omega), \quad t = 0, 1, 2, \ldots, N-1, \tag{6}$$

where ω ∈ Ω is a scenario in a scenario space Ω (representing the randomness). Sometimes, the randomness is suppressed for notational convenience, so the state equation becomes $x_{t+1} = f_t(x_t, u_t)$, t = 0, 1, 2, …, N − 1.
Note that in the state equation (5) (alternatively, equation (6)), $x_{t+1}$ only depends on the previous time step, i.e., $x_t$, $u_t$, $w_t$. This is the Markov property (as long as we assume that the distribution of $w_t$ does not depend on past values of $w_s$, s = 0, 1, …, t − 1, only on $x_t$, $u_t$). That is, the next system state only depends on the previous one. Since this Markovian framework is what will be used throughout this paper as we move on to reinforcement learning for a probabilistic digital twin, we focus on this. However, we remark that there is a large theory of sequential decision making which is not based on this Markovianity. This theory is based around maximum principles instead of dynamic programming, see the following Section 4.3 for more on this.
The aim of the agent is to minimize a cost function under the state constraint (5) (or alternatively, (6)). We assume that this cost function is of the following, additive form,

$$E\Big[\sum_{t=0}^{N-1} h_t(x_t, u_t, w_t) + g(x_N)\Big], \tag{8}$$

where the expectation is taken with respect to an a priori given probability measure. That is, we sum over all instantaneous costs $h_t(x_t, u_t, w_t)$, t = 0, 1, …, N − 1, which depend on the state of the system, the control and the randomness, and add a terminal cost $g(x_N)$ which only depends on the system state at the terminal time t = N. This function is called the objective function.
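To make the roles of the state equation and the additive objective concrete, the following Python sketch simulates a controlled system forward in time and estimates the expected cost of a given policy by Monte Carlo. It is only an illustration: the linear dynamics, the quadratic costs, the standard normal noise and the two policies are hypothetical choices, and the helper names `simulate_cost` and `expected_cost` are not from the paper.

```python
import random


def simulate_cost(policy, f, h, g, x0, n_steps, rng):
    """Cost of one simulated trajectory: sum of h_t plus terminal cost g."""
    x, total = x0, 0.0
    for t in range(n_steps):
        u = policy(t, x)         # control chosen from the observed state
        w = rng.gauss(0.0, 1.0)  # noise w_t (assumed standard normal here)
        total += h(t, x, u, w)   # instantaneous cost
        x = f(t, x, u, w)        # state equation: x_{t+1} = f_t(x_t, u_t, w_t)
    return total + g(x)


def expected_cost(policy, f, h, g, x0, n_steps, n_runs=2000, seed=0):
    """Monte Carlo estimate of the expected additive cost."""
    rng = random.Random(seed)
    runs = (simulate_cost(policy, f, h, g, x0, n_steps, rng) for _ in range(n_runs))
    return sum(runs) / n_runs


# Hypothetical example: scalar linear dynamics with quadratic costs.
f = lambda t, x, u, w: x + u + 0.1 * w
h = lambda t, x, u, w: x**2 + 0.1 * u**2
g = lambda x: x**2

do_nothing = lambda t, x: 0.0
feedback = lambda t, x: -0.5 * x

cost_nothing = expected_cost(do_nothing, f, h, g, x0=1.0, n_steps=10)
cost_feedback = expected_cost(feedback, f, h, g, x0=1.0, n_steps=10)
print(cost_nothing, cost_feedback)
```

Comparing the two estimates shows how the objective ranks policies: in this toy setting the feedback policy, which steers the state towards zero, has the lower expected cost.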
Typically, we assume that the agent has full information in the sense that they can choose the control at time t based on (fully) observing the state process up until this time, but that they are not able to use any more information than this (future information, such as inside information). This problem formulation is very similar to that of a continuous time stochastic optimal control problem.
Remark 4.1. (A note on continuous time) This framework is parallel to that of stochastic optimal control in continuous time. The main differences in the continuous time case are that the state equation is typically a stochastic differential equation, and that the sum is replaced by an integral in the objective function. For a detailed introduction to continuous time control, see e.g., Øksendal [13].
Other versions of sequential decision making problems include inside information optimal control, partial information optimal control, infinite time horizon optimal control and control with various delay and memory effects. One can also consider problems where further constraints, either on the control or the state, are added to problem (8).
In Bertsekas [14], the sequential decision making problem (8) is studied via the dynamic programming algorithm. This algorithm is based on the Bellman optimality principle, which says that an optimal policy chosen at some initial time must be optimal when the problem is re-solved at a later stage given the state resulting from the initial choice.
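For a problem with finitely many states, controls and noise values, the dynamic programming algorithm can be written as a backward recursion over the value function. The Python sketch below is a generic illustration and not the formulation used in [14]; the function `backward_induction` and the toy problem (a position on the grid 0, …, 4 that should be kept near 2) are hypothetical.

```python
def backward_induction(states, controls, noise, f, h, g, n_steps):
    """Finite-horizon dynamic programming for a cost minimisation problem.

    noise: list of (w, probability) pairs.
    Returns (V, policy) with V[t][x] the optimal cost-to-go and
    policy[t][x] a minimising control.
    """
    V = [dict() for _ in range(n_steps + 1)]
    policy = [dict() for _ in range(n_steps)]
    for x in states:
        V[n_steps][x] = g(x)  # terminal cost
    for t in reversed(range(n_steps)):
        for x in states:
            best_u, best_q = None, float("inf")
            for u in controls:
                # Expected instantaneous cost plus cost-to-go (Bellman equation).
                q = sum(p * (h(t, x, u, w) + V[t + 1][f(t, x, u, w)])
                        for w, p in noise)
                if q < best_q:
                    best_u, best_q = u, q
            V[t][x] = best_q
            policy[t][x] = best_u
    return V, policy


# Hypothetical toy problem.
states = [0, 1, 2, 3, 4]
controls = [-1, 0, 1]
noise = [(-1, 0.25), (0, 0.5), (1, 0.25)]
f = lambda t, x, u, w: min(4, max(0, x + u + w))  # clipped random walk
h = lambda t, x, u, w: (x - 2) ** 2 + 0.1 * abs(u)
g = lambda x: (x - 2) ** 2

V, policy = backward_induction(states, controls, noise, f, h, g, n_steps=3)
print(policy[0])
```

The nested loops over states, controls and noise values are what make this approach expensive when the state space is large.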

Sequential decision making in the PDT
Now, we show how the sequential decision making framework from the previous section can be used to solve sequential decision making problems in the PDT.
We may apply this sequential decision making framework to our PDT by letting

$$x_t := X_t.$$

That is, the state process for the PDT sequential decision making problem is the random vector of attributes $X_t$. Note that in Definition 3.3, there is no time-dependency in the attributes X. However, since we are interested in considering sequential decision making in the PDT, we need to assume that there is some sort of development over time (or some indexed set, e.g. information) of the PDT.
Hence, the stochastic sequential decision making problem of the PDT-agent is to choose admissible controls $u_t$, t = 0, 1, 2, …, N − 1, in order to

$$\min_{\{u_t\}_{t \geq 0} \in \mathcal{A}} E\Big[\sum_{t=0}^{N-1} h_t(X_t, u_t, w_t) + g(X_N)\Big] \tag{9}$$

such that

$$X_{t+1} = f_t(X_t, u_t, w_t), \quad t = 0, 1, \ldots, N-1.$$

Here, the set of admissible controls $\mathcal{A}$, with $\{u_t\}_{t \geq 0} \in \mathcal{A}$, is problem specific. So are the functions $h_t$, $g$ and $f_t$ for t ≥ 0. Given a particular problem, these functions are defined based on the goal of the PDT-agent as well as the updating of the PDT given new input.

Two solution methods for sequential decision making
In the literature on (discrete time) stochastic sequential decision making, there are two main approaches:
• The dynamic programming principle (DPP).
• The Pontryagin maximum principle (MP).
The continuous time analogues are the Hamilton-Jacobi-Bellman equations (a particular kind of partial differential equation) and the stochastic maximum principle, respectively.
The DPP is based on the Bellman optimality principle, and the resulting Bellman equation which can be derived from this. Dynamic programming has a few important advantages. It is tractable from an algorithmic perspective because of the backward induction algorithm naturally resulting from the Bellman equation. Furthermore, the method is always well defined since it is based on working with the value function (the function that maps states to the optimal value given by (9), assuming that we start from the given state). However, there are some downsides to the DPP method as well. Firstly, DPP requires a Markovian framework (or that we can transform the problem to a Markovian framework). Also, the DPP requires that the Bellman equation holds. This may not be the case if we have problems with, for example, non-exponential discounting (with respect to time). In this case, we say that there are problems with time-inconsistency, see e.g., Rudloff et al. [15]. For instance, traditional risk measures such as value-at-risk (VaR) and conditional value-at-risk (CVaR) are time-inconsistent in this sense, see Cheridito and Stadje [16], and Artzner et al. [17] respectively. Hence, we run into time-inconsistency issues when e.g., minimizing the conditional value-at-risk of some financial position, if we are using the DPP method. Finally, we cannot have state constraints when using the DPP method, since this causes discontinuity of the value function, see Hao and Li [18].
The alternative approach to solving stochastic sequential decision making problems is via the Pontryagin maximum principle. This method does not require Markovianity or depend on the Bellman equation. Hence, there are no problems with time-inconsistency. However, the MP approach is less tractable from an algorithmic point of view. Furthermore, the MP approach requires existence of a minimizing (or maximizing) control. This may not be the case, since it is possible that only limiting control processes converging to the minimum (maximum) exist.
The pros and cons of dynamic programming and the maximum principle approach carry over to continuous time.
From a computational point of view, the dynamic programming method suffers from the curse of dimensionality. When doing numerical backward induction in the DPP, the objective function must be computed for each combination of values. This makes the method too computationally demanding to be applicable in practice for problems where the state space is large, see Agrell and Dahl [2] for a discussion of this. Until recently, numerical algorithms based on the maximum principle were not frequently studied in the literature; an exception is Bonnans [19]. The MP approach leads to systems of backward differential equations, in the continuous case, which are often computationally demanding and also less tractable from an algorithmic point of view than the DPP method. However, with the advances of machine learning over the past decade, some new approaches based on the MP approach using deep learning have been introduced, see Li et al. [20].
Actually, reinforcement learning (RL) is essentially the DPP method. Hence, RL algorithms also suffer from the curse of dimensionality, see Sutton and Barto [21]. This means that most RL algorithms become less efficient when the dimension of the state space increases. However, by using function approximation the curse of dimensionality can often be efficiently handled, see Arulkumaran et al. [22].
The purpose of this paper is to build on this literature by connecting deep reinforcement learning (so essentially, the dynamic programming method) to probabilistic digital twins in order to do planning with respect to the PDT. This is the topic of the following section.

Planning in the PDT
In this section, we discuss how the PDT can be used for planning. That is, how we use the PDT to identify an optimal policy, without acting in the real world, but by instead simulating what will happen in the real world given that the agent chooses specific actions (or controls, as they are called in the sequential decision making literature, see Section 4.1). We use the PDT as a tool to find a plan (policy), or a single action (first action of policy), to perform in the real world.
In order to solve our sequential decision making problem in the PDT, we have chosen to use a reinforcement learning formulation. As remarked in Section 4.3, this essentially corresponds to choosing the dynamic programming method for solving the optimal control problem (as opposed to a maximum principle approach). Because we will use a DPP approach, we need all the assumptions that come with this, see the discussion in Section 4.3:
• We need a Markovian framework, or the possibility of transforming the problem to something Markovian.
• We need the Bellman equation to hold in order to avoid issues with time-inconsistency. To ensure this, we for example need to use exponential discounting, and the conditional expectation of the state process must not enter the objective function in a non-linear way.
• Our planning problem cannot have state constraints.

Remark 4.2. Instead of using the DPP to solve the planning problem, we could use a maximum principle approach. One possible way of doing this in practice is by using one of the MP based algorithms found in Li et al. [20], instead of using reinforcement learning. By this alternative approach, we avoid the Markovianity requirement and possible time-inconsistency issues, and can allow for state constraints (via a Lagrange multiplier method, see e.g., Dahl and Stokkereit [23]). This topic is beyond the scope of this paper, but is a current work in progress.
Starting with an initial PDT as a digital representation of a physical system given our current knowledge, we assume that there are two ways to update the PDT:
1. Changing or updating the structural assumptions M, and hence the probability measure $P_M$.
2. Updating the information I.
The structural assumptions M are related to the probabilistic model for X. Recall from Section 2.4 that these assumptions define the probability measure $P_M$. Often, this probability measure is taken as given in stochastic modeling. However, in practice, probability measures are not given to us, but decided by analysts based on previous knowledge. Hence, the structural assumptions M may be updated because of new knowledge, external to the model, or for other reasons the analysts view as important.
Updating the information is our main concern in this paper, since this is related to the agent making costly decisions in order to gather more information. An update of the information also means (potentially) reducing the epistemic uncertainty in the PDT. Optimal information gathering in the PDT will be discussed in detail in the following Section 4.7.

MDP, POMDP and its relation to DPP
In this section, we briefly recall the definitions of Markov decision processes and partially observable Markov decision processes, and explain how these relate to the sequential decision making framework of Section 4.1.
Markov decision processes (MDPs) are discrete-time stochastic control processes of a specific form. An MDP is a tuple $(S, A, P_a, R_a)$, where S is a set of states (the state space) and A is a set of actions (the action space). Also,

$$P_a(s, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$$

is the probability of going from state s at time t to state s' at time t + 1 if we do action a at time t. Finally, $R_a(s, s')$ is the instantaneous reward of transitioning from state s at time t to state s' at time t + 1 by doing action a (at time t).
An MDP satisfies the Markov property, so given that the process is in state s and will be doing a at time t, the next state $s_{t+1}$ is conditionally independent of all other previous states and actions.
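As a minimal sketch of how the tuple (S, A, P_a, R_a) could be represented in code, the Python example below stores the transition probabilities and rewards in dictionaries and samples one transition. The class `MDP`, its `step` method and the two-state "ok"/"failed" component with actions "run" and "repair" are hypothetical and serve only to illustrate the definition.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list
    actions: list
    P: dict  # P[a][s][s_next] = transition probability P_a(s, s_next)
    R: dict  # R[a][(s, s_next)] = instantaneous reward R_a(s, s_next)

    def step(self, s, a, rng):
        """Sample the next state and return it with the associated reward."""
        next_states = list(self.P[a][s].keys())
        weights = list(self.P[a][s].values())
        s_next = rng.choices(next_states, weights=weights, k=1)[0]
        return s_next, self.R[a].get((s, s_next), 0.0)


# Hypothetical two-state component that can be run or repaired.
mdp = MDP(
    states=["ok", "failed"],
    actions=["run", "repair"],
    P={
        "run": {"ok": {"ok": 0.9, "failed": 0.1}, "failed": {"failed": 1.0}},
        "repair": {"ok": {"ok": 1.0}, "failed": {"ok": 1.0}},
    },
    R={
        "run": {("ok", "ok"): 1.0, ("ok", "failed"): -10.0, ("failed", "failed"): -1.0},
        "repair": {("ok", "ok"): -2.0, ("failed", "ok"): -2.0},
    },
)

rng = random.Random(0)
s_next, reward = mdp.step("ok", "run", rng)
print(s_next, reward)
```

The Markov property is reflected in the signature of `step`: the next state is sampled from the current state and action only.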

Remark 4.3. (MDP and DPP)
Note that this definition of an MDP is essentially the same as our DPP framework of Section 4.1. In the MDP notation, we say actions, while in the control notation, it is common to use the word control. In Section 4.1, we talked about instantaneous cost functions, but here we talk about instantaneous rewards. Since minimization and maximization problems are equivalent (since inf{•} = − sup{−•}), so are these two concepts. Furthermore, the definition of the transition probabilities $P_a$ in the MDP framework corresponds to the Markov assumption of the DPP method. In both frameworks, we talk about the system states, though in the DPP framework we model this directly via equation (5).
A generalization of MDPs is the class of partially observable Markov decision processes (POMDPs). While an MDP is a 4-tuple, a POMDP is a 6-tuple, $(S, A, P_a, R_a, \Omega, O)$.
Here (like before), S is the state space, A is the action space, $P_a$ gives the conditional transition probabilities between the different states in S and $R_a$ gives the instantaneous rewards of the transitions for a particular action a.
In addition, we have Ω, which is a set of observations. In contrast to the MDP framework, with POMDP, the agent no longer observes the state s directly, but only an observation o ∈ Ω. Furthermore, the agent knows O, which is a set of conditional observation probabilities. That is, $O(o \mid s', a)$ is the probability of observing o ∈ Ω given that we do action a from state s'.
The objective of the agent in the POMDP sequential decision problem is to choose a policy, that is actions at each time, in order to

$$\max \; E\Big[\sum_{t} \lambda^t r_t\Big], \tag{10}$$

where $r_t$ is the reward earned at time t (depending on $s_t$, $a_t$ and $s_{t+1}$), and λ ∈ [0, 1] is a number called the discount factor. The discount factor can be used to introduce a preference for immediate rewards as opposed to more distant rewards, which may be relevant for the problem at hand, or used just for numerical efficiency. Hence, the agent aims to maximize their expected discounted reward over all future times. Note that it is also possible to consider problem (10) over an infinite time horizon or with a separate terminal reward function as well. This is similar to the DPP sequential decision making framework of Section 4.1.
In order to solve a POMDP, it is necessary to include memory of past actions and observations. Actually, the inclusion of partial observations means that the problem is no longer Markovian. However, there is a way to Markovianize the POMDP by transforming the POMDP into a belief-state MDP. In this case, the agent summarizes all information about the past in a belief vector b(t), which is updated as time passes. See [24], Chapter 12.2.3 for details.
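The update of the belief vector can be sketched as a Bayes filter over a finite state space. The Python example below is a generic illustration under the common convention that the observation probability depends on the action taken and the state arrived at; it is not taken from [24]. The function `belief_update`, the dictionary layout and the "inspect" example are hypothetical.

```python
def belief_update(b, a, o, P, O):
    """One step of the belief-state update for a finite POMDP.

    b: dict state -> probability (current belief)
    P[a][s][s_next]: transition probabilities
    O[a][s_next][o]: probability of observing o after action a leads to s_next
    """
    unnormalised = {}
    for s_next in b:
        # Predict: probability of arriving in s_next under the current belief.
        predicted = sum(P[a][s].get(s_next, 0.0) * b[s] for s in b)
        # Correct: weight by the probability of the observation.
        unnormalised[s_next] = O[a][s_next].get(o, 0.0) * predicted
    z = sum(unnormalised.values())
    if z == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: w / z for s, w in unnormalised.items()}


# Hypothetical component that is inspected with an imperfect sensor.
P = {"inspect": {"ok": {"ok": 0.95, "failed": 0.05}, "failed": {"failed": 1.0}}}
O = {"inspect": {"ok": {"pass": 0.9, "alarm": 0.1},
                 "failed": {"pass": 0.2, "alarm": 0.8}}}

b = {"ok": 0.5, "failed": 0.5}
b_new = belief_update(b, "inspect", "alarm", P, O)
print(b_new)
```

Since the updated belief depends only on the previous belief, the action and the new observation, the process of beliefs is Markovian even though the underlying observations are not.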

MDP (and POMDP) in the PDT framework
In this section, we show how the probabilistic digital twin can be incorporated in a reinforcement learning framework, in order to solve sequential decision problems in the PDT.
In Section 4.2, we showed how we can use the mathematical framework of sequential decision making to solve optimal control problems for a PDT-agent. Also, in Section 4.5, we saw (in Remark 4.3) that the MDP (or POMDP in general) framework essentially corresponds to that of the DPP. In theory, we could use the sequential decision making framework and the DPP to solve optimal control problems in the PDT. However, due to the curse of dimensionality, this will typically not be practically tractable (see Section 4.3). In order to resolve this, we cast the PDT sequential decision making problem into a reinforcement learning framework, in particular an MDP framework. This will enable us to solve the PDT optimal control problem via deep reinforcement learning, in which there are suitable tools to overcome the curse of dimensionality.
To define a decision making process in the PDT as an MDP, we need to determine our state space, action space, (Markovian) transition probabilities and a reward function.
• The action space A: These are the possible actions within the PDT. These may depend on the problem at hand. In the next Section 4.7, we will discuss optimal information gathering, where the agent can choose between different types of experiments, at different costs, in order to gain more information. In this case, the action space is the set of possible decisions that the agent can choose between in order to attain more information.
• The state space S: We define a state as a PDT (or equivalently a version of a PDT that evolves in discrete time t = 0, 1, …). A PDT represents our belief about the current physical state of a system, and it is defined by some initial assumptions together with the information acquired through time. In practice, if the structural assumptions are not changed, we may let the information available at the current time represent a state. This means that our MDP will consist of belief-states, represented by information, from which inference about the true physical state can be made. This is a standard way of creating an MDP from a POMDP, so we can view the PDT state-space as a space of beliefs about some underlying partially observable physical state. Starting from a PDT, we define the state space as all updated PDTs we can reach by taking actions in the action space A.
• The transition probabilities $P_a$: Based on our chosen definition of the state space, the transition probabilities are the probabilities of going from one level of information to another, given the action chosen by the agent. For example, if the agent chooses to make decision (action) d, what is the probability of going from the current level of information to another (equal or better) level? This is given by epistemic conditioning of the PDT with respect to the given information set I = {(d, o)} based on the decision d and the new observation o. When it comes to updates of the structural assumptions M, we consider these as deterministic transitions.
• The reward $R_a$: The reward function, or equivalently, cost function, will depend on the specific problem at hand. To each action a ∈ A, we assume that we have an associated reward $R_a$. In the numerical examples in Section 5, we give specific examples of how these rewards can be defined.
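For the coin flip setting, the belief-state construction described in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in Section 5: the state is represented by the counts (n0, n1) of observed outcomes, which summarise the information I; the only action considered is a test flip; θ takes the two values from the coin flip example; and P(X = 0 | θ) = θ. The names `posterior` and `transition_probs` are hypothetical.

```python
PRIOR = {0.5: 0.5, 0.99: 0.5}  # epistemic belief about theta before any test


def posterior(state):
    """Belief about theta given the information state (n0, n1)."""
    n0, n1 = state
    unnormalised = {th: p * th**n0 * (1.0 - th)**n1 for th, p in PRIOR.items()}
    z = sum(unnormalised.values())
    return {th: w / z for th, w in unnormalised.items()}


def transition_probs(state):
    """Transition probabilities between information states for a test flip.

    The probability of each outcome is the posterior predictive probability,
    i.e. the outcome probability averaged over the current belief about theta.
    """
    n0, n1 = state
    p0 = sum(p * th for th, p in posterior(state).items())
    return {(n0 + 1, n1): p0, (n0, n1 + 1): 1.0 - p0}


probs = transition_probs((0, 0))
print(probs)
```

Because the next information state depends only on the current counts and the sampled outcome, adding observations to I in this way preserves the Markov property.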
As mentioned in Section 4.4, there are two ways to update the PDT: updating the structural assumptions M and updating the information I. If we update the PDT by (only) adding to the information set I, we always have the Markov property.
If we also update M, then the preservation of the Markov property is not given. In this case, using a maximum principle deep learning algorithm instead of the DPP based deep RL is a possibility, see [20].
Remark 4.4. Note that in the case where we have a very simple PDT with only discrete variables and only a few actions, the RL approach is not necessary. In this case, the DPP method as done in traditional optimal control works well, and we can apply a planning algorithm to the PDT in order to derive an optimal policy. However, in general, the state-action space of the PDT will be too large for this. Hence, traditional planning algorithms, and even regular RL, may not be feasible due to the curse of dimensionality. In this paper, we will consider deep reinforcement learning as an approach to deal with this. We discuss this further in Section 5.
Note that what determines an optimal action or policy will of course depend on what objective the outcomes are measured against. That is, what do we want to achieve in the real world? There are many different objectives we could consider. In the following we present one generic objective related to optimal information gathering, where the PDT framework is suitable.

Optimal information gathering
A generic, but relevant, objective in optimal sequential decision making is simply to "improve itself". That is, to reduce epistemic uncertainty with respect to some quantity of interest. Another option is to consider maximizing the Kullback-Leibler divergence with respect to epistemic uncertainty as a general objective. This would mean that we aim to collect the information that "will surprise us the most".
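One possible reading of the Kullback-Leibler objective is the expected divergence between the updated and the current epistemic distribution, averaged over the outcomes that the PDT predicts. The Python sketch below computes this quantity for a single test flip in the coin flip example. It is an assumption-laden illustration: θ takes two values, P(X = 0 | θ) = θ, and the function names are hypothetical.

```python
import math


def posterior_and_evidence(prior, likelihood, o):
    """Return the posterior over theta and the predictive probability of o."""
    unnormalised = {th: p * likelihood(o, th) for th, p in prior.items()}
    z = sum(unnormalised.values())
    if z == 0.0:
        return None, 0.0
    return {th: w / z for th, w in unnormalised.items()}, z


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0.0)


def expected_information_gain(prior, likelihood, outcomes):
    """Expected KL divergence from prior to posterior over possible outcomes."""
    gain = 0.0
    for o in outcomes:
        post, p_o = posterior_and_evidence(prior, likelihood, o)
        if p_o > 0.0:
            gain += p_o * kl_divergence(post, prior)
    return gain


def coin_likelihood(o, theta):
    # Assumption: P(X = 0 | theta) = theta
    return theta if o == 0 else 1.0 - theta


prior = {0.5: 0.5, 0.99: 0.5}
gain = expected_information_gain(prior, coin_likelihood, outcomes=[0, 1])
print(f"expected information gain of one test flip: {gain:.4f} nats")
```

An information gathering policy could then prefer, among the available experiments, the one with the largest expected gain per unit cost, although other trade-offs between information and cost are equally conceivable.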
By definition, a PDT contains an observational model related to the data generating process (the epistemic conditioning relies on this).This means that we can simulate the effect of gathering information, and we can study how to do this optimally.In order to define what we mean by an optimal strategy for gathering information, we then have to specify the following, • Objective: What we need the information for.
For example, what kind of decision do we intend to support using the PDT?Is it something we want to estimate?What is the required accuracy needed?For instance, we might want to reduce epistemic uncertainty with respect to some quantity, e.g., a risk metric such as a failure probability, expected extreme values etc.
• Cost: The cost related to the relevant information-gathering activities.
Then, from the PDT together with a specified objective and cost, one alternative is to define the optimal strategy as the strategy that minimizes the (discounted) expected cost needed to achieve the objective (or equivalently achieves the objective while maximizing reward).
Example 4.5. (Coin flip - information gathering) Continuing from Example 3.1, imagine that before making your final bet, you can flip the coin as many times as you like in order to learn about $\theta$. Each of these test flips will cost 10.000 $. You also get the opportunity to replace the coin with a new one, at the cost of 100.000 $.
An interesting problem is now how to select an optimal strategy for when to test, bet or replace in this game. And will such a strategy be robust? What if there is a limit on the total number of actions that can be performed? In Section 5.5 we illustrate how reinforcement learning can be applied to study this problem, where the coin represents a component with reliability $\theta$, that we may test, use or replace.

Deep Reinforcement Learning with PDTs
In this section we give an example of how reinforcement learning can be used for planning, i.e. finding an optimal action or policy, with a PDT. The reinforcement learning paradigm is especially relevant for problems where the state and/or action space is large, or dynamical models where specific transition probabilities are not easily attainable but where efficient sampling is still feasible. In probabilistic modelling of complex physical phenomena, we often find ourselves in this kind of setting.

Reinforcement Learning (RL)
Reinforcement learning, in short, aims to optimize sequential decision problems through sampling from an MDP (Sutton and Barto [21]). We think of this as an agent taking actions within an environment, following some policy $\pi(a|s)$, which gives the probability of taking action $a$ if the agent is currently at state $s$. Generally, $\pi(a|s)$ represents a (possibly degenerate) probability distribution over actions $a \in A$ for each $s \in S$. The agent's objective is to maximize the amount of reward it receives over time, and a policy $\pi$ that achieves this is called an optimal policy.
Given a policy $\pi$ we can define the value of a state $s \in S$ as
$$v_\pi(s) = E\left[\sum_{t=0}^{T} \lambda^t r_t \,\middle|\, s_0 = s\right], \tag{11}$$
where $r_t$ is the reward earned at time $t$ (depending on $s_t$, $a_t$ and $s_{t+1}$), given that the agent follows policy $\pi$ starting from $s_0 = s$. That is, for $P_a$ and $R_a$ given by the MDP, $a_t \sim \pi(a_t | s_t)$, $s_{t+1} \sim P_{a_t}(s_t, s_{t+1})$ and $r_t \sim R_{a_t}(s_t, s_{t+1})$. Here we make use of a discount factor $\lambda \in [0, 1]$ in the definition of cumulative reward. If we want to consider $T = \infty$ (continuing tasks) instead of $T < \infty$ (episodic tasks), then $\lambda < 1$ is generally necessary. The optimal value function is defined as the one that maximises (11) over all policies $\pi$. The optimal action at each state $s \in S$ then corresponds to acting greedily with respect to this value function, i.e. selecting the action $a_t$ that in expectation maximises the value of $s_{t+1}$. Likewise, it is common to define the action-value function $q_\pi(s, a)$, which corresponds to the expected cumulative return of first taking action $a$ in state $s$ and following $\pi$ thereafter. RL generally involves some form of Monte Carlo simulation, where a large number of episodes are sampled from the MDP, with the goal of estimating or approximating the optimal value of states, state-action pairs, or an optimal policy directly.
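As a small illustration of the Monte Carlo idea described above, the following sketch estimates the value of a state as the average discounted cumulative reward over sampled episodes. The function names, the `sample_episode` interface and the toy reward model are assumptions made for this illustration only; they are not part of the PDT framework or of any particular RL library.

```python
import random

def discounted_return(rewards, lam):
    """Return sum_t lam**t * r_t for the rewards of one sampled episode."""
    return sum((lam ** t) * r for t, r in enumerate(rewards))

def mc_state_value(sample_episode, s0, lam, n_episodes=10_000, seed=0):
    """Monte Carlo estimate of v_pi(s0): average discounted return over episodes.

    sample_episode(s0, rng) is a hypothetical user-supplied function that
    simulates one episode under the policy pi and returns its list of rewards.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        total += discounted_return(sample_episode(s0, rng), lam)
    return total / n_episodes

def toy_episode(s0, rng, horizon=3):
    """Toy stand-in for an MDP: each step pays 1 with probability 0.5, else 0."""
    return [1.0 if rng.random() < 0.5 else 0.0 for _ in range(horizon)]

estimate = mc_state_value(toy_episode, s0=None, lam=0.9)
# Exact value for the toy model: 0.5 * (1 + 0.9 + 0.81) = 1.355
```

In a PDT setting, `sample_episode` would be replaced by a simulation of the digital twin under the policy of interest; the averaging step stays the same.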
Theoretically this is essentially equivalent to the DPP framework, but with RL we are mostly concerned with problems where optimal solutions cannot be found and some form of approximation is needed. By the use of flexible approximation methods combined with adaptive sampling strategies, RL makes it possible to deal with large and complex state and action spaces.

Function approximation
One way of using function approximation in RL is to define a parametric function $v(s, w) \approx v_\pi(s)$, given by a set of weights $w \in \mathbb{R}^d$, and try to learn the value function of an optimal policy by finding an appropriate value for $w$. Alternatively, we could approximate the value of a state-action pair, $q(s, a, w) \approx q_\pi(s, a)$, or a policy $\pi(a|s, w) \approx \pi(a|s)$. The general goal is then to optimize $w$, using data generated by sampling from the MDP, and the RL literature contains many different algorithms designed for this purpose. In the case where a neural network is used for function approximation, it is often referred to as deep reinforcement learning. One alternative, which we will make use of in an example later on, is the deep Q-learning (DQN) approach as introduced by van Hasselt et al. [25], which represents the value of a set of $m$ actions at a state $s$ using a multi-layered neural network
$$q(s, w) : S \to \mathbb{R}^m. \tag{12}$$
Note here that $q(s, w)$ is a function defined on the state space $S$. In general, any approximation of the value functions $v$ or $q$, or the policy $\pi$, is defined on $S$ or $S \times A$. A question that then arises is how we can define parametric functions on the state space $S$ when we are dealing with PDTs. We can assume that we have control over the set of admissible actions $A$, in the sense that this is something we define, and creating parametric functions defined on $A$ should not be a problem. But as discussed in Section 4.6, $S$ will consist of belief-states.
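To make the idea of a parametric action-value function concrete, here is a minimal sketch in which a linear approximator stands in for the multi-layered neural network, and a single semi-gradient Q-learning update stands in for the full DQN algorithm (no target network, no experience replay). The feature vector `x`, the function names and the numerical values are all assumptions for illustration.

```python
def q_values(x, w):
    """Return [q(s, a_1, w), ..., q(s, a_m, w)] for a state with feature vector x.

    w is a list of m weight vectors (one per action), so q is linear in x.
    """
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def greedy_action(x, w):
    """Index of the action with the largest approximate action-value."""
    q = q_values(x, w)
    return max(range(len(q)), key=lambda a: q[a])

def q_learning_update(w, x, a, r, x_next, lam, alpha, terminal):
    """One semi-gradient Q-learning step on w (in place); returns the TD error."""
    target = r if terminal else r + lam * max(q_values(x_next, w))
    td_error = target - q_values(x, w)[a]
    # For a linear q, the gradient with respect to w[a] is simply x.
    w[a] = [wi + alpha * td_error * xi for wi, xi in zip(w[a], x)]
    return td_error

w = [[0.0, 0.0], [0.0, 0.0]]  # m = 2 actions, 2 state features
err = q_learning_update(w, x=[1.0, 0.0], a=1, r=1.0,
                        x_next=[0.0, 1.0], lam=0.9, alpha=0.5, terminal=True)
```

The same structure carries over when the linear map is replaced by a neural network: the network outputs one value per action, and the greedy policy picks the index of the largest output.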

Defining the state space
We are interested in an MDP where the transition probabilities $P_a(s, s')$ correspond to updating a PDT as a consequence of action $a$. In that sense, $s$ and $s'$ are PDTs. Given a well-defined set of admissible actions, the state space $S$ is then the set of all PDTs that can be obtained starting from some initial state $s_0$, within some defined horizon.
Recall that going from $s$ to $s'$ then means keeping track of any changes made to the structural assumptions $M$ and the information $I$, as illustrated in Figure 3. From now on, we will for simplicity assume that updating the PDT only involves epistemic conditioning with respect to the information $I$. This is a rather generic situation. Also, finding a way to represent changes in $M$ will have to be handled for the specific use case under consideration. Assuming some initial PDT $s_0$ is given, any state $s_t$ at a later time $t$ is then uniquely defined by the set of information $I_t$ available at time $t$. Representing states by information in this way is something that is often done to transform a POMDP to an MDP. That is, although the true state $s_t$ at time $t$ is unknown in a POMDP, the information $I_t$, and consequently our belief about $s_t$, is always known at time $t$. Inspired by the POMDP terminology, we may therefore view a PDT as a belief-state, which seems natural as the PDT is essentially a way to encode our beliefs about some real physical system.
Hence, we will proceed with defining the state space $S$ as the information state-space, which is the set of all sets of information $I$. Although this is a very generic approach, we will show that there is a way of defining a flexible parametric class of functions on $S$. But we must emphasize that if there are other ways of compressing the information $I$, for instance due to conjugacy in the epistemic updating, then this is probably much more efficient. Example 5.1 below shows exactly what we mean by this.
Example 5.1. (Coin flip - information state-space) In the coin flip example (Example 2.3), all of our belief with respect to epistemic uncertainty is represented by the number $\psi = P(\theta = \theta_1)$. Given some observation $Y = y \in \{0, 1\}$, the epistemic conditioning corresponds to
$$\psi \to \frac{\beta_1(y)\,\psi}{\beta_1(y)\,\psi + \beta_2(y)\,(1 - \psi)},$$
where, for $j = 1, 2$, $\beta_j(y) = \theta_j$ if $y = 0$ and $\beta_j(y) = 1 - \theta_j$ if $y = 1$.
In this example, the information state-space consists of all sets of the form $I_t = \{y_1, \ldots, y_t\}$ where each $y_i$ is binary. However, if the goal is to let $I_t$ be the representation of a PDT, we could just as well use $\psi_t$, i.e. define $S = [0, 1]$ as the state space. Alternatively, the number of heads and tails (0s and 1s) provides the same information, so we could also make use of $S = \{0, \ldots, N\} \times \{0, \ldots, N\}$ where $N$ is an upper limit on the total number of flips we consider.
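The equivalence of the three state representations in this example can be checked numerically. The sketch below applies Bayes' rule with the likelihoods $\beta_j(y)$ one observation at a time, and compares the result with a closed-form update that only uses the counts of 0s and 1s. The function names and the values $\theta_1 = 0.5$, $\theta_2 = 0.99$, $\psi_0 = 0.5$ are assumptions chosen for illustration.

```python
def update_psi(psi, y, theta1, theta2):
    """One step of epistemic conditioning for psi = P(theta = theta1), y in {0, 1}."""
    beta1 = theta1 if y == 0 else 1.0 - theta1
    beta2 = theta2 if y == 0 else 1.0 - theta2
    return beta1 * psi / (beta1 * psi + beta2 * (1.0 - psi))

def psi_from_observations(psi0, ys, theta1, theta2):
    """Sequential update over the information set I_t = {y_1, ..., y_t}."""
    psi = psi0
    for y in ys:
        psi = update_psi(psi, y, theta1, theta2)
    return psi

def psi_from_counts(psi0, n0, n1, theta1, theta2):
    """Same posterior, computed only from the number of 0s (n0) and 1s (n1)."""
    like1 = theta1 ** n0 * (1.0 - theta1) ** n1
    like2 = theta2 ** n0 * (1.0 - theta2) ** n1
    return psi0 * like1 / (psi0 * like1 + (1.0 - psi0) * like2)

ys = [0, 1, 0, 0]
psi_seq = psi_from_observations(0.5, ys, 0.5, 0.99)
psi_perm = psi_from_observations(0.5, [0, 0, 0, 1], 0.5, 0.99)
psi_cnt = psi_from_counts(0.5, n0=3, n1=1, theta1=0.5, theta2=0.99)
```

All three quantities agree up to floating-point rounding, which is why the order of the observations, and hence the full set $I_t$, carries no information beyond the counts in this example.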

Deep learning on the information state-space
Let $S$ be a set of sets $I \subset \mathbb{R}^d$. We will assume that each set $I \in S$ consists of a finite number of elements $y \in \mathbb{R}^d$, but we do not require that all sets $I$ have the same size. We are interested in functions defined on $S$.
An important property of any function $f$ that takes a set $I$ as input is permutation invariance, i.e. $f(\{y_1, \ldots, y_N\}) = f(\{y_{\kappa(1)}, \ldots, y_{\kappa(N)}\})$ for any permutation $\kappa$. It can be shown that, under fairly mild assumptions, such functions have the following decomposition:
$$f(I) = \rho\left(\sum_{y \in I} \phi(y)\right). \tag{13}$$
These sum decompositions were studied by Zaheer et al. [26] and later by Wagstaff et al. [27], who showed that if $|I| \le p$ for all $I \in S$, then any continuous function $f : S \to \mathbb{R}$ can be written as (13) for some suitable functions $\phi : \mathbb{R}^d \to \mathbb{R}^p$ and $\rho : \mathbb{R}^p \to \mathbb{R}$. The motivation in [26,27] was to enable supervised learning of permutation invariant and set-valued functions, by replacing $\rho$ and $\phi$ with flexible function approximators, such as Gaussian processes or neural networks. Other forms of decomposition, obtained by replacing the summation in (13) with something else that can be learned, have also been considered by Soelch et al. [28]. For reinforcement learning, we will make use of the form (13) to represent functions defined on the information state-space $S$, such as $v(s, w)$, $q(s, a, w)$, or $\pi(a|s, w)$, using a neural network with parameter $w$. In the remaining part of this paper we present two examples showing how this works in practice.
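The sum decomposition can be sketched in a few lines. In the following illustration, $\phi$ is a single random tanh layer and $\rho$ is a random linear map; both are untrained stand-ins for the neural networks one would actually fit, and all names, sizes and seeds are assumptions. The point is only that a function of the form $\rho(\sum_{y \in I} \phi(y))$ accepts sets of any finite size and does not depend on the order of the elements.

```python
import math
import random

def make_phi(d, p, seed=0):
    """Random one-layer feature map phi: R^d -> R^p (untrained, for illustration)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(p)]
    b = [rng.gauss(0.0, 1.0) for _ in range(p)]
    def phi(y):
        return [math.tanh(sum(wij * yj for wij, yj in zip(row, y)) + bi)
                for row, bi in zip(W, b)]
    return phi

def make_rho(p, seed=1):
    """Random linear read-out rho: R^p -> R (untrained, for illustration)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    def rho(z):
        return sum(vi * zi for vi, zi in zip(v, z))
    return rho

def set_function(I, phi, rho, p):
    """Evaluate f(I) = rho(sum of phi(y) over y in I) for a finite set I."""
    z = [0.0] * p
    for y in I:
        z = [zi + ei for zi, ei in zip(z, phi(y))]
    return rho(z)

d, p = 2, 4
phi, rho = make_phi(d, p), make_rho(p)
I = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
f_original = set_function(I, phi, rho, p)
f_permuted = set_function(list(reversed(I)), phi, rho, p)
f_smaller = set_function(I[:2], phi, rho, p)  # sets of different size are fine
```

Here `f_original` and `f_permuted` agree up to floating-point rounding. For an action-value function with several actions, `rho` would return one value per action instead of a scalar.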

The "coin flip" example
Throughout this paper we have presented a series of small examples involving a biased coin, represented by X = (Y, θ).In Example 4.5 we ended by introducing a game where the player has to select whether to bet on, test or replace the coin.As a simple illustration we will show how reinforcement learning can be applied in this setting.
But now, we will imagine that the coin $Y$ represents a component in some physical system, where $Y = 0$ corresponds to the component functioning and $Y = 1$ represents failure. The probability $P(Y = 1) = 1 - \theta$ is then the component's failure probability, and we say that $\theta$ is the reliability.
For simplicity we assume that θ ∈ {0.5, 0.99}, and that our initial belief is P (θ = 0.5) = 0.5.That is, when we buy a new component, there is a 50 % chance of getting a "bad" component (that fails 50 % of the time), and consequently a 50 % probability of getting a "good" component (that fails 1 % of the time).
We consider a project going over $N = 10$ days. Each day we will decide between one of the following 4 actions: terminate the project (action 0), test the component by running an experiment (action 1), replace the component (action 2), or use the component (action 3). We will find a deterministic policy $\pi : S \to A$ that maps from the information state-space to one of the four actions. The information state-space $S$ is here represented by the number of days left of the project, $n = N - t$, and the set $I_t$ of observations of the component that is currently in use at time $t$. If we let $S_Y$ contain all sets of the form $I = \{Y_1, \ldots, Y_t\}$, for $Y_i \in \{0, 1\}$ and $t < N$, then
$$S = S_Y \times \{1, \ldots, N\} \tag{14}$$
represents the information state-space. In this example we made use of the deep Q-learning (DQN) approach described by van Hasselt et al. [25], where we define a neural network $q(s, w) : S \to \mathbb{R}^4$, that represents the action-value of each of the four actions. The optimal policy is then obtained by selecting, at each state $s$, the action corresponding to the maximal component of $q$.
We start by finding a policy that optimizes the cumulative reward over the 10 days (without discounting). As it turns out, this policy prefers to "gamble" that the component works rather than performing tests. In the case where the starting component is reliable (which happens 50 % of the time), a high reward can be obtained by selecting action 3 at every opportunity. The general "idea" with this policy is that if action 3 results in failure, the following action is to replace the component (action 2), unless there are few days left of the project, in which case action 0 is selected. We call this the "unconstrained" policy.
Although the unconstrained policy gives the largest expected reward, there is an approximately 50 % chance that it will produce a failure, i.e. that action 3 is selected with $Y = 1$ as the resulting outcome. One way to reduce this failure probability is to introduce the constraint that action 3 (using the component) is not allowed unless we have a certain level of confidence that the component is reliable. We introduced this type of constraint by requiring that $P(\theta = 0.99) > 0.9$ (a constraint on epistemic uncertainty). The optimal policy under this constraint will start with running experiments (action 1), before deciding whether to replace (action 2), use the component (action 3), or terminate the project (action 0). Figure 4 shows a histogram of the cumulative reward over 1000 simulated episodes, for the constrained and unconstrained policies obtained by RL, together with a completely random policy for comparison.
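To get a feeling for what the constraint $P(\theta = 0.99) > 0.9$ demands, the following sketch computes the posterior probability of a "good" component after a number of test outcomes, and the smallest number of consecutive non-failing tests needed before action 3 is allowed. It assumes, as an interpretation of the setup above, that each test is an independent observation of $Y$ with $P(Y = 0 \mid \theta) = \theta$; the function names are hypothetical.

```python
def posterior_good(k_ok, k_fail, prior_good=0.5, theta_good=0.99, theta_bad=0.5):
    """P(theta = theta_good | k_ok observations Y = 0 and k_fail observations Y = 1)."""
    like_good = theta_good ** k_ok * (1.0 - theta_good) ** k_fail
    like_bad = theta_bad ** k_ok * (1.0 - theta_bad) ** k_fail
    num = prior_good * like_good
    return num / (num + (1.0 - prior_good) * like_bad)

def min_clean_tests(threshold=0.9, max_tests=10):
    """Smallest number of failure-free tests giving posterior_good > threshold."""
    for k in range(max_tests + 1):
        if posterior_good(k, 0) > threshold:
            return k
    return None  # constraint not reachable within max_tests

n_required = min_clean_tests()
after_three = posterior_good(3, 0)      # about 0.886, still below 0.9
after_failure = posterior_good(4, 1)    # one observed failure pulls it far down
```

Under these assumptions, four failure-free tests are needed on a new component, which is consistent with the constrained policy starting with a run of experiments.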
In this example, the information state-space could also be defined in a simpler way, as explained in Example 5.1. As a result the reinforcement learning task will be simplified. Using the different state-space representations, we obtained the same results shown in Figure 4. Finally, we should note that in the case where defining the state space as in (14) is necessary, the constraint $P(\theta = 0.99) > 0.9$ is not practical. That is, if we could estimate this probability efficiently, then we would also have access to the compressed information state-space. One alternative could then be to instead consider the uncertain failure probability $p_f(\theta) = P(Y = 1 \mid \theta)$, and set a limit on e.g. $E[p_f] + 2 \cdot \mathrm{Std}(p_f)$. This is the approach taken in the following example concerning failure probability estimation.

Corroded pipeline example
Here we revisit the corroded pipeline example from Agrell and Dahl [2] which we introduced in Section 3.4. In this example, we have specified epistemic uncertainty with respect to model discrepancy, the size of a defect, and the capacity $p_{\mathrm{FE}}$ coming from a Finite Element simulation. If we let $\theta$ be the epistemic generator, we can write the failure probability conditioned on epistemic information as $p_f(\theta) = P(g \le 0 \mid \theta)$. In [2] the following objective was considered: Determine with confidence whether $p_f(\theta) < 10^{-3}$. That is, when we consider $p_f$ as a purely epistemic random variable, we want to either confirm that the failure probability is less than the target $10^{-3}$ (in which case we can continue operations as normal), or to detect with confidence that the target is exceeded (and we have to intervene). We will say that the objective is achieved if we obtain either $E[p_f] + 2 \cdot \mathrm{Std}(p_f) < 10^{-3}$ or $E[p_f] - 2 \cdot \mathrm{Std}(p_f) > 10^{-3}$ (where $E[p_f]$ and $\mathrm{Std}(p_f)$ can be efficiently approximated using the method developed in [2]). There are three ways in which we can reduce epistemic uncertainty:
1. Defect measurement: Noise perturbed measurement that reduces uncertainty in the defect size $d$.
2. Computer experiment: Evaluate $p_{\mathrm{FE}}$ at some selected input $(D, t, s, d, l)$, to reduce uncertainty in the surrogate $\hat{p}_{\mathrm{FE}}$ used to approximate $p_{\mathrm{FE}}$.
3. Lab experiment: Obtain one observation of $X_m$, which reduces uncertainty in $\mu_m$.
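The stopping criterion stated before the list above is simple enough to write down directly. The sketch below is one possible encoding; the function name and the returned labels are assumptions, and the inputs `mean_pf` and `std_pf` stand for approximations of $E[p_f]$ and $\mathrm{Std}(p_f)$ that would come from the method in [2], which is not reproduced here.

```python
def objective_status(mean_pf, std_pf, target=1e-3, k=2.0):
    """Classify the epistemic state of p_f relative to the target failure probability.

    Returns "below_target" if E[p_f] + k*Std(p_f) < target,
            "above_target" if E[p_f] - k*Std(p_f) > target,
            "undecided"    otherwise (more information is needed).
    """
    if mean_pf + k * std_pf < target:
        return "below_target"
    if mean_pf - k * std_pf > target:
        return "above_target"
    return "undecided"

status_a = objective_status(2e-4, 1e-4)   # 4e-4 < 1e-3
status_b = objective_status(5e-3, 1e-3)   # 3e-3 > 1e-3
status_c = objective_status(1e-3, 5e-4)   # interval straddles the target
```

An episode in the RL formulation would end as soon as the status is no longer `"undecided"`.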
The set of information corresponding to defect measurements is $I_{\mathrm{Measure}} \subset \mathbb{R}$, as each measurement is a real valued number. Similarly, $I_{\mathrm{Lab}} \subset \mathbb{R}$ as well, and $I_{\mathrm{FE}} \subset \mathbb{R}^6$ when we consider a vector $y \in \mathbb{R}^6$ as an experiment $[D, t, s, d, l, p_{\mathrm{FE}}]$. Actually, in this example we may exploit some conjugacy in the representation of $I_{\mathrm{Measure}}$ and $I_{\mathrm{Lab}}$ as discussed in Example 5.1 (see [2] for details), so we can define the information state-space as $S = S_{\mathrm{FE}} \times \mathbb{R}^2$, where $S_{\mathrm{FE}}$ consists of finite subsets of $\mathbb{R}^6$.
We will use RL to determine which of the three types of experiment to perform, and define the action space $A = \{\text{Measurement}, \text{FE}, \text{Lab}\}$. Note that when we decide to run a computer experiment, we also have to specify the input $(D, t, s, d, l)$. This is a separate decision making problem regarding design of experiments. For this we make use of the myopic (one-step lookahead) method developed in [2], although one could in principle use RL for this as well. This kind of decision making, where one first decides between different types of task to perform, and then proceeds to find a way to perform the selected task optimally, is often referred to as hierarchical RL in the reinforcement learning literature. Actually, [2] also considers a myopic alternative for selecting between the different types of experiments, and it was observed that this might be challenging in practice if there are large differences in cost between the experiments. This was the motivation for studying the current example, where we now define the reward (cost) $r$ as a direct consequence of $a \in A$ as follows: $r = -10$ for $a = \text{Measurement}$, $r = -1$ for $a = \text{Lab}$ and $r = -0.1$ for $a = \text{FE}$.
In this example we also made use of the DQN approach of van Hasselt et al. [25], where we define a neural network $q(s, w) : S = S_{\mathrm{FE}} \times \mathbb{R}^2 \to \mathbb{R}^3$, that gives, for each state $s$, the (near optimal) value of each of the three actions. We refrain from describing all details regarding the neural network and the specific RL algorithm, as the main purpose of this example is illustration. But we note that two important innovations in the DQN algorithm, the use of a target network and experience replay as proposed in [29], were necessary for this to work.
The objective in this RL example is to estimate a failure probability using as few resources as possible. If an agent achieves the criterion on epistemic uncertainty reduction, that the expected failure probability plus/minus two standard deviations is either above or below the target value, we say that the agent has succeeded and we report the sum of the cost of all performed experiments. We also set a maximum limit of 40 experiments, i.e. after 40 tries the agent has failed. To compare the policy obtained by RL, we consider the random policy that selects between the three actions uniformly at random. We also consider a more "human like" benchmark policy, that corresponds to first running 10 computer experiments, followed by one lab experiment, then one defect measurement, then 10 new computer experiments, and so on.
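The benchmark policy and its cost can be spelled out as follows. The costs are the magnitudes of the rewards given above for the three actions; the function names and the representation of actions as strings are assumptions for this sketch, and the early stopping that occurs when the objective is achieved is deliberately left out.

```python
from itertools import cycle, islice

# Cost per experiment, i.e. the negative of the rewards stated in the text.
COST = {"FE": 0.1, "Lab": 1.0, "Measurement": 10.0}

def benchmark_actions(n_experiments=40, n_fe=10):
    """Repeat the block: n_fe computer experiments, one lab test, one measurement."""
    block = ["FE"] * n_fe + ["Lab", "Measurement"]
    return list(islice(cycle(block), n_experiments))

def total_cost(actions):
    """Sum of the costs of all performed experiments."""
    return sum(COST[a] for a in actions)

actions = benchmark_actions()
worst_case_cost = total_cost(actions)  # cost if all 40 experiments are used
```

With the limit of 40 experiments, the benchmark sequence contains 34 computer experiments, 3 lab experiments and 3 defect measurements, so an episode that never reaches the objective costs about 36.4 in total, most of it from the measurements.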
The final results from simulating 100 episodes with each of the three policies are shown in Figure 5. For the random and benchmark policies, the success rate (achieving the objective within 40 experiments in total) was around 60 %, whereas the RL agent succeeded in 94 % of the episodes.

Concluding remarks
To conclude our discussion, we recall that in this paper, we have:
• Given a measure-theoretic discussion of epistemic uncertainty and formally defined epistemic conditioning.
• Provided a mathematical definition of a probabilistic digital twin (PDT).
• Connected PDTs with sequential decision making problems, and discussed several solution approaches (maximum principle, dynamic programming, MDP and POMDP).
• Argued that using (deep) RL to solve sequential decision making problems in the PDT is a good choice for practical applications today.
• Proposed, for the specific use-case of optimal information gathering, a generic solution using deep RL on the information state-space.
Further research in this direction includes looking at alternative solution methods and RL algorithms in order to handle different PDT frameworks. A possible idea is to use a maximum principle approach instead of a DPP approach (as is done in RL). By using one of the MP based algorithms in [20], we may avoid the Markovianity requirement and possible time-inconsistency issues, and can also allow for state constraints. For instance, this is of interest when the objective of the sequential decision making problem in the PDT is to minimize a risk measure such as CVaR or VaR. Both of these risk measures are known to cause time-inconsistency in the Bellman equation, and hence, the DPP (and also RL) cannot be applied in a straightforward manner. This is work in progress.
and the updated marginal (predictive) distribution for $Y$ becomes $P_{\mathrm{new}}(Y) = \int P(Y \mid \theta)\, dP(\theta \mid I)$.

Example 3.2. (UQ - robust decisions) This example is a continuation of Example 2.4 and Example 2.7.

Figure 1: A standard regression model as a PDT.

Figure 2: Graphical representation of the corroded pipeline structural reliability model. The shaded nodes $d$, $p_{\mathrm{FE}}$ and $\mu_m$ have associated epistemic uncertainty.
Example 5.1 below shows exactly what we mean by this.Example 5.1.(Coin flip -information statespace) In the coin flip example (Example 2.

Figure 4: Total reward after 1000 episodes for a random policy, the unconstrained policy, and the agent which is subjected to the constraint that action 3 is not allowed unless $P(\theta = 0.99) > 0.9$.

Figure 5: Total cost (negative reward) after 100 successful episodes. For the random and benchmark policy, the success rate was around 60 % (to achieve the objective within 40 experiments in total), whereas 94 % was successful for the RL agent.

1. Defect measurement: We assume that noise perturbed observations of the relative depth, $d/t + \varepsilon$, can be made.
2. Computer experiment: Evaluate $p_{\mathrm{FE}}$ at some selected input $(D, t, s, d, l)$.
3. Lab experiment: Obtain one observation of $X_m$.