Detection and identification of changes of hidden Markov chains: asymptotic theory

This paper revisits a unified framework of sequential change-point detection and hypothesis testing modeled using hidden Markov chains and develops its asymptotic theory. Given a sequence of observations whose distributions are dependent on a hidden Markov chain, the objective is to quickly detect critical events, modeled by the first time the Markov chain leaves a specific set of states, and to accurately identify the class of states that the Markov chain enters. We propose computationally tractable sequential detection and identification strategies and obtain sufficient conditions for the asymptotic optimality in two Bayesian formulations. Numerical examples are provided to confirm the asymptotic optimality.


Introduction
In this paper, we revisit the joint problem of sequential change-point detection and hypothesis testing generalized in terms of hidden Markov chains. For a sequence of random variables whose distributions are functionals of a hidden Markov chain, the objective is to quickly detect the disorder, described by the event in which the hidden Markov chain leaves a specific set of states, and to accurately identify its cause, as represented by the class of states into which the Markov chain is absorbed. The problem reduces to solving the trade-off between minimizing the expected detection delay and the false alarm and misdiagnosis probabilities. A Bayesian formulation was studied in Dayanik and Goulding (2009).

Kazutoshi Yamazaki was in part supported by KAKENHI Grants No. 19H01791 and 20K03758 and JSPS Open Partnership Joint Research Projects Grant No. JPJSBP120209921.

Fig. 1 Transition of the status of a contagious disease. The node (H_i, α), i = 0, 1 and α = A, B, C, B&C, corresponds to the state where the hypothesis H_i is true and County α is infected. The node (H_i, extinct) is where H_i is true and the disease has become extinct before reaching A.
The sequential change-point detection, hypothesis testing, and their combinations are applied in a wide array of fields. Classic examples include signal, speech, and image processing; radio astronomy; finance/economics; and seismology. Their methodologies are often essential in the control of epidemics; see, e.g., (Baron 2004; Yu et al. 2013), which explore the detection of an influenza outbreak. The common objective in these applications is to derive efficient stopping rules that minimize the required observation size and the false alarm/misidentification probabilities. For a comprehensive review of this subject, see, e.g., (Poor 2013; Tartakovsky et al. 2014; Tartakovsky 2020).
While classical formulations have focused on settings with i.i.d. (independently and identically distributed) observations and simple (usually binary) decision rules, real-life decision-making is often more complex. Therefore, most of the past research on this subject has extended the classical settings to accommodate more realistic scenarios, typically by relaxing the i.i.d. assumptions and allowing for more complex (multiary) decision rules. This paper discusses one way to generalize these settings using hidden Markov chains.
To motivate the Markov chain model studied in this paper, consider the following problem, as graphically illustrated in Fig. 1. There are three counties A, B, and C facing an infectious disease. Suppose a case of infection is reported in County B, and the agency of County A must promptly detect the event of the infection transmission to County A. Initially, two hypotheses exist regarding the disease: human-to-human transmission is possible (H_1) and its negation (H_0). Suppose Counties A, B, and C are adjacent to each other and transmission can occur between these counties, except for the route from County C to A under H_0. The agency wants to quickly detect two events: transmission to County A and the disease becoming extinct before reaching County A. Hypothesis H_1 versus H_0 must also be identified in order to take suitable actions. Note that this formulation can be applied more widely outside of epidemic control; for example, rather than disease, computer viruses can be studied, or rumors that tend to change forms through social networks can be analyzed.
These dynamics can be efficiently modeled via a Markov chain, say Y = (Y n ) n≥0 : the decision-maker wants to detect the first time Y enters one of the four shaded nodes and to identify which node is entered. However, Y is not directly observable, and one must make a guess through indirect observations, say X = (X n ) n≥0 . This problem includes the features of both change-point detection and sequential hypothesis testing. The decision-maker must select the time to detect critical events and identify the true status of the disease (Y ) to take appropriate actions. More precisely, one observes a sequence of random variables whose distributions are functionals of a hidden Markov chain Y . The objective is to as quickly as possible detect the event that the hidden Markov chain leaves a specific set of states and to accurately identify the class of states into which the Markov chain is absorbed.
The scenario in Fig. 1 is merely one example, but the expanded Markov chain is capable of modeling various decision-making problems in various fields. The classical change-point detection with geometrically distributed disorder time and binary hypothesis testing under i.i.d. observations can be modeled by two-state Markov chains. Additional states to the Markov chain enable the modeling of the sequential change diagnosis (detection/isolation) problem, which was first studied in Nikiforov (1995) for the non-Bayesian (minimax) formulation and was further elaborated by, e.g., (Lai 2000; Nikiforov 2000, 2003; Oskiper and Poor 2002; Tartakovsky 2008); the Bayesian model has been studied in Dayanik et al. (2008). In fact, the range of problems the hidden Markov model encompasses is broad. For example, the geometrically distributed disorder time can be generalized to a phase-type distribution (the distribution of the absorption time of a Markov chain); see the examples described in Sections 1 and 2 of Dayanik and Goulding (2009).
There are two main research approaches of this subject -(i) to find the means to compute an optimal solution and (ii) to design asymptotically optimal solutions that are easy to compute and implement. In the first direction, the problem can typically be expressed in terms of the optimal stopping of the posterior probability process of each alternative hypothesis. However, few examples admit analytical solutions, and in practice, one needs to rely on numerical approximations, for example, via value iteration in combination with the discretization of the space of the posterior probability process. The computational burden and nontrivial computer representation of the optimal solution hinder the application of the findings of this first direction in practice. The second direction pursues a strategy that provides simple and scalable implementation but gives only near-optimal solutions. The asymptotic optimality as a certain parameter of the problem approaches an ideal value is commonly used as a proxy for the near-optimality.
Asymptotically optimal strategies are in most cases derived via renewal and nonlinear renewal theories (see Tartakovsky et al. 2014 for a comprehensive reference).
In the sequential (multiple) hypothesis testing with i.i.d. observations, the log-likelihood ratio (LLR) processes become conditional random walks. By utilizing the ordinary renewal theory, the asymptotic behaviors of the expected sample size and the misidentification costs can be approximated; see, for example, (Baum and Veeravalli 1994). Similar approaches are possible for change-point detection. In particular, when the disorder time is geometrically distributed and the observations are conditionally i.i.d., ordinary renewal theory can be applied to the LLR processes, which are conditional random walks.
On the other hand, when the observed random variables are not i.i.d. or when the change-point is not geometrically distributed, the asymptotic optimality is in general not guaranteed; instead, the existing literature typically shows that the r-quick convergence of Lai (1977, 1981) of a certain LLR process is a sufficient condition for asymptotic optimality. Tartakovsky (1998) and Dragalin et al. (2000) generalized Lai's results to multi-hypothesis sequential tests and to more general models; Dragalin et al. (2000) further obtained higher-order approximations by taking into account the overshoots at up-crossing times of the LLR processes. As for the change-point detection, Tartakovsky and Veeravalli (2004b) consider the non-i.i.d. case and show the asymptotic optimality of the Shiryaev procedure under the r-quick convergence. Its continuous-time version is studied in Baron and Tartakovsky (2006). Dayanik et al. (2013) obtained asymptotically optimal strategies for sequential change diagnosis, showing that the r-quick convergence is again a sufficient condition for asymptotic optimality.
Recently, Tartakovsky (2017) successfully obtained a weaker alternative sufficient condition, known as the r-complete convergence, for the non-i.i.d. case of the change-point detection. The r-quick convergence condition can be replaced by the r-complete convergence condition for a more general class of problems. For a comprehensive account of both analytical and asymptotic optimality of the change-point detection and sequential hypothesis testing, we refer the reader to Tartakovsky et al. (2014) and Tartakovsky (2020). For up-to-date results on the general detection-identification problem for non-i.i.d. data, see (Tartakovsky 2021).
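For reference, following Lai (1981) and Tartakovsky (2017), the two convergence modes can be written out explicitly. For a sequence (Z_n)_{n≥1} and a constant l, the normalized sequence Z_n/n converges to l r-quickly, respectively r-completely, if:

```latex
% r-quick convergence: the last entry time
%   L_\varepsilon := \sup\{n \ge 1 : |Z_n/n - l| > \varepsilon\}, \quad \sup\emptyset := 0,
% has a finite r-th moment:
\mathsf{E}\bigl[L_\varepsilon^r\bigr] < \infty \quad \text{for every } \varepsilon > 0;
% r-complete convergence requires only the summability
\sum_{n \ge 1} n^{r-1}\, \mathsf{P}\bigl\{|Z_n/n - l| > \varepsilon\bigr\} < \infty
\quad \text{for every } \varepsilon > 0.
```

Since P{L_ε ≥ n} ≥ P{|Z_n/n − l| > ε}, r-quick convergence implies r-complete convergence, which is why the latter yields the weaker sufficient condition.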
This paper presents an asymptotic analysis of the detection and identification problem in terms of the hidden Markov chains described above and derives asymptotically optimal strategies, focusing on the following two Bayesian formulations:

The minimum Bayes risk formulation: minimization of the sum of the expected detection delay and the false alarm and misdiagnosis probabilities (known as the Bayes risk).

The Bayesian fixed-error-probability formulation: minimization of the expected detection delay subject to certain upper bounds on the false alarm and misdiagnosis probabilities.
The optimal strategy of the former was derived in Dayanik and Goulding (2009). The latter is usually solved through its Lagrange relaxation, which is a minimum Bayes risk problem where the costs are the Lagrange multipliers of the constraints on the false alarm and misdiagnosis probabilities. In theory, by employing a hidden Markov chain with an arbitrary number of states, a wide range of realistic models can be constructed. However, the implementation is computationally feasible only for simple cases. The problem dimension is proportional to the number of states of the Markov chain, and the computational complexity grows exponentially with it, which hinders the applications of the hidden Markov model. In practice, obtaining exact optimal strategies is still limited to simple and classical examples.
We propose simple and asymptotically optimal strategies for both the minimum Bayes risk and the Bayesian fixed-error-probability formulations. The asymptotic analysis is similar for both formulations and can be conducted almost simultaneously. Similar to Dayanik et al. (2013), we show that the r -complete convergence for an appropriate choice of the LLR processes is a sufficient condition for asymptotic optimality. This is of particular importance because it was recently verified in Pergamenchtchikov and Tartakovsky (2018), Pergamenchtchikov and Tartakovsky (2019) that the r -complete convergence holds for a large class of Markov processes. We also show in certain cases that the limit can be analytically derived in terms of the Kullback-Leibler divergence. Through a sequence of numerical experiments, we further acknowledge the convergence results of the LLR processes and the asymptotic optimality of the proposed strategies.
The remainder of the paper is organized as follows. In Sect. 2, the two Bayesian formulations are defined. Section 3 presents strategies and the derivation of sufficient conditions for asymptotic optimality in terms of the r -complete convergence of the LLR processes. In Sect. 4, we present examples where the limits of the LLR processes can be analytically obtained via the Kullback-Leibler divergence. Section 5 concludes the paper with numerical results. Long proofs are deferred to the appendix.

Problem formulations
In this section, we define two Bayesian formulations: the minimum Bayes risk formulation (Problem 2.1) and the Bayesian fixed-error-probability formulation (Problem 2.2). In particular, the former has been studied, and its non-asymptotic solution derived, in Dayanik and Goulding (2009).
Consider a probability space (Ω, F, P) hosting a time-homogeneous Markov chain Y = (Y_n)_{n≥0} with some finite state space Y, initial state distribution η = {η(y) ∈ [0, 1], y ∈ Y}, and one-step transition matrix P = {P(y, y′) ∈ [0, 1], y, y′ ∈ Y}. Suppose that Y_1, . . . , Y_M are M closed (but not necessarily irreducible) mutually disjoint subsets of the state space Y, and let Y_0 := Y \ ∪_{k=1}^M Y_k. In other words, Y_0 is transient and the Markov chain Y eventually gets absorbed into one of the M closed sets. Let us define

θ := min{n ≥ 0 : Y_n ∉ Y_0} and μ := i on the event {Y_θ ∈ Y_i}, i = 1, . . . , M,

as the absorption time and the (index of the) closed set that absorbs Y, respectively. Here, because Y_0 is transient (i.e., θ < ∞ a.s.), the random variable μ is well-defined. We also define M := {1, . . . , M} and M_0 := M ∪ {0}.
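As a concrete illustration, the pair (θ, μ) can be read off a simulated path of Y. The transition matrix and class labels below are hypothetical, chosen only to exhibit one transient class and two absorbing classes:

```python
import numpy as np

# Illustrative sketch (not from the paper): a 4-state chain with transient
# class Y0 = {0, 1} and two absorbing classes Y1 = {2}, Y2 = {3}.
# theta is the exit time from Y0 and mu labels the absorbing class entered.
P = np.array([
    [0.6, 0.2, 0.1, 0.1],   # transient state 0
    [0.1, 0.6, 0.2, 0.1],   # transient state 1
    [0.0, 0.0, 1.0, 0.0],   # absorbing class Y1
    [0.0, 0.0, 0.0, 1.0],   # absorbing class Y2
])
TRANSIENT = {0, 1}
CLASS_OF = {2: 1, 3: 2}      # state -> index of the closed set it belongs to

def absorption_time_and_class(P, y0, rng, max_steps=10_000):
    """Run Y until it leaves Y0; return (theta, mu)."""
    y = y0
    for n in range(1, max_steps + 1):
        y = rng.choice(len(P), p=P[y])
        if y not in TRANSIENT:
            return n, CLASS_OF[int(y)]
    raise RuntimeError("not absorbed within max_steps")

rng = np.random.default_rng(0)
theta, mu = absorption_time_and_class(P, y0=0, rng=rng)
```

Because Y_0 is transient, every run terminates with θ < ∞ and μ ∈ {1, 2}.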

Remark 2.1
In the example of Fig. 1, the four shaded nodes form the closed sets, while the remaining nodes form the transient set Y_0.

The Markov chain Y can be indirectly observed through another stochastic process X = (X_n)_{n≥1} defined on the same probability space (Ω, F, P). We assume that there exists a set of probability measures {P(y, dx); y ∈ Y} defined on some common measurable space (E, E) such that, conditionally on Y, the random variables X_1, X_2, . . . are independent with X_n ∼ P(Y_n, ·) for every n ≥ 1. For every y ∈ Y, we assume that P(y, dx) admits a density function f(y, x) with respect to some σ-finite measure m on (E, E); namely, f(y, x) m(dx) = P(y, dx).

Remark 2.2
In this paper, we focus on the case in which the distribution of X depends only on the (unobservable) state of Y. However, in many applications, such as the detection of changes in an autoregressive (AR) process, the distribution of each observation also depends on the past observations. For a hidden Markov model that encompasses this general setting, see (Fuh and Tartakovsky 2018) (see also Remark 2.5).

Remark 2.3
In practical applications, the parameters of the post-change observation distributions are often unknown (composite hypotheses). The model considered here assumes more concrete knowledge of the post-change observation distributions and does not cover the composite-hypothesis cases in the existing literature. In general, the parameter set is not finite, and one natural way to handle this, as in much of the existing literature, is to consider a robust version using the worst-case error probabilities. We refer the reader to, for example, (Tartakovsky 2020, Chapter 7) for the Bayesian formulation of the composite case. One potential extension of the current hidden Markov model is a composite version in which the observation further depends on additional unknown parameters as well as the state of the Markov chain.

Remark 2.4
It is common in the literature to use X for the hidden Markov chain and Y for the observation process. In this paper, however, we follow the notation in Dayanik and Goulding (2009) and use Y for the hidden Markov chain and X for the observation process.
A (sequential decision) strategy (τ, d) is a pair of an F-stopping time τ (in short, τ ∈ F) and a random variable d : Ω → M that is measurable with respect to the observation history F_τ up to the stopping time τ (namely, d ∈ F_τ). Let ∆ denote the set of all strategies.
Our objective is to obtain a strategy (τ, d) that minimizes the m-th moment of the detection delay cost, for some m ≥ 1 and a deterministic, nonnegative, and bounded function c : Y → [0, ∞), as well as the terminal decision losses (TDLs). Regarding the function c, if we set c(y) = 0 for y ∈ Y_0 and c(y) = const for y ∉ Y_0, then it models the classical expected detection delay E[((τ − θ)^+)^m]. Allowing c to be state-dependent gives more flexibility in modeling; see, e.g., the examples given in Dayanik and Goulding (2009).
The Bayes risk is a linear combination of all of these losses, for some m ≥ 1, c, and a set of strictly positive constants a = (a_{yi})_{i∈M, y∈Y\Y_i}. In (2.1), while it is natural to assume c(y) = 0 for y ∈ Y_0, we allow c(y) to take any nonnegative values for y ∈ Y_0. On the other hand, in (2.2) and (2.3), we assume that any correct terminal decision incurs no loss. The objective is to minimize the Bayes risk over the set of strategies and to find a strategy (τ*, d*) that attains the minimum, if such a strategy exists.

Remark 2.5 (Connection with the hidden Markov model of Fuh and Tartakovsky (2018))
Problem 2.2 in our setting and the problem considered in Fuh and Tartakovsky (2018) complement each other. In Fuh and Tartakovsky (2018), they considered a version of change-point detection (without identification) using a hidden Markov chain that changes its dynamics at an unobservable time θ whose initial distribution is independent of the observation process. There, the distribution of the observation X_t at time t is a function of both the state of the hidden Markov chain Y_t and the previous observation X_{t−1}, such that the conditional observation distributions given that the change has occurred and has not occurred can be written as P^(j)(y, dx; x_0) = f^(j)(y, x; x_0) m(dx) for j = 0 (pre-change) and j = 1 (post-change).
When f^(j)(y, x; x_0) does not depend on x_0 (i.e., the distribution of the observation only depends on the current state of the Markov chain), it can be modeled as a special case of Problem 2.2 with M = 1. To see this, consider the case when θ ∼ Geom(p) and the state space of the Markov chain Y is Ẽ. Construct two Markov chains on Y_j = {(i, j); i ∈ Ẽ}, j = 0, 1, with corresponding transition matrices P^(j). Then, the model of Fuh and Tartakovsky (2018) corresponds to Problem 2.2 with a new Markov chain on the state space Y = Y_0 ∪ Y_1, whose transition matrix P combines P^(0) and P^(1): from a pre-change state the chain moves within Y_0 according to P^(0) with probability 1 − p and jumps to the post-change copy Y_1 according to P^(1) with probability p, while Y_1 is closed and evolves according to P^(1). This can be generalized by considering the case in which θ is phase-type (see Example 1 of Dayanik and Goulding (2009)) and modeling Y_0 in an analogous way using N copies of the Markov chain, where N is the number of states necessary to describe the phase-type distribution.
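Under the stated assumptions (θ ∼ Geom(p) and an observation law depending only on the current state), the combined transition matrix can be sketched in block form as follows; P0, P1, and p are illustrative placeholders, not values from the paper:

```python
import numpy as np

# Sketch of the construction in Remark 2.5: stack a pre-change copy Y_0 and
# a closed post-change copy Y_1 of the same state space into one chain.
def combine(P0, P1, p):
    """Block matrix [[ (1-p) P0, p P1 ], [ 0, P1 ]]."""
    k = P0.shape[0]
    top = np.hstack([(1 - p) * P0, p * P1])      # stay pre-change / change now
    bottom = np.hstack([np.zeros((k, k)), P1])   # post-change copy is closed
    return np.vstack([top, bottom])

P0 = np.array([[0.9, 0.1], [0.2, 0.8]])
P1 = np.array([[0.5, 0.5], [0.4, 0.6]])
P = combine(P0, P1, p=0.05)
```

Each row of P sums to (1 − p) + p = 1, and the zero lower-left block makes Y_1 a closed set, so θ (the entry time into Y_1) is geometric with parameter p.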
With the framework of Dayanik and Goulding (2009), we can consider various generalizations such as the case with identification (M ≥ 2) and also the case when the disorder time θ depends on μ (see Example 2 of Dayanik and Goulding (2009)).
Remark 2.6 Fix a set of positive constants R.

In our analysis, we will need to reformulate the problem in terms of the conditional probabilities P_i := P{· | μ = i} and P_i^(t) := P{· | μ = i, θ = t}; let E_i and E_i^(t) be the expectations with respect to P_i and P_i^(t), respectively. We also let ν_i := P{μ = i} be the unconditional probability of the event that Y is absorbed by Y_i. Because Y_0 is transient, we must have Σ_{i∈M} ν_i = 1. Without loss of generality, we assume that ν_i > 0 for every i ∈ M. We decompose the Bayes risk accordingly; in particular, if we set a_{yi} = 1 for all y ∈ Y\Y_i, we obtain (2.7) using (2.4).

Asymptotically optimal strategies
We now introduce two strategies. The first strategy triggers an alarm when the posterior probability of the event that Y has been absorbed by a certain closed set exceeds some threshold for the first time; it will later be proposed as an asymptotically optimal solution for Problem 2.1. The second strategy is its variant, expressed in terms of the log-likelihood ratio (LLR) processes, and will be proposed as an asymptotically optimal solution for Problem 2.2. For all y ∈ Y, let (Π_n(y))_{n≥0} be the posterior probability process defined by Π_n(y) := P{Y_n = y | F_n}. Then, for y ∈ Y, Π_0(y) = η(y), and for n ≥ 1 the process is computed recursively as in (3.1); see Dayanik and Goulding (2009) for how these can be derived. Also define Π_n^(i) := Σ_{y∈Y_i} Π_n(y) for i ∈ M_0. For the rest of the paper, we use the short-hand notation α_n^(i) := α^(i)(X_1, . . . , X_n) for n ≥ 1 and i ∈ M_0. We also impose the following assumption so that the LLR process is well-defined. Note, however, that it naturally holds except in pathological (and uninteresting) cases; it fails, for example, when θ is a deterministic constant (which can be modeled by a transition matrix with elements 0 and 1) and M = 1.
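The recursion (3.1) is the standard hidden-Markov (Bayes) filter: a prediction step through the transition matrix followed by a likelihood correction and normalization. A minimal sketch with illustrative numbers (the matrix and likelihood values are fabricated, not from the paper):

```python
import numpy as np

# One step of the posterior recursion: predict with P, correct with f(y, x_n).
def filter_step(pi_prev, P, likelihood):
    """pi_prev: posterior over states; likelihood: f(y, x_n) for each y."""
    unnormalized = likelihood * (pi_prev @ P)
    return unnormalized / unnormalized.sum()

# Transient state 0 and two absorbing states 1, 2 (illustrative values):
P = np.array([[0.8, 0.1, 0.1],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
eta = np.array([1.0, 0.0, 0.0])      # chain starts in the transient state
lik = np.array([0.2, 1.5, 0.1])      # f(y, x_1) for a fictitious observation
pi1 = filter_step(eta, P, lik)
```

The output is again a probability vector over the states of Y; iterating over n and summing over y ∈ Y_i yields the class posteriors Π_n^(i).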

Definition 3.1 ((τ_A, d_A)-strategy for the minimum Bayes risk formulation) Fix a set of strictly positive constants A = (A_i)_{i∈M}; the alarm time τ_A is the minimum of the stopping times τ^(i), where τ^(i) is the first time the posterior probability Π_n^(i) crosses the threshold associated with A_i, and d_A is the index attaining the minimum.

Define the logarithm of the odds-ratio process; then (3.5) can be rewritten in terms of the LLR processes.

Definition 3.2 ((υ_B, d_B)-strategy for the Bayesian fixed-error-probability formulation) Fix a set of strictly positive constants B = (B_{ij})_{i∈M, j∈M_0\{i}}, and define the strategy analogously in terms of the LLR processes. Notice, by (3.6), that the two strategies are closely related.

We will show that, by adjusting the values of A and B, the strategy (τ_A, d_A) is asymptotically optimal in Problem 2.1 as c ↓ 0 for fixed a, and the strategy (υ_B, d_B) is asymptotically optimal in Problem 2.2 as R ↓ 0. For the latter, we assume that, in taking limits, the components of R decrease proportionally according to some strictly positive constants (β_i)_{i∈M}; this limit mode will still be denoted by "R ↓ 0" for brevity. We assume (3.11) for our asymptotic optimality results. We choose the values of the barriers B as functions of R, and, for our strategies to be asymptotically optimal, it is necessary that R_{yi} for each y decrease at a similar speed (see Remark 3.4 and (3.22)). We will find functions A(c) and B(R) so that (3.12) and (3.13) hold. In fact, we will obtain results stronger than (3.12) and (3.13): we will show (3.14) and (3.15) for every i ∈ M.
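The alarm mechanism behind a (τ_A, d_A)-type rule can be sketched as follows. The threshold form Π_n^(i) ≥ 1 − A_i used here is an illustrative simplification of Definition 3.1, and the posterior path is fabricated:

```python
import numpy as np

# Raise the alarm for class i at the first n such that the posterior mass on
# Y_i exceeds 1 - A_i (an assumed threshold form, for illustration only).
def run_strategy(pis, A):
    """pis: posterior vectors over classes 0..M; A: thresholds (A_1,...,A_M)."""
    for n, pi in enumerate(pis, start=1):
        for i, a in enumerate(A, start=1):   # classes 1..M
            if pi[i] >= 1.0 - a:
                return n, i                   # (tau, d)
    return None, None                         # no alarm within the sample

# Fictitious posterior path over classes (0, 1, 2):
path = [np.array([0.8, 0.15, 0.05]),
        np.array([0.4, 0.55, 0.05]),
        np.array([0.1, 0.88, 0.02])]
tau, d = run_strategy(path, A=[0.2, 0.2])
```

Shrinking A postpones the alarm, which is the trade-off quantified in the asymptotic analysis below.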

Convergence of terminal decision losses and detection delay
As c and R decrease in Problems 2.1 and 2.2, respectively, the optimal stopping regions shrink and one should expect to wait longer. In Problem 2.1, when the unit sampling cost is small, one should take more advantage of it and sample more. In Problem 2.2, when the upper bounds on the TDLs are small, one expects to wait longer to collect more information in order to satisfy the constraints. On the other hand, the stopping regions for (τ_A, d_A) and (υ_B, d_B) shrink monotonically as A and B decrease. Therefore, the functions A(c) and B(R) should decrease monotonically as c and R decrease, respectively. We explore the asymptotic behaviors of the detection delay cost and the TDLs as A ↓ 0 and B ↓ 0. Moreover, we assume, while taking limits B ↓ 0, that the ratio B_i/B_{i′} for every i, i′ ∈ M is bounded from below by some strictly positive number, so that it is consistent with how R decreases to 0 as assumed in (3.11).
We first obtain bounds on the TDLs that are shown to converge to zero in the limit. The LLR processes can be used as Radon-Nikodym derivatives to change measures as the following lemma shows. The proof only requires the change of measure and the same result holds more generally. For the proof, see, e.g., Lemma 2.3 of Dayanik et al. (2013).
The next proposition can be obtained by setting F := {d = i} ∈ F τ in Lemma 3.1.

Proposition 3.2 (Bounds on the TDL)
We can obtain the following bounds on the TDLs.
Using the bounds in Proposition 3.2 and Remark 2.6, we can obtain feasible strategies by choosing the values of A and B accordingly.
We now analyze the asymptotic behavior of the detection delay. The next remark allows us to reduce the analysis to the individual stopping times τ^(i); its proof is the same as that of Proposition 3.6 of Dayanik et al. (2013).
The posterior probability process (Π_n^(i))_{i∈M_0} has been shown to converge a.s. in Dayanik and Goulding (2009). Moreover, because the posterior probability of the correct hypothesis should tend to increase in the long run, on the event {μ = i}, i ∈ M, it is expected that Π_n^(i) converges to 1 and that Π_n^(j) converges to 0 for every j ∈ M_0 \ {i} with probability one. This suggests the a.s. convergence of Λ_n(i, j) to infinity given μ = i for every j ∈ M_0 \ {i}. For the rest of this section, we further assume that the average increment converges to some strictly positive value.
Assumption 3.2 For every i ∈ M and j ∈ M_0 \ {i}, we assume that Λ_n(i, j)/n converges P_i-a.s. as n → ∞ to some strictly positive constant l(i, j); we write l(i) := min_{j∈M_0\{i}} l(i, j). This is indeed satisfied in the i.i.d. case (Dayanik et al. 2013). For the case |M| = 1, stronger convergence results (for a more general hidden Markov setting) beyond Assumption 3.2 hold, as shown by (Fuh and Tartakovsky 2018, Lemma 1). In Sect. 4, we will show that this is also satisfied in certain settings and that the limit can be expressed in terms of the Kullback-Leibler divergence.
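In the i.i.d. case, the strong law of large numbers already yields the linear drift of an LLR sum. A quick Monte Carlo illustration with a fictitious unit-variance Gaussian pair (mean 0 pre-change, mean 1 post-change), whose Kullback-Leibler divergence is (m1 − m0)^2/2 = 0.5:

```python
import numpy as np

# Illustrative check (not the paper's model): for i.i.d. observations drawn
# from f1 = N(1, 1), the sum of log(f1/f0)(X_l) with f0 = N(0, 1) grows
# linearly with slope equal to the KL divergence 0.5, by the SLLN.
rng = np.random.default_rng(1)
m0, m1, n = 0.0, 1.0, 200_000
x = rng.normal(m1, 1.0, size=n)                     # observations under f1
llr_increments = (x - m0) ** 2 / 2 - (x - m1) ** 2 / 2
slope = llr_increments.mean()                       # ~ 0.5
```

The sample slope concentrates around 0.5 at rate n^{-1/2}, matching the constant l(i, j) that Assumption 3.2 postulates in this simple setting.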
Let us fix i ∈ M. We show that, for small values of A and B, the stopping times are driven by the linear growth of the LLR processes, with Λ_n(i, j)/n ≈ l(i) for sufficiently large n, as the next proposition implies.

Proposition 3.4 For every i ∈ M, the following hold.
For the proof of Proposition 3.4 above, (ii) follows immediately from Assumption 3.2, and (i) follows from the next lemma after replacing Y_n^(j), P, and (μ_j)_{j∈M_0\{i}} in the lemma with Λ_n(i, j)/n, P_i, and (l(i, j))_{j∈M_0\{i}}, respectively, for every fixed i ∈ M.

Lemma 3.2 Let (Y_n^(j))_{n≥1}, j ∈ M_0 \ {i}, be sequences of random variables defined on a common probability space (Ω, E, P) such that Y_n^(j) converges a.s. to a constant μ_j for each j. Lemma 3.2 is a straightforward extension of Lemma 5.2 of Baum and Veeravalli (1994), and hence its proof is omitted.
The following lemma can be derived from Proposition 3.4. The proof is the same as that of Lemma 3.9 of Dayanik et al. (2013).

Lemma 3.3
For every i ∈ M and any j(i) ∈ arg min_{j∈M_0\{i}} l(i, j), we have P_i-a.s.
Remark 3.4 Without loss of generality, we shall assume that 0 < B_{ij} < 1 (i.e., −∞ < log B_{ij} < 0) for all i ∈ M and j ∈ M_0 \ {i}, as we are interested in the limits of certain quantities as B ↓ 0. Recall also that the ratio B_i/B_{i′} for every i, i′ ∈ M is bounded from below by some strictly positive number. Hence the displayed limits coincide; here, the last equality follows from the first two equalities.
see, e.g., Tijms (2003). This, together with the a.s. finiteness of θ and Lemma 3.3, proves the next lemma.
Lemma 3.4 For every i ∈ M and any j(i) ∈ arg min_{j∈M_0\{i}} l(i, j), we have P_i-a.s.
Because we want to minimize the m-th moment of the detection delay time for any m ≥ 1, we will strengthen the convergence results of Lemma 3.3. We require Condition 3.1 below for some r ≥ m.
Condition 3.1 (Uniform Integrability) For given r ≥ 1, we assume the uniform integrability stated in (i) and (ii) below. Because c(·) is bounded, this also implies the corresponding uniform integrability of the cost-weighted quantities. Hence, Condition 3.1 for some r ≥ m is sufficient for the L^m-convergence.
Lemma 3.6 For every i ∈ M and m ≥ 1, we have the following.
(i) If Condition 3.1 (i) holds for some r ≥ m, then we have (3.17). (ii) If Condition 3.1 (ii) holds for some r ≥ m, then we have (3.18). As an alternative to Condition 3.1, it can be shown that the r-quick convergence is a sufficient condition, as in (Baron and Tartakovsky 2006; Dragalin et al. 2000; Lai 1981; Tartakovsky and Veeravalli 2004b). However, here we obtain a weaker sufficient condition, known as the r-complete convergence, recently verified to act as a sufficient condition for a related problem in Tartakovsky (2017).
Proof We only prove (ii). The proof for (i) is similar and slightly simpler.
First, Fatou's lemma and Lemma 3.3 give the lower bound, and hence it suffices to obtain the upper bound. By following (A.5) of Tartakovsky (2017), we have a bound for 0 < ε < l(i). Therefore, the upper bound follows by (3.19) and the assumption. Because ε > 0 is arbitrary, we have the result.
Proof The claim (ii) holds by Lemma 3.7 because the required bound holds for 0 < ε < min_{j∈M_0\{i}: l(i,j)>l(i)} (l(i, j) − l(i)) and n ≥ 1. On the other hand, as in the proof of Remark 3.16 of Dayanik et al. (2013), we have the analogous bound for all ε > 0 and n > 2 log M/ε, and hence, for sufficiently small ε, (i) holds as well under C_1 and C_2.
is also P_i-uniformly integrable and converges to (c_i/l(i))^m, and hence we obtain the claimed limit. This and Corollary 3.1 show (3.18). The proof of (3.17) is similar.

Asymptotic optimality
We now prove the asymptotic optimality of (τ_A, d_A) and (υ_B, d_B) for Problems 2.1 and 2.2 under Conditions 3.1 (i) and (ii), respectively. We first derive a lower bound on the expected detection delay under the optimal strategy (see Lemma 3.8), which can be obtained similarly to the change-point detection and sequential multi-hypothesis testing problems (see Baron and Tartakovsky 2006; Baum and Veeravalli 1994; Dragalin et al. 1999, 2000; Lai 2000; Tartakovsky and Veeravalli 2004a). This lower bound and Lemma 3.6/Proposition 3.5 can be combined to obtain asymptotic optimality for both Problems 2.1 and 2.2.

Lemma 3.8 For every i ∈ M and any j(i) ∈ arg min_{j∈M_0\{i}} l(i, j), the following lower bound holds.
We now study how to set A in terms of c in order to achieve asymptotic optimality in Problem 2.1. We see from Proposition 3.2 and Lemma 3.6 that the TDLs decrease faster than the detection delay cost and are negligible when A and B are small. Indeed, with c̄_i := c_i^m, in view of the definition of the Bayes risk in (2.5), Proposition 3.2 and Lemma 3.6 yield (3.20) for every i ∈ M. Following the same idea as Baron and Tartakovsky (2006) for the change detection problem, we choose the value of A_i as the minimizer of the mapping in (3.20) over x ∈ (0, ∞). In particular, when m = 1, the minimizer admits an explicit expression. The proof of the following is similar to that of Proposition 3.18 of Dayanik et al. (2013) and is hence omitted.
Proposition 3.6 (Asymptotic optimality of (τ_A, d_A) in Problem 2.1) Fix m ≥ 1 and a set of strictly positive constants a. Under Condition 3.1 (i) or C_1 and C_2 of Corollary 3.1 for the given m, the strategy (τ_{A(c)}, d_{A(c)}) is asymptotically optimal as c ↓ 0; that is, (3.14) holds for every i ∈ M.
We now show that the strategy (υ_B, d_B) is asymptotically optimal for Problem 2.2. It follows from Proposition 3.3 that the constraints R are satisfied if we set B = B(R) appropriately. Assuming the conditions in Lemma 3.6 (ii) or Proposition 3.5 hold, because υ_{B(R)} ≤ υ^(i)_{B(R)} and min_{y∈Y_{j(i)}} R_{yi} ↓ 0 is equivalent to B_{i j(i)}(R) ↓ 0, we obtain the required lim sup bound. This together with Lemma 3.8 shows the asymptotic optimality.

Proposition 3.7 (Asymptotic optimality of (υ_B, d_B) in Problem 2.2) Fix m ≥ 1. Under Condition 3.1 (ii) or C_1 and C_2 of Corollary 3.1 for the given m, the strategy (υ_{B(R)}, d_{B(R)}) is asymptotically optimal as R ↓ 0; that is, (3.15) holds for every i ∈ M.

Convergence results of LLR processes
In this section, we consider two particular cases where Assumption 3.2 holds with l(i, j) expressed in terms of the Kullback-Leibler divergence. We assume that X_θ, X_{θ+1}, . . . are identically distributed on {μ = i} given θ, for every i ∈ M. For the purpose of determining the limit l(i, j), because each class is closed, we can assume without loss of generality that Y_i consists of a single (absorbing) state for every i ∈ M.

Recall that P_i denotes the conditional probability of the event that Y is absorbed by Y_i. We assume the following throughout this section.

Assumption 4.1 For every i ∈ M, we assume that the limit φ(i) := lim_{n↑∞} (−n^{−1} log P_i{θ > n}) exists and φ(i) ∈ (0, ∞].

Here, φ(i) = ∞ holds, for example, when P_i{θ < n_0} = 1 for some n_0 < ∞. On the other hand, we must have φ(i) > 0. To see this, note that θ is the exit time from a set of transient states; using the facts on absorption probabilities (see, e.g., Çınlar 2013, Chapters 5 and 6), it can be shown that P_i{θ > n} decays at least geometrically fast. In the special case where the change time is geometric with parameter p > 0, as in Dayanik et al. (2013), this is satisfied with φ(i) = |log(1 − p)|. Assumption 4.1 also holds, for example, when θ is a mixture or a sum of geometric random variables; see the examples given in Sect. 5.1.
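The geometric case can be checked directly: since P{θ > n} = (1 − p)^n, the tail decay rate equals |log(1 − p)| exactly for every n. A minimal numerical confirmation (p = 0.1 is an arbitrary illustrative value):

```python
import math

# For theta ~ Geom(p), the tail is P{theta > n} = (1 - p)^n, so the rate
# -(1/n) log P{theta > n} equals -log(1 - p) = |log(1 - p)| for every n.
p = 0.1
for n in (1, 10, 100):
    tail = (1 - p) ** n
    rate = -math.log(tail) / n
    assert math.isclose(rate, abs(math.log(1 - p)))
```

For mixtures or sums of geometrics, the rate is no longer constant in n, but the limit as n ↑ ∞ still exists, as claimed below Assumption 4.1.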

Example 1
Suppose that the distribution of X given Y is identical across the transient set Y_0; namely, f(y, ·) = f(z, ·) =: f_0(·) for y, z ∈ Y_0. This models, for example, the case in which the change point θ has a phase-type distribution. See Example 1 of Dayanik and Goulding (2009).
We denote by q(i, j) the Kullback-Leibler divergence of f_i(·) from f_j(·), which always exists and is nonnegative. We assume that f_i(·) and f_j(·) as in (4.1) are distinguishable for any i ≠ j; namely, we assume the following.

Assumption 4.2 We assume (4.5).

To ensure that ∫_E log(f_0(x)/f_j(x)) f_i(x) m(dx) exists for every i ∈ M and j ∈ M_0 \ {i}, we further assume the following.
This quantity exists by Assumption 4.3. Here, we allow (4.6) to be +∞, but we assume the following.
We shall prove the following under Assumptions 4.1-4.4.
In order to prove Proposition 4.1, we first simplify the LLR process as in (3.3). Define, for each j ∈ M, the functions h_ij as in (4.8). Lemma 4.1 Fix i ∈ M. For any n ≥ 1, the LLR process can be rewritten in terms of h_ij and the remainder term appearing in (4.9). By this lemma, each LLR process admits the decomposition (4.9). Here, notice that ℓ(j) < ∞ for every j ∈ M \ {i} to which Remark 4.1(1) applies. We study the convergence of (Σ_{l=1}^n h_ij(X_l))/n and of the remainder term in (4.9) separately. For i ∈ M and j ∈ M_0 \ {i}, because θ is an a.s. finite random variable, the strong law of large numbers yields (4.10). We now show that the remainder term in (4.9), divided by n, converges almost surely to zero.

Lemma 4.2 For every i ∈ M, we have the following under the respective conditions (4.11) and (4.12).

By the characterization of the remainder term in (4.9) and Lemma 4.2 (i)-(iii), the remainder divided by n vanishes P_i-a.s. for every j ∈ M \ {i}. This also holds when j = 0. Indeed, the left-hand side of (4.13) can be evaluated using Assumption 4.1, and by Lemma 3.2 we obtain (4.13). This, together with (4.10), proves Proposition 4.1.
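The role of the a.s. finiteness of θ in (4.10) can be seen numerically: the first θ terms contribute only O(θ/n) to the n-average, so the pre-change stretch is washed out. A hedged sketch, with illustrative unit-variance Gaussian densities and a geometric change time standing in for the paper's general setting:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setting: X_l ~ N(0, 1) for l < theta and ~ N(mu1, 1) for
# l >= theta, with a geometric change time theta; h_12 compares the true
# post-change mean mu1 against an alternative mean mu2.
mu1, mu2, p, n = 0.4, -0.1, 0.1, 100_000
theta = int(rng.geometric(p))
x = np.concatenate([rng.normal(0.0, 1.0, theta),
                    rng.normal(mu1, 1.0, n - theta)])
h = (mu1 - mu2) * x - 0.5 * (mu1**2 - mu2**2)

# The finite pre-change stretch is washed out: h.mean() is close to the
# post-change Kullback-Leibler divergence (mu1 - mu2)^2 / 2.
limit = 0.5 * (mu1 - mu2) ** 2
print(h.mean(), limit)
```

The average of the h-terms settles at the post-change Kullback-Leibler divergence regardless of the realized value of θ, which is the content of (4.10).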
The a.s. convergence can be extended to L^r(P_i)-convergence for r ≥ 1, under additional integrability conditions. First, as in Lemma 4.3 of Dayanik et al. (2013), for every i ∈ M, j ∈ M_0 \ {i} and r ≥ 1, the averages n^{−1} Σ_{l=1}^n h_ij(X_l) converge in L^r(P_i) as in (4.14). Here, (4.14) holds if the following condition holds.
On the other hand, by Lemma 4.2, the remainder term in (4.9), divided by n, converges to zero in L^r(P_i) as n ↑ ∞ under the following condition (Condition 4.2). Notice from Lemma 4.2 (vi), in the case j = i, that for L_n^{(i)}/n to converge to zero in L^r(P_i) it is sufficient to have (4.15). Condition 4.2 Given i ∈ M, j ∈ M \ {i} and r ≥ 1, we suppose that (4.11) and (4.15) hold and, if ℓ(j) < ∞, that (4.12) holds for the given r.
In summary, we have the following L^r-convergence results.
Proposition 4.2 For every i ∈ M and j ∈ M_0 \ {i}, we have Λ_n(i, j)/n → l(i, j) as n ↑ ∞ in L^r(P_i) for a given r ≥ 1 whenever Conditions 4.1 and 4.2 hold for that r.

Example 2
As a variant of Example 1, we consider the case where X is not necessarily identically distributed on the transient set. This can model the case when the distributions of X and θ depend on μ. See Sect. 5.1 for an example.
Given Y_0 and Y_θ = i, the conditional probability of {θ = t} given {μ = i}, as in (4.2), can be written explicitly.

Assumption 4.5 For every
This ensures that q(i, j) > 0 and q^{(0)}(i, j) > 0, where we use (4.4) and define q^{(0)}(i, j) analogously. To ensure that the corresponding expected log-likelihood ratios exist, we assume the following. Assumption 4.6 For every i, j ∈ M, we assume that q^{(0)}(i, j) < ∞.
We shall show the following under Assumptions 4.1, 4.4, 4.5, and 4.6.
As we did for Example 1 of Sect. 4.1, we simplify the LLR process as follows. Define the auxiliary processes Λ_n^{(0)}(i, j) for i, j ∈ M; we later show that Λ_n(i, 0)/n ∼ min_{j∈M} Λ_n^{(0)}(i, j)/n as n → ∞ under P_i (see (4.20)).

Lemma 4.3 For i ∈ M, and for i ∈ M and j ∈ M \ {i}, the LLR processes Λ_n(i, 0) and Λ_n(i, j) admit the corresponding representations.

As in Example 1, we decompose each LLR process for every i ∈ M. By the SLLN and Assumption 4.1, for every i ∈ M, (4.17) holds P_i-a.s. as n ↑ ∞. We now show that the remainder terms, divided by n, converge almost surely to zero as n → ∞. Similarly to Lemma 4.2, the following holds if (4.18) and (4.19) hold, respectively. By this lemma, for every i ∈ M, the remainder of Λ_n(i, j)/n vanishes for j ∈ M \ {i}, and the remainder of Λ_n^{(0)}(i, j)/n vanishes for j ∈ M, as n ↑ ∞, P_i-a.s. By this and (4.17), the proof of Proposition 4.3 is complete once we show (4.20) for Λ_n(i, 0)/n. Indeed, the difference vanishes P_i-a.s.; hence, by Lemma 3.2, (4.20) holds.
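The step (4.20) rests on the elementary fact behind Lemma 3.2: if b_j(n) → b_j for finitely many j, then (1/n) log Σ_j exp(n b_j(n)) → max_j b_j, so a normalized log of a sum of exponentials is asymptotically governed by the extreme exponent (with negative exponents, the minimum). A standalone numerical illustration with illustrative values:

```python
import numpy as np

# If b_j(n) -> b_j for finitely many j, then
# (1/n) * log(sum_j exp(n * b_j(n))) -> max_j b_j.
# Illustrative limits b_j with O(1/n) perturbations:
b = np.array([-0.4, -0.1, -0.25])

def normalized_logsumexp(n):
    bn = b + 1.0 / n            # perturbed exponents b_j(n)
    m = (n * bn).max()          # stabilize the exponential sum
    return (m + np.log(np.exp(n * bn - m).sum())) / n

for n in (10, 100, 10_000):
    print(n, normalized_logsumexp(n))
print("limit:", b.max())
```

For large n the normalized log-sum-exp is indistinguishable from max_j b_j, which is exactly how Λ_n(i, 0)/n inherits the limit of min_{j∈M} Λ_n^{(0)}(i, j)/n.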
We now pursue the convergence in the L^r-sense. In view of (4.21), we have Λ_n(i, 0)/n ≤ Λ_n^{(0)}(i, j)/n for any j ∈ M. Therefore, for the proof of the uniform integrability of Λ_n(i, 0)/n, it is sufficient to show that of Λ_n^{(0)}(i, j)/n for every j ∈ M. As in Example 1, for every i ∈ M and r ≥ 1, the L^r(P_i)-convergence of n^{−1} Σ_{l=1}^n h_ij(X_l) requires integrability conditions, which are satisfied under the following condition.

Condition 4.3 For given
which is satisfied if ℓ(j) < ∞ and the following holds.
On the other hand, by Lemma 4.2, the remainder of Λ_n(i, j)/n converges to zero in L^r(P_i) as n ↑ ∞ under Condition 4.5 below for j ∈ M \ {i}; for j = 0, the remainder of Λ_n^{(0)}(i, j)/n converges to zero in L^r(P_i) under Condition 4.6 below, for every j ∈ M. Notice, as in Lemma 4.2 (vi), that for L_n^{(i)}/n to converge to zero in L^r under P_i it is sufficient to have (4.23). Condition 4.5 Given i ∈ M, j ∈ M \ {i} and r ≥ 1, we suppose that (4.23) holds and, in addition, 1. if ℓ(j) = ∞, (4.18) holds, and 2. if ℓ(j) < ∞, (4.19) holds for the given r.

Condition 4.6
Given i ∈ M, we suppose that (4.23) holds and that max_{j∈M} ℓ(j) < ∞.
In summary, we have the following L^r-convergence results.

Numerical examples
In this section, we verify the effectiveness of the asymptotically optimal strategies through a series of numerical experiments. Because the optimality results fundamentally rely on the existence of the limits l(i, j) in Assumption 3.2, we first verify their existence numerically and show that they can be computed efficiently via simulation. We then evaluate the performance of the asymptotically optimal strategies in comparison with the optimal values.

Verification of Assumption 3.2
We consider both the case when X is i.i.d. in each of the closed sets, as studied in Sect. 4, and the non-i.i.d. case where each closed set may contain multiple states. In order to verify the convergence results of Sect. 4, we consider Example 2 of Sect. 4.2 with M = 2 and a hidden Markov chain specified as follows. Under P_1, Y starts at either (1, 1) or (1, 2) and is absorbed by 1, while under P_2 it starts at either (2, 1) or (2, 2) and is absorbed by 2. Conditionally on Y_0 = (1, 1), the absorption time θ is a sum of two independent geometric random variables with parameters 0.15 and 0.1; conditionally on Y_0 = (1, 2), it is geometric with parameter 0.1. It is easy to show that the exponential tail (4.3) under P_1 is ℓ(1) = |log(1 − min(0.1, 0.15))|. On the other hand, under P_2, the absorption time θ is a mixture of two geometric random variables with parameters 0.2 and 0.05; its exponential tail is ℓ(2) = |log(1 − min(0.2, 0.05))|.
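The stated tail rate for the sum of two geometrics, |log(1 − min(0.1, 0.15))|, can be confirmed by simulation. The sketch below estimates the rate from the slope of log P{θ > t} between two horizons, which cancels the constant prefactor of the tail:

```python
import math
import numpy as np

rng = np.random.default_rng(3)

# theta = sum of two independent geometric random variables with
# parameters 0.15 and 0.1, as in the example; its exponential tail rate
# equals |log(1 - min(0.1, 0.15))| = |log 0.9| (the slower component
# dominates the tail).
n = 2_000_000
theta = rng.geometric(0.15, n) + rng.geometric(0.1, n)

# Two-horizon slope of the log-tail removes the constant prefactor.
t1, t2 = 40, 80
p1, p2 = (theta > t1).mean(), (theta > t2).mean()
rate_mc = (math.log(p1) - math.log(p2)) / (t2 - t1)
rate_theory = abs(math.log(1.0 - min(0.1, 0.15)))
print(rate_mc, rate_theory)
```

The same two-horizon device applied to the mixture of geometrics under P_2 recovers ℓ(2) in the same way.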
For the observation process X, we assume that it is normally distributed with common variance 1 and conditional mean given Y equal to {λ(y); y ∈ Y}. As assumed in Example 2, we let λ^{(0)}_1 := λ((1, 1)) = λ((1, 2)) and λ^{(0)}_2 := λ((2, 1)) = λ((2, 2)). We also let λ_k := λ(k). Here we assume that λ^{(0)}_2 = 0 and λ_2 = 0.2. Using Proposition 4.3, the analytical limit values l(i, j) are obtained and listed in the last column of Table 1. In Fig. 3, we plot sample paths of Λ_n(1, ·)/n under P_1 and Λ_n(2, ·)/n under P_2, along with the theoretical limits l(i, j). In order to verify their almost sure convergence, Table 1 reports statistics on the position at times n = 500, 1000, 1500, based on 1000 samples each. We indeed see that the mean value approaches the theoretical limit and the standard deviation diminishes as n increases, verifying the almost sure convergence of the LLR processes.
We plot in Fig. 5 sample paths of the LLR processes Λ_n(1, ·)/n under P_1 and Λ_n(2, ·)/n under P_2, and show in Table 2 the statistics on their positions at n = 500, 1000, 1500, based on 1000 sample paths. We observe that these processes indeed converge to deterministic limits almost surely. The convergence holds regardless of the cyclic/acyclic structure of the closed sets.

Numerical results on asymptotic optimality
We now evaluate the asymptotically optimal strategy in comparison with the optimal Bayes risk, focusing on Problem 2.1 with m = 1. Dayanik and Goulding (2009) showed that the problem can be reduced to an optimal stopping problem for the posterior probability process, and in theory the value function can be approximated via value iteration combined with discretization. In practice, however, the state space grows exponentially in the number of states |Y|, so the computation is feasible only when |Y| is small (typically at most three or four). Moreover, we need to deal with small detection delay costs c, so the resulting stopping regions tend to be very small in practical applications; for this reason, the approximation is also severely affected by discretization errors. In order to provide a reliable approximation of the optimal Bayes risk, we consider two cases. Case 1 was considered in Dayanik and Goulding (2009), where θ is geometric with parameter 0.05 under P_1 and 0.1 under P_2. In Case 2, θ is a sum of two geometric random variables under P. See Fig. 6 (Case 1 under P_1 and P_2, and Case 2 under P_1 and P_2).

Fig. 5 Sample realizations of the LLR processes: (a) n ↦ Λ_n(1, 0)/n (red) and n ↦ Λ_n(1, 2)/n (blue) under P_1, and (b) n ↦ Λ_n(2, 0)/n (red) and n ↦ Λ_n(2, 1)/n (blue) under P_2, along with the mean of Λ_1500/1500 given in Table 2.

Table 2 The LLR processes at times n = 500, 1000, 1500: mean and standard deviation.

We set the detection delay function c = [0, 0, c̄, c̄] and the terminal decision loss a_yi = 1 for y ∉ Y_i and zero otherwise. The limits l(i, j) can be computed analytically by Proposition 4.1, and the asymptotically optimal strategy can be constructed analytically.
Here we have A_i(c) = c̄/l(i) for every i ∈ M. In order to compute the optimal Bayes risk, we first discretize the state space of the posterior probability process (the (|Y| − 1)-simplex) with a mesh of 70^{|Y|−1} points and then obtain the stopping regions by solving the optimality equation of Dayanik and Goulding (2009) via value iteration. The optimal Bayes risk is then approximated via simulation based on 10,000 paths; the risk under the asymptotically optimal strategy is approximated based on 100,000 paths. Table 3 shows the approximated Bayes risks (with 95% confidence intervals) for both strategies, together with their ratio. The ratio indeed converges to 1; in fact, the convergence is fast, and the asymptotically optimal strategy approximates the optimal Bayes risk precisely even for moderate values of c̄. Moreover, the proposed strategy can be derived analytically, and its Bayes risk can be computed almost instantaneously via simulation.
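The structure of the strategy is simple to implement: run the LLR in favor of a candidate hypothesis and stop at the first crossing of a threshold of order log(1/R). A minimal sketch, assuming a known change point and illustrative unit-variance Gaussian observations (so l(1, 2) = (mu1 − mu2)²/2); this is a stand-in for the paper's strategy, not its exact construction:

```python
import math
import numpy as np

rng = np.random.default_rng(4)

# Post-change observations are N(mu1, 1) under H1; the rule stops when
# the running LLR of H1 against H2 (means illustrative) crosses log(1/R).
mu1, mu2, R = 0.5, -0.5, 1e-3
threshold = math.log(1.0 / R)

def detection_delay(rng):
    llr, n = 0.0, 0
    while llr < threshold:
        x = rng.normal(mu1, 1.0)  # data generated under H1
        llr += (mu1 - mu2) * x - 0.5 * (mu1**2 - mu2**2)
        n += 1
    return n

delays = [detection_delay(rng) for _ in range(2000)]
# First-order asymptotics: expected delay ~ log(1/R) / l(1, 2),
# with l(1, 2) = (mu1 - mu2)^2 / 2 in this toy setting.
print(np.mean(delays), threshold / (0.5 * (mu1 - mu2) ** 2))
```

The mean delay sits slightly above log(1/R)/l(1, 2) because of threshold overshoot; the gap is o(log(1/R)) as R ↓ 0, consistent with the first-order asymptotic optimality discussed above.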
Data Availability Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

Declarations
Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A Proofs
A.1 Proof of Lemma 3.8 The proof of Lemma 3.8 requires the following lemmas. Note that the proof is similar to that of Theorem 3.5 of Baron and Tartakovsky (2006).
Moreover, as in the proof of Lemma A.1 of Dayanik et al. (2013), combining the above and taking the infimum over the class of strategies (R), the lemma holds because (τ, d) ∈ (R) implies that R_i(τ, d) ≤ Σ_{y∈Y\Y_i} R_{yi}/ν_i and R_ji(τ, d) ≤ Σ_{y∈Y_j} R_{yi}.