Shannon Entropy Rate of Hidden Markov Processes

Hidden Markov chains are widely applied statistical models of stochastic processes, from fundamental physics and chemistry to finance, health, and artificial intelligence. The hidden Markov processes they generate are notoriously complicated, however, even if the chain is finite state: no finite expression for their Shannon entropy rate exists, as the set of their predictive features is generically infinite. As such, to date one cannot make general statements about how random they are nor how structured. Here, we address the first part of this challenge by showing how to efficiently and accurately calculate their entropy rates. We also show how this method gives the minimal set of infinite predictive features. A sequel addresses the challenge's second part on structure.


I. INTRODUCTION
Randomness is as necessary to physics as determinism. Indeed, since Henri Poincaré's failed attempt to establish the orderliness of planetary motion, it has been understood that both determinism and randomness are essential and unavoidable in the study of physical systems [1][2][3][4]. In the 1960s and 1970s, the rise of dynamical systems theory and the exploration of the statistical physics of critical phenomena offered up new perspectives on this duality. The lesson was that intricate structures in a system's state space amplify uncertainty, guiding it and eventually installing it, paradoxically, in complex spatiotemporal patterns. Accepting this state of affairs prompts basic, but as-yet unanswered questions. How is this emergence monitored? How do we measure a system's randomness or quantify its patterns and their organization?
The tools needed to address these questions arose over recent decades during the integration of Turing's computation theory [5][6][7], Shannon's information theory [8], and Kolmogorov's dynamical systems theory [9][10][11][12][13]. This established the vital role that information plays in physical theories of complex systems. In particular, the application of hidden Markov chains to model and analyze the randomness and structure of physical systems has seen considerable success, not only in complex systems [14], but also in coding theory [15], stochastic processes [16], stochastic thermodynamics [17], speech recognition [18], computational biology [19,20], epidemiology [21], and finance [22], to offer a nonexhaustive list of examples.
A highly useful property of certain hidden Markov chains (HMCs) is unifilarity [23], a structural constraint on their state transitions. Shannon showed that, given a process generated by a finite-state unifilar HMC, one may directly and accurately calculate the process' irreducible randomness [8], now called the Shannon entropy rate.
Furthermore, for such a process, there is a unique minimal finite-state unifilar HMC that generates the process [24], known as the ε-machine. The ε-machine's states, the process' causal states, are the minimal set of maximally predictive features. One consequence of the ε-machine's uniqueness and minimality is that its mathematical description gives a constructive definition of a process' structural complexity as the amount of memory required to generate the process. Loosening the unifilar constraint to consider a wider class of generated processes, however, leads to major roadblocks. Predicting a process generated by a finite-state nonunifilar HMC requires an infinite set of causal states [25]. That is, though "finitely" generated, the process cannot be predicted by any finite unifilar HMC. Practically, this precludes directly determining the process' entropy rate using Shannon's result and, at best, obscures any insight into its internal structure.
That said, its causal states are (in general; see Appendix B) equivalent to the uncountable set of mixed states, or predictive features, formally introduced by Blackwell over a half century ago [26]. To date, working with the infinite set of mixed states has required coarse-graining it to produce a finite set of predictive features. Fortunately, the tradeoffs between resource constraints and predictive power induced by such coarse-graining can be systematically laid out [27][28][29].
The following, though, introduces an alternative and more direct approach to working with mixed states. It casts the generation of mixed states as a chaotic dynamical system, specifically, a (place-dependent) iterated function system (IFS). This obviates analyzing the underlying HMC via coarse-graining. Rather, the complex dynamics of the new system directly captures the information-theoretic properties of the original process.
Specifically, this allows exactly calculating the entropy rate of the process generated by the original nonunifilar finite-state HMC. Additionally, the IFS interpretation of the nonunifilar HMC provides new insight into the structure and complexity of infinite-state processes. This has direct application to the study of randomness and structure in a wide range of physical systems.
In point of fact, the following and its sequel [30] were preceded by two companion works that applied the theoretical results here to two, rather different, physical domains. The first analyzed the origin of randomness and structural complexity engendered by quantum measurement [31]. The second solved a longstanding problem on exactly determining the thermodynamic functioning of Maxwellian demons, aka information engines [32]. That is, the following and its sequel lay out the mathematical and algorithmic tools required to successfully analyze these applied problems. We believe the new approach is destined to find even wider applications.
Section II recalls the necessary background in stochastic processes, hidden Markov chains, and information theory. Section III reviews the needed results on iterated function systems, while Sec. IV develops mixed states and their dynamic, the mixed-state presentation. The main result connecting these then follows in Sec. V, showing that the mixed-state presentation is an IFS and that it produces an ergodic process. Section VI recalls Blackwell's theory, updating it for our present purpose of determining the entropy rate of any HMC. The Supplementary Materials provide background on the asymptotic equipartition property and the minimality of the mixed states. They also constructively work through the results for several example nonunifilar HMCs. They close with the statistical error analysis underlying entropy-rate estimation.

II. HIDDEN MARKOV PROCESSES
A stochastic process P is a probability measure over a bi-infinite chain . . . X_{t−2} X_{t−1} X_t X_{t+1} X_{t+2} . . . of random variables, each denoted by a capital letter. A particular realization . . . x_{t−2} x_{t−1} x_t x_{t+1} x_{t+2} . . . is denoted by lowercase letters.
We assume values x_t belong to a discrete alphabet A. We work with blocks X_{t:t'}, where the first index is inclusive and the second exclusive: X_{t:t'} = X_t . . . X_{t'−1}. P's measure is defined via the collection of distributions over blocks, {Pr(X_{t:t'})}. To simplify the development, we restrict to stationary, ergodic processes: those for which Pr(X_{t:t+ℓ}) = Pr(X_{0:ℓ}) for all t ∈ Z and ℓ ∈ Z^+. In such cases, we need only consider a process's length-ℓ word distributions Pr(X_{0:ℓ}).
A Markov process is one for which Pr(X_t | X_{−∞:t}) = Pr(X_t | X_{t−1}). A hidden Markov process is the output of a memoryless channel [33] whose input is a Markov process [16]. Working with processes directly is cumbersome, so we turn to consider finitely-specified mechanistic models that generate them.
A finite-state hidden Markov chain is specified by a set of hidden states S = {σ_1, . . . , σ_N}, an output alphabet A, and a set of symbol-labeled transition matrices T^(x), x ∈ A, where T^(x)_{ij} = Pr(x, σ_j | σ_i) is the probability of transitioning from state σ_i to state σ_j while emitting symbol x. The corresponding overall state-to-state transitions are described by the row-stochastic matrix T = Σ_{x∈A} T^(x).
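To make this representation concrete, the following is a minimal Python/NumPy sketch (ours, not the paper's) of a small, illustrative 2-state nonunifilar HMC stored as a dictionary of symbol-labeled matrices, together with its row-stochastic total matrix T and stationary state distribution π.

```python
import numpy as np

# Symbol-labeled transition matrices T^(x) of an illustrative 2-state HMC.
# T[x][i, j] = Pr(emit x, go to state j | current state i).
# This machine is nonunifilar: from state 0, symbol 0 can lead to either state.
T = {
    0: np.array([[0.5, 0.3],
                 [0.0, 0.4]]),
    1: np.array([[0.0, 0.2],
                 [0.6, 0.0]]),
}

# Overall state-to-state matrix T = sum_x T^(x); its rows sum to one.
T_total = sum(T.values())
assert np.allclose(T_total.sum(axis=1), 1.0)

# Stationary state distribution pi, satisfying pi = pi T.
evals, evecs = np.linalg.eig(T_total.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()
print("stationary distribution:", pi)
```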
Any given stochastic process can be generated by any number of HMCs. These are called a process' presentations.
We now introduce a structural property of HMCs that has important consequences in characterizing process randomness and structure.

Definition 2.
A unifilar HMC (uHMC) is an HMC such that for each state σ i ∈ S and each symbol x ∈ A there is at most one outgoing edge from state σ i labeled with symbol x.
Although there are many presentations for a process P, there is a canonical presentation that is unique: a process' ε-machine.

Definition 3.
An ε-machine is a uHMC with probabilistically distinct states: for each pair of distinct states σ_i, σ_j ∈ S there exists a finite word w = x_{0:ℓ−1} such that Pr(w | σ_i) ≠ Pr(w | σ_j). A process' ε-machine is its optimal, minimal presentation, in the sense that its number of states |S| is minimal among all of the process' unifilar presentations [34].

A. Entropy Rate of HMCs
A process' intrinsic randomness is the information in the present measurement, discounted by having observed the information in an infinitely long history. It is measured by Shannon's source entropy rate [8].

Definition 4.
A process' entropy rate h_µ is the asymptotic average entropy per symbol [35]:
$$h_\mu = \lim_{\ell \to \infty} \frac{H[X_{0:\ell}]}{\ell} ,$$
where H[X_{0:ℓ}] is the Shannon entropy of the block X_{0:ℓ}:
$$H[X_{0:\ell}] = - \sum_{w \in \mathcal{A}^\ell} \Pr(X_{0:\ell} = w) \log_2 \Pr(X_{0:\ell} = w) .$$
Given a finite-state unifilar presentation M_u of a process P, we may directly calculate the entropy rate from the transition matrices of the uHMC [8]:
$$h_\mu(\mathcal{P}) = - \sum_{\sigma \in \mathcal{S}} \pi_\sigma \sum_{x \in \mathcal{A}} \Pr(x \mid \sigma) \log_2 \Pr(x \mid \sigma) , \qquad (3)$$
where π is the stationary distribution over M_u's states. Blackwell showed, though, that in general for processes generated by HMCs there is no closed-form expression for the entropy rate [26]. For a process generated by a nonunifilar HMC M, applying Eq. (3) to M typically overestimates the true entropy rate of the process: h_µ(M) ≥ h_µ(P). Overcoming this limitation is one of our central results.
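For illustration, here is a short sketch (ours) of Eq. (3) for unifilar presentations, using the dictionary-of-matrices representation above; the helper name and the biased-coin example are illustrative choices.

```python
import numpy as np

def unifilar_entropy_rate(T_syms):
    """Shannon entropy rate, Eq. (3), for a *unifilar* HMC given as a dict
    of symbol-labeled transition matrices T^(x)."""
    T_total = sum(T_syms.values())
    evals, evecs = np.linalg.eig(T_total.T)       # stationary distribution pi
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()
    h = 0.0
    for Tx in T_syms.values():
        p_x_given_state = Tx.sum(axis=1)          # Pr(x | sigma_i): row sums of T^(x)
        with np.errstate(divide="ignore"):
            logs = np.where(p_x_given_state > 0, np.log2(p_x_given_state), 0.0)
        h -= np.dot(pi, p_x_given_state * logs)
    return h

# A biased coin as a 1-state unifilar HMC: emit 0 w.p. 0.7, 1 w.p. 0.3.
T_coin = {0: np.array([[0.7]]), 1: np.array([[0.3]])}
print(unifilar_entropy_rate(T_coin))              # ~0.8813 bits per symbol
```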
We now embark on introducing the necessary tools for this.

III. ITERATED FUNCTION SYSTEMS
To get there, we must take a short detour to review iterated function systems (IFSs) [36], as they play a critical role in analyzing HMCs. Speaking simply, we show that HMCs are dynamical systems-namely, IFSs.
Let (∆_N, d) be a compact metric space with d(·, ·) a distance. This notation anticipates our later application, in which ∆_N is the N-simplex of discrete-event probability distributions (see Section IV A). However, the results here are general.
Let f^(x) : ∆_N → ∆_N for x = 1, . . . , k be a set of Lipschitz functions with
$$d\!\left(f^{(x)}(\eta), f^{(x)}(\zeta)\right) \leq \tau^{(x)} d(\eta, \zeta)$$
for all η, ζ ∈ ∆_N, where τ^(x) is a constant. This notation is chosen to draw an explicit parallel to the stochastic processes discussed in Section II and to avoid confusion with the lowercase Latin characters used for realizations of stochastic processes. In particular, note that the superscript (x) here and elsewhere parallels that of the HMC symbol-labeled transition matrices T^(x). The reasons for this will soon become clear.
If each map f^(x) is a contraction, i.e., τ^(x) < 1 for all x, it is well known that there exists a unique nonempty compact set Λ ⊂ ∆_N that is invariant under the IFS's action:
$$\Lambda = \bigcup_{x=1}^{k} f^{(x)}(\Lambda) .$$
Λ is the IFS's attractor.
A place-dependent IFS additionally assigns to each map f^(x) a place-dependent probability function p^(x) : ∆_N → [0, 1], with Σ_x p^(x)(η) = 1 for all η ∈ ∆_N. At each step, a map f^(x) is applied with probability p^(x)(η) evaluated at the current point η, producing a Markov process on ∆_N. A Borel probability measure µ is said to be invariant or stationary for this process if it is preserved by the dynamic:
$$\mu(B) = \sum_{x} \int_{\Delta_N} p^{(x)}(\eta)\, \mathbf{1}_B\!\left(f^{(x)}(\eta)\right) d\mu(\eta)$$
for all Borel sets B ⊆ ∆_N.

Theorem 1. Assume that each f^(x) is a contraction, that the modulus of uniform continuity of each p^(x) satisfies Dini's condition, and that there exists a δ > 0 such that p^(x)(η) ≥ δ for all η ∈ ∆_N and all x. Then there is an attractive, unique, invariant probability measure for the Markov process generated by the place-dependent IFS.
In addition, under these same conditions Ref. [38] established an ergodic theorem for IFS orbits. That is, for any initial η_0 ∈ ∆_N and any continuous g : ∆_N → R,
$$\lim_{\ell \to \infty} \frac{1}{\ell + 1} \sum_{t=0}^{\ell} g(\eta_t) = \int_{\Delta_N} g(\eta)\, d\mu(\eta)$$
almost surely, where η_{t+1} = f^{(x_t)}(η_t) and the maps are chosen according to the place-dependent probabilities.

IV. MIXED-STATE PRESENTATION
We now return to stochastic processes and their HMC presentations. When calculating entropy rates from various presentations, we noted that nonunifilar HMC presentations lead to difficulties: (i) the internal Markov-chain entropy rate overestimates the process' entropy rate and (ii) there is no closed-form entropy-rate expression. To develop the tools needed to resolve these problems, we introduce HMC mixed states and their dynamic.
Assume that an observer has a finite HMC presentation M for a process P. Since the process is hidden, the observer does not directly measure M 's internal states. Absent output data, the best guess for M 's hidden states is that they occur according to the state stationary distribution π. The observer can improve on this guess by monitoring the output data x 0 x 1 x 2 . . . that M generates. Given knowledge of M , determining the internal state from observed data is the problem of observer-process synchronization.

A. Mixed States
For a length-ℓ word w generated by M, let η(w) = Pr(S | w) be the observer's belief distribution as to the process' current state after observing w:
$$\eta_i(w) = \Pr(\mathcal{S}_\ell = \sigma_i \mid X_{0:\ell} = w) . \qquad (7)$$
When observing an N-state machine, the vector ⟨η(w)| lives in the (N−1)-simplex ∆_{N−1}, the set such that:
$$\Delta_{N-1} = \Big\{ \eta \in \mathbb{R}^N : \sum_{i=1}^{N} \eta_i = 1 \text{ and } \eta_i \geq 0 \Big\} .$$
The set of belief distributions η(w) that an HMC can visit defines its set of mixed states:
$$\mathcal{R} = \{ \eta(w) : w \in \mathcal{A}^+ ,\ \Pr(w) > 0 \} .$$
Generically, the mixed-state set R for an N-state HMC is infinite, even for finite N [26].
Note that when a mixed state appears in probability expressions, the notation refers to the random variable η, not the row vector ⟨η|, and we drop the bra-ket notation. Bra-ket notation is used in vector-matrix expressions.
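As a concrete illustration of Eq. (7), the following sketch (ours, using the NumPy representation from the earlier example; the function name is illustrative) computes the belief state η(w) by filtering a word through the symbol-labeled matrices.

```python
import numpy as np

def mixed_state(T_syms, word, pi=None):
    """Belief state eta(w) = Pr(state | word), Eq. (7), for an HMC given as a
    dict of symbol-labeled transition matrices."""
    T_total = sum(T_syms.values())
    if pi is None:                         # start from the stationary distribution
        evals, evecs = np.linalg.eig(T_total.T)
        pi = np.real(evecs[:, np.argmax(np.real(evals))])
        pi = pi / pi.sum()
    eta = np.array(pi, dtype=float)
    for x in word:
        eta = eta @ T_syms[x]              # unnormalized: Pr(states and x | past)
        total = eta.sum()                  # Pr(x | past)
        if total == 0.0:
            raise ValueError("word has zero probability under this HMC")
        eta = eta / total                  # condition on having observed x
    return eta

T = {0: np.array([[0.5, 0.3], [0.0, 0.4]]),
     1: np.array([[0.0, 0.2], [0.6, 0.0]])}
print(mixed_state(T, [0, 0, 1]))
```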

B. Mixed-State Dynamic
The probability of transitioning from ⟨η(w)| to ⟨η(wx)| on observing symbol x follows from Eq. (7) immediately; we have
$$\Pr\!\big(\eta(wx) \mid \eta(w)\big) = \Pr(x \mid w) = \frac{\Pr(wx)}{\Pr(w)} .$$
This defines the mixed-state transition dynamic W. Together the mixed states and their dynamic define an HMC that is unifilar by construction. This is a process' mixed-state presentation (MSP) U(P) = {R, W}.
We defined a process' MSP U abstractly. Typically, U has an uncountably infinite set of mixed states, making it challenging to work with in the form laid out in Section IV A. Usefully, however, given any HMC M that generates the process, we may explicitly write down the dynamic W. Assume we have an (N+1)-state HMC presentation M with k symbols x ∈ A. The initial condition is the invariant probability π over the states of M, so that ⟨η_0| = ⟨δ_π|. In the context of the mixed-state dynamic, mixed-state subscripts denote time.
The probability of generating symbol x when in mixed state η is:
$$\Pr(x \mid \eta) = \langle \eta | T^{(x)} | \mathbf{1} \rangle , \qquad (8)$$
where T^(x) is the symbol-labeled transition matrix associated with the symbol x and |𝟏⟩ is the column vector of all ones.
From η_0, we calculate the probability of seeing each x ∈ A. Upon seeing symbol x, the current mixed state ⟨η_t| is updated according to:
$$\langle \eta_{t+1} | = \frac{\langle \eta_t | T^{(x)}}{\langle \eta_t | T^{(x)} | \mathbf{1} \rangle} . \qquad (9)$$
Thus, given an HMC presentation we can restate Eq. (7) as:
$$\langle \eta(w) | = \frac{\langle \pi | T^{(w)}}{\langle \pi | T^{(w)} | \mathbf{1} \rangle} , \quad \text{with} \quad T^{(w)} = T^{(x_0)} T^{(x_1)} \cdots T^{(x_{\ell-1})} .$$
Equation (9) tells us that, by construction, the MSP is unifilar, since each possible output symbol uniquely determines the next (mixed) state. Taken together, Eqs. (8) and (9) define the mixed-state transition dynamic W: on emitting symbol x, the MSP moves from mixed state η to the unique mixed state given by Eq. (9), and it does so with probability Pr(x | η) given by Eq. (8). To find the MSP U = {R, W} for a given HMC M, we apply the following mixed-state construction method; a code sketch follows the listed steps.
1. Calculate M's invariant state distribution: π = πT.
2. Take ⟨η_0| to be ⟨δ_π| and add it to R.
3. For each current mixed state η_t ∈ R, use Eq. (8) to calculate Pr(x | η_t) for each x ∈ A.
4. For η_t ∈ R, use Eq. (9) to find the updated mixed state η_{t+1,x} for each x ∈ A.
5. Add η_t's transitions to W and each η_{t+1,x} to R, merging duplicate states.
6. For each new η_{t+1}, repeat steps 3-5 until no new mixed states are produced.
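The following is a minimal breadth-first sketch (ours) of this construction, again assuming the dictionary-of-matrices representation; rounding mixed states to a fixed number of decimals to merge duplicates and capping the number of states are implementation choices of the sketch, needed because the MSP is generically infinite.

```python
import numpy as np
from collections import deque

def mixed_state_presentation(T_syms, decimals=10, max_states=10_000):
    """Breadth-first mixed-state construction.  Returns the reachable mixed
    states R and the dynamic W as {(state index, symbol): (next index, Pr(x|eta))}."""
    T_total = sum(T_syms.values())
    evals, evecs = np.linalg.eig(T_total.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()

    key = lambda eta: tuple(np.round(eta, decimals))   # merge near-duplicates
    index = {key(pi): 0}
    R, W = [pi], {}
    queue = deque([0])
    while queue and len(R) < max_states:
        i = queue.popleft()
        eta = R[i]
        for x, Tx in T_syms.items():
            p = float(eta @ Tx @ np.ones(len(eta)))    # Pr(x | eta), Eq. (8)
            if p <= 0.0:
                continue
            eta_next = (eta @ Tx) / p                  # Eq. (9)
            k = key(eta_next)
            if k not in index:                         # a new mixed state
                index[k] = len(R)
                R.append(eta_next)
                queue.append(index[k])
            W[(i, x)] = (index[k], p)
    return R, W

R, W = mixed_state_presentation({0: np.array([[0.5, 0.3], [0.0, 0.4]]),
                                 1: np.array([[0.0, 0.2], [0.6, 0.0]])})
print(len(R), "mixed states found (capped, since the MSP is generically infinite)")
```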
With the MSP U(M) in hand, the next issue is determining its (equivalent) ε-machine. There are several cases.
Beginning with a finite, unifilar HMC M generating a process P, the MSP U(M) is a finite, optimally predictive rival presentation to P's ε-machine, as seen in Fig. 2. In this case, the starting HMC depicted in Fig. 2(A) is an ε-machine, and reducing the MSP to its recurrent states recovers it. In general, if U(M) is finite, we find the ε-machine by minimizing U(M) via merging duplicate states: repeat the mixed-state construction on U(M) and trim transient states once more. Minimizing countably-infinite and uncountably-infinite U(M) is discussed further in Appendix B.
The MSPs of unifilar presentations are interesting and contain additional information beyond the unifilar presentations themselves. For example, since they contain the transient causal states, they are employed in calculating many complexity measures that track convergence statistics [39].
However, here we focus on the mixed-state presentations of nonunifilar HMCs, which typically have an infinite mixed-state set R. Figure 3 illustrates applying mixed-state construction to a finite, nonunifilar HMC. This produces an infinite sequence of mixed states on ∆_1 = [0, 1], as plotted in Fig. 3(B). In this particular example, the MSP is highly structured and R is countably infinite, allowing us to understand the underlying process P better than, say, inspecting the 2-state nonunifilar HMC in Fig. 3(A) directly. MSPs of nonunifilar HMCs typically have an uncountably-infinite mixed-state set R.

V. MSP AS AN IFS
With this setup, our intentions in reviewing iterated function systems (IFSs) become explicit. The mixed-state presentation (MSP) exactly defines a place-dependent IFS, where the mapping functions are the set of symbol-labeled mixed-state update functions given in Eq. (9) and the place-dependent probability functions are given by Eq. (8). We then have a mapping function and associated probability function for each symbol x ∈ A, each derived from the symbol-labeled transition matrix T^(x).
If these probability and mapping functions meet the conditions of Theorem 1, we identify the attractor Λ as the set of mixed states R and the invariant measure µ as the invariant distribution π of the potentially infinite-state U. This is the original HMC's Blackwell measure. Since all Lipschitz continuous functions are Dini continuous, the probability functions meet the conditions by inspection. We now establish that the maps are contractions, by appealing to Birkhoff's 1957 proof that a positive linear map preserving a convex cone is a contraction under the Hilbert projective metric [40].
Given an integer N ≥ 2, let C_N be the nonnegative cone in R^N, so that C_N consists of all vectors z = (z_1, z_2, . . . , z_N) satisfying z ≠ 0 and z_i ≥ 0 for all i.
The projective distance d : C_N × C_N → [0, +∞] is defined by
$$d(z, y) = \ln \frac{\max_i \left( z_i / y_i \right)}{\min_j \left( z_j / y_j \right)}$$
for z, y ∈ C_N, with d(z, z) = 0. If one of the points is on the cone boundary, the distance is taken to be +∞. Note that the projective distance, by construction, satisfies d(αz, βy) = d(z, y) for α, β ∈ R^+; in other words, two mixed states η, ζ ∈ ∆_N can equally well be compared as rays in C_N. We define the projective contractivity τ^(x) associated with T^(x) as
$$\tau^{(x)} = \sup \frac{d\!\left(z\, T^{(x)},\ y\, T^{(x)}\right)}{d(z, y)} ,$$
the supremum taken over every z, y ∈ C_N such that d(y, z) > 0. For a nonnegative matrix, τ^(x) ≤ 1; as the theorem below indicates, for a positive matrix this inequality is strict.

Theorem 2 (Birkhoff [40]). Let H be a strictly positive N × N matrix. Then, for all z, y ∈ C_N with d(z, y) > 0,
$$\frac{d(z H, y H)}{d(z, y)} \leq \tau(H) = \frac{1 - \sqrt{\phi(H)}}{1 + \sqrt{\phi(H)}} , \quad \text{where} \quad \phi(H) = \min_{i, j, k, l} \frac{H_{ik} H_{jl}}{H_{jk} H_{il}} .$$
By inspection, for a positive matrix H we have φ(H) > 0 and, hence, τ(H) < 1. As Ref. [42] notes, not only does the projective metric turn all positive linear transformations into contraction mappings, it is the only metric that does so.
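A small numerical sketch (ours, with an arbitrary strictly positive matrix) of the quantities in Theorem 2: it computes φ(H) and the Birkhoff coefficient τ(H), then checks empirically that the Hilbert projective distance contracts at least that fast.

```python
import numpy as np

def hilbert_metric(z, y):
    """Hilbert projective distance between two strictly positive vectors."""
    r = z / y
    return float(np.log(r.max() / r.min()))

def birkhoff_tau(H):
    """Birkhoff contraction coefficient tau(H) = (1 - sqrt(phi)) / (1 + sqrt(phi))."""
    n, m = H.shape
    phi = min(H[i, k] * H[j, l] / (H[j, k] * H[i, l])
              for i in range(n) for j in range(n)
              for k in range(m) for l in range(m))
    return (1 - np.sqrt(phi)) / (1 + np.sqrt(phi))

rng = np.random.default_rng(0)
H = rng.random((3, 3)) + 0.1                 # a strictly positive matrix
tau = birkhoff_tau(H)

worst = 0.0                                  # empirical sup of d(zH, yH) / d(z, y)
for _ in range(10_000):
    z, y = rng.random(3) + 1e-6, rng.random(3) + 1e-6
    d0 = hilbert_metric(z, y)
    if d0 > 1e-12:
        worst = max(worst, hilbert_metric(z @ H, y @ H) / d0)
print(f"Birkhoff bound tau = {tau:.4f}, worst observed ratio = {worst:.4f}")
```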
Positivity of the transition matrix guarantees that any boundary point is mapped into the interior of ∆_N. This is not generally true for our transition matrices, which are restricted merely to be nonnegative. However, the above result extends to any nonnegative matrix T^(x) for which there exists an n ∈ N^+ such that (T^(x))^n is a positive matrix; then there is a τ^(x) < 1 such that d(z (T^(x))^n, y (T^(x))^n) ≤ τ^(x) d(z, y). This is equivalent to requiring that T^(x) be irreducible and aperiodic. Still, we are not guaranteed irreducibility and aperiodicity for our symbol-labeled transition matrices. Indeed, the Simple Nonunifilar Source (SNS), depicted in Fig. 4, has two symbol-labeled transition matrices, both of which are reducible. A quick check is to examine Fig. 4 and ask whether there is a length-n sequence consisting of only a single symbol that reaches every state from every other state. Nonetheless, this HMC has a countable set of mixed states R and an invariant measure µ.
We can determine this from the SNS's mapping functions. From any initial mixed state η_0, other than η_0 = σ_0 = [1, 0], the probability of emitting the symbol whose mapping function is constant is positive; once it is emitted, the mixed state is guaranteed to be η = σ_0 = [1, 0]. When a mapping function is constant in this way, its contractivity is −∞, and we call the corresponding symbol a synchronizing symbol. From σ_0, the set of mixed states is generated by repeated emissions of the other, nonsynchronizing symbol x, so that R = { f^{(x) n}(σ_0) : n = 0, . . . , ∞ }. This is visually depicted in Fig. 3 for the specific case of p = q = 1/2. For all p and q, the measure can be determined analytically; see Ref. [43]. Note that this is due to the HMC's highly structured topology. In general, the set of mixed states is uncountable, either a fractal or a continuous set, and the measure cannot be analytically expressed.
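For concreteness, here is a sketch of an SNS-like two-state machine with p = q = 1/2; the symbol labels 0 and 1 and the specific matrix entries are illustrative choices of ours, not read off Fig. 4. It shows the countable family of mixed states reached by repeating the nonsynchronizing symbol and the reset produced by the synchronizing one.

```python
import numpy as np

# An SNS-like 2-state nonunifilar HMC with p = q = 1/2 (illustrative labels).
T0 = np.array([[0.5, 0.5],
               [0.0, 0.5]])   # nonsynchronizing symbol: nonunifilar branching
T1 = np.array([[0.0, 0.0],
               [0.5, 0.0]])   # synchronizing symbol: always lands in sigma_0

def update(eta, Tx):
    u = eta @ Tx
    return u / u.sum()

eta = np.array([1.0, 0.0])            # sigma_0
for n in range(5):                    # repeated emissions of the nonsync symbol
    print(f"after {n} repeats:", eta)
    eta = update(eta, T0)

print("after the synchronizing symbol:", update(eta, T1))   # back to [1, 0]
```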
Assuming the HMC generates an ergodic process ensures that the total transition matrix T = Σ_x T^(x) is nonnegative, irreducible, and aperiodic. Define, for any word w = x_1 . . . x_ℓ ∈ A^+, the associated mapping function f^(w) = f^{(x_ℓ)} ∘ · · · ∘ f^{(x_1)} and word matrix T^(w) = T^{(x_1)} · · · T^{(x_ℓ)}. Consider a word w in the process' typical set of realizations (see Appendix A), a set whose measure approaches one as |w| → ∞. Due to ergodicity, it must be the case that either (i) f^(w) is a constant mapping and, therefore, infinitely contracting, or (ii) T^(w) is irreducible.
As an example of the former case, any composition of the SNS mapping functions is a constant function so long as the word contains at least one synchronizing symbol, which occurs with probability approaching one as the word grows in length.
As an example of the latter case, imagine adding to the SNS in Fig. 4 an additional transition from σ_0 to σ_1. Then both symbol-labeled transition matrices are still reducible, but the composite transition matrix for any word containing both symbols is irreducible. Therefore, the corresponding map is contracting. While this is not the case for words composed entirely of a single repeated symbol, such sequences have measure zero as the word length grows. Appendix A discusses this further.

VI. ENTROPY OF GENERAL HMCS
Blackwell analyzed the entropy of functions of finite-state Markov chains [26]. With a shift in notation, functions of Markov chains can be identified as general hidden Markov chains. This is to say, both presentation classes generate the same class of stochastic processes. As we have discussed, the entropy rate problem for unifilar hidden Markov chains is solved, with Shannon's entropy rate expression, Eq. (3). However, according to Blackwell, there is no analogous closed-form expression for the entropy rate of a nonunifilar HMC.

A. Blackwell Entropy Rate
That said, Blackwell gave an expression for the entropy rate of general HMCs, by introducing mixed states over stationary, ergodic, finite-state chains. (Although he does not refer to them as such.) His main result, retaining his notation, is transcribed here and adapted by us to constructively solve the HMC entropy-rate problem.

Theorem 3 (Blackwell [26]). Let {x_n} be a stationary ergodic Markov chain with states 1, . . . , I and transition probabilities m(i, j). Let Φ be a function defined on 1, . . . , I with values a = 1, . . . , A and let y_n = Φ(x_n). The entropy of the {y_n} process is given by:
$$H = - \int_W \sum_{a} r_a(w) \log r_a(w) \; dQ(w) , \qquad (14)$$
where r_a(w) = Σ_{Φ(j)=a} Σ_i w_i m(i, j) and Q is a probability distribution on the Borel sets of the set W of vectors w = (w_1, . . . , w_I) with w_i ≥ 0 and Σ_i w_i = 1. Q is concentrated on the sets W_a, where W_a consists of all w ∈ W with w_i = 0 for Φ(i) ≠ a, and satisfies:
$$Q(E) = \sum_{a} \int_{f_a^{-1}(E)} r_a(w) \; dQ(w) ,$$
where f_a maps W into W_a, with the jth coordinate of f_a(w) given by Σ_i w_i m(i, j)/r_a(w) for Φ(j) = a and zero otherwise.
We can identify the w vectors in Theorem 3 as exactly the mixed states of Section IV. Furthermore, it is clear by inspection that r a (w) and f a (w) are the probability and mapping functions of Eqs. (8) and (9), respectively, with a playing the role of our observed symbol x.
Therefore, Blackwell's expression Eq. (14) for the HMC entropy rate, in effect, replaces the average over a finite set S of unifilar states in Shannon's entropy rate formula Eq. (3) with (i) the mixed states R and (ii) an integral over the Blackwell measure µ. In our notation, we write Blackwell's entropy formula as:
$$h_\mu(\mathcal{P}) = - \int_{\mathcal{R}} \sum_{x \in \mathcal{A}} \Pr(x \mid \eta) \log_2 \Pr(x \mid \eta) \; d\mu(\eta) . \qquad (16)$$
Thus, as with Shannon's original expression, this too averages over unifilar states; now, though, they are states of the mixed-state presentation U. This, in turn, maintains the finite-to-one mapping from internal (mixed-) state sequences to observed sequences. Therefore, one can identify the mixed-state entropy rate itself as the process' entropy rate.

B. Calculating the Blackwell HMC Entropy
Appealing to Ref. [38], we have that contractivity of our substochastic transition-matrix mappings guarantees ergodicity over the words generated by the mixed-state presentation. And so, we can replace Eq. (16)'s integral over R with a time average over a mixed-state trajectory η_0, η_1, . . . determined by a long allowed word, using Eqs. (8) and (9). This gives a new limit expression for the HMC entropy rate:
$$h_\mu(\mathcal{P}) = - \lim_{\ell \to \infty} \frac{1}{\ell} \sum_{t=0}^{\ell - 1} \sum_{x \in \mathcal{A}} \Pr(x \mid \eta_t) \log_2 \Pr(x \mid \eta_t) , \qquad (17)$$
where η_t = η(w_{0:t}) and w_{0:ℓ} is the first ℓ symbols of an arbitrarily long sequence w_{0:∞} generated by the process.
Note that w_{0:ℓ} will be a typical trajectory if ℓ is sufficiently long. To remove convergence-slowing contributions from transient mixed states, one can ignore some number of the initial mixed states. The exact number of transient states that should be ignored is unknown in general. That said, it depends on the initial mixed state η_0, which is generally taken to be ⟨δ_π|, and on the diameter of the attractor.
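The following is a minimal sketch (ours) of the orbit-average estimator in Eq. (17), reusing the dictionary-of-matrices representation from the earlier sketches; the sample length, burn-in, and example machine are illustrative choices.

```python
import numpy as np

def entropy_rate_estimate(T_syms, n_steps=200_000, burn_in=1_000, seed=0):
    """Estimate h_mu by evolving the mixed state along a sampled symbol
    sequence and averaging the branching entropy H[X | eta_t], per Eq. (17)."""
    rng = np.random.default_rng(seed)
    symbols = list(T_syms)
    T_total = sum(T_syms.values())
    evals, evecs = np.linalg.eig(T_total.T)
    eta = np.real(evecs[:, np.argmax(np.real(evals))])
    eta = eta / eta.sum()                          # eta_0 = pi
    ones = np.ones(len(eta))

    total, count = 0.0, 0
    for t in range(n_steps):
        probs = np.array([eta @ T_syms[x] @ ones for x in symbols])
        probs = probs / probs.sum()                # Pr(x | eta_t), Eq. (8)
        if t >= burn_in:                           # discard transient mixed states
            nz = probs[probs > 0]
            total += -(nz * np.log2(nz)).sum()     # branching entropy at eta_t
            count += 1
        i = rng.choice(len(symbols), p=probs)      # sample the next symbol
        eta = eta @ T_syms[symbols[i]]
        eta = eta / eta.sum()                      # Eq. (9): update mixed state
    return total / count

T = {0: np.array([[0.5, 0.3], [0.0, 0.4]]),
     1: np.array([[0.0, 0.2], [0.6, 0.0]])}
print("h_mu estimate:", entropy_rate_estimate(T))
```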
This completes our development of the HMC entropy rate. Appendix C applies the theory and associated algorithm to a number of examples, with both countable and uncountable mixed states, and reveals a number of surprising properties. We now turn to practical issues of the resources needed for accurate estimation.

C. Data Requirements
Although we developed our HMC entropy-rate expression in terms of IFSs, determining a process' entropy rate can be recast as Markov chain Monte Carlo (MCMC) estimation. In MCMC, the mean of a function f(x) of interest over a desired probability distribution π(x) is estimated by designing a Markov chain whose stationary distribution is π. For HMCs the desired distribution is the Blackwell measure µ, the stationary distribution over the MSP states R. The Markov chain is then simply the transition dynamic W over R.
With this setting, we estimate the entropy rate h_µ as the mean of the stochastic process defined by taking the branching entropy H[X | η] over symbols emitted from state η, for a sequence of mixed states generated by W. In effect, we estimate the entropy rate as the mean of this stochastic process:
$$\widehat{h}_\mu = \frac{1}{\ell} \sum_{t=0}^{\ell - 1} H[X \mid \eta_t] . \qquad (18)$$
Mathematically, little has changed. The advantage, though, of this alternative description is that it invokes the extensive body of results on MCMC estimation. There, it is well known that there are two fundamental sources of error in the estimation. First, there is that due to initialization bias: undesired statistical trends introduced by the initial transient data produced by the Markov chain before it reaches the desired stationary distribution. Second, there are errors induced by autocorrelation in equilibrium. That is, the samples produced by the Markov chain are correlated. The consequence is that the statistical error cannot be estimated as 1/√N, as is done for N independent samples.
To address these two sources of error, we follow common MCMC practice, considering two "time scales" that arise during estimation. Consider the autocorrelation of a stationary observable f of the chain:
$$C_f(L) = \big\langle (f_t - \mu_f)(f_{t+L} - \mu_f) \big\rangle ,$$
where µ_f is f's mean. Also, consider the normalized autocorrelation:
$$\rho_f(L) = C_f(L) / C_f(0) .$$
If the autocorrelation decays exponentially with time, we define the exponential autocorrelation time:
$$\tau_{\exp,f} = \limsup_{L \to \infty} \frac{L}{-\log |\rho_f(L)|} ,$$
with τ_exp the supremum over observables f. So, τ_exp bounds the time needed to converge from an initial nonequilibrium distribution to the equilibrium distribution.
For a given observable, we also define the integrated autocorrelation time τ_{int,f}:
$$\tau_{\mathrm{int},f} = \frac{1}{2} + \sum_{L=1}^{\infty} \rho_f(L) .$$
This relates the correlated samples produced by the chain to the variance of independent samples for the particular function f of interest. The variance of f's sample mean in MCMC is higher by a factor of 2τ_{int,f}. In other words, the statistical errors for a sample of length N are of order √(τ_{int,f}/N). Thus, targeting 1% accuracy requires ≈ 10^4 τ_{int,f} samples.
In practice, it is difficult to find τ exp and τ int for a generic Markov chain. There are two options. The first is to use numerical approximations that estimate the autocorrelation function, and therefore τ , from data. If we have the nonunifilar model in hand, it is a simple matter of sweeping through increasingly long strings of generated data until we observe convergence of the autocorrelation function.
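As a sketch of this first option, the following (ours) estimates the integrated autocorrelation time from a series of correlated samples, e.g., the branching entropies H[X | η_t] along an MSP orbit; the simple positive-window truncation and the AR(1) surrogate used for demonstration are illustrative choices.

```python
import numpy as np

def integrated_autocorr_time(samples, max_lag=None):
    """Estimate tau_int = 1/2 + sum_L rho(L) from a 1-D array of correlated samples.
    Uses a simple truncation once the autocorrelation estimate hits noise."""
    x = np.asarray(samples, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag is None:
        max_lag = n // 100
    c0 = np.dot(x, x) / n
    tau = 0.5
    for lag in range(1, max_lag):
        rho = np.dot(x[:-lag], x[lag:]) / ((n - lag) * c0)
        if rho <= 0.0:
            break
        tau += rho
    return tau

# Demonstration on an AR(1) surrogate with known correlation time.
rng = np.random.default_rng(1)
y = np.zeros(50_000)
for t in range(1, len(y)):
    y[t] = 0.9 * y[t - 1] + rng.normal()
tau = integrated_autocorr_time(y)
print("tau_int ~", tau, " effective sample size ~", len(y) / (2 * tau))
```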
Alternatively, taking inspiration from previous treatments of nonunifilar models, we make a finite-state approximation to the MSP by coarse-graining the simplex into boxes of side length ε and employing a suitable method, such as Ulam's method, to approximate the transition operator. Using methods previously discussed in Ref. [44], this allows calculating the autocorrelation function directly. Appendix D shows that the approximation error vanishes as ε → 0.
The net result is that, being cognizant of the data requirements, entropy rate estimation is well behaved, convergent, and accurate.

VII. CONCLUSION
We opened this development considering the role that determinism and randomness play in the behavior of complex physical systems. A central challenge in this has been quantifying randomness, patterns, and structure and doing so in a mathematically-consistent but calculable manner. For well over half a century the Shannon entropy rate has stood as the standard by which to quantify randomness in a time series. Until now, however, calculating it for processes generated by nonunifilar HMCs has been difficult, at best.
We began our analysis of this problem by recalling that, in general, hidden Markov chains that are not unifilar have no closed-form expression for the Shannon entropy rate of the processes they generate. Despite this, these HMCs can be unifilarized by calculating the mixed states. The resulting mixed-state presentations are themselves HMCs that generate the process. However, adopting a unifilar presentation comes at a heavy cost: Generically, they are infinite state and so Shannon's expression cannot be used. Nonetheless, we showed how to work constructively with these mixed-state presentations. In particular, we showed that they fall into a common class of dynamical system. The mixed-state presentation is an iterated function system. Due to this, a number of results from dynamical systems theory can be applied.
Specifically, analyzing the IFS dynamics associated with a finite-state nonunifilar HMC allows one to extract useful properties of the original process. For instance, we can easily find the entropy rate of the generated process from long orbits of the IFS. That is, one may select any arbitrary starting point in the mixed-state simplex and calculate the entropy over the IFS's place-dependent probability distribution. We evolve the mixed state according to the IFS and sequentially sample the entropy of the place-dependent probability distribution at each step. Using an arbitrarily long word and taking the mean of these entropies, the method converges on the process' entropy rate.
Although others have considered the IFS-HMC connection [45,46], our development expanded previous work to include the much broader, more general class of nonunifilar HMCs. In addition, we demonstrated not only the mixed-state presentation's role in calculating the entropy rate, but also its connection to existing approaches to randomness and structure in complex systems. In particular, while our results focused on quantifying and calculating a process' randomness, we left open questions of pattern and structure. However, the path to achieving the results introduced here strongly suggests that the mixed-state presentation offers insight into answering these questions. For instance, Fig. 3 demonstrated how the highly structured nature of the Simple Nonunifilar Source is made topologically explicit through calculating its mixed-state presentation, which is also its ε-machine.
Though space will not let us develop it further here, this connection is not spurious. Indeed, many information-theoretic properties of the underlying process may be directly extracted from its mixed-state presentation. This follows from our showing how the attractor of the IFS defined by an HMC is exactly the set of mixed states R of that HMC. These sets are often fractal in nature and quite visually striking. See Fig. S6 for several examples.
The sequel [30] to this development establishes that the fractal dimension of the mixed-state attractor is exactly the divergence rate of the statistical complexity [24], a measure of a process' structural complexity that tracks memory. Furthermore, the sequel introduces a method to calculate the fractal dimension of the mixed-state attractor from the Lyapunov spectrum of the mixed-state IFS. In this way, it demonstrates that coarse-graining the simplex, the previous approach to studying the structure of infinite-state processes, may be avoided altogether.
To close, we note that these structural tools and the entropy-rate method introduced here have already been put to practical use in two previous works. One diagnosed the origin of randomness and structural complexity in quantum measurement [31]. The other exactly determined the thermodynamic functioning of Maxwellian information engines [32], when there had been no previous method for this. At this point, however, we must leave the full explication of these techniques and further analysis on how mixed states reveal the underlying structure of processes generated by hidden Markov chains to the sequel [30].

Supplementary Materials
The Shannon Entropy Rate of Hidden Markov Processes Alexandra Jurgens and James P. Crutchfield

arXiv:2002.XXXXX
The Supplementary Materials to follow review the notion of typical sets of realizations in a stochastic process, discuss minimality of infinite-state mixed-state presentations, determine the entropy rates of a suite of example hidden Markov chains with infinite mixed-state presentations, and give details of errors that arise when estimating autocorrelation.

Appendix A: Asymptotic Equipartition and the Typical Set Contraction
The asymptotic equipartition property (AEP) states that for a discrete-time, ergodic, stationary process X:
$$- \frac{1}{n} \log_2 \Pr(X_{0:n}) \to h_\mu(X)$$
as n → ∞ [33]. This effectively divides the set of sequences into two sets: the typical set, consisting of sequences for which the AEP holds, and the atypical set, for which it does not. As a consequence of the AEP, the typical set has measure one in the space of all allowed realizations and the atypical set approaches measure zero as n → ∞.
We argue that, while our IFS class includes reducible maps, any composition of maps corresponding to a word in the typical set will be irreducible. This can be seen intuitively by considering the SNS, shown in Fig. 4, and adding an additional transition from σ_0 to σ_1. This produces an HMC with two reducible symbol-labeled transition matrices, but an irreducible total transition matrix. However, as |w| → ∞, the only words for which T^(w) remains reducible are those consisting of a single repeated symbol. Such words cannot be in the typical set: the per-symbol log-probability −(1/n) log_2 Pr(x^n) of a single-symbol word does not, in general, converge to h_µ(X). The entropy rate h_µ is by definition the branching entropy averaged over the mixed states. And so, any word that visits only a restricted subset of the mixed states, i.e., a word with a reducible transition matrix, cannot approach h_µ, regardless of length. Therefore, only words with an irreducible mapping will be in the typical set, implying that words without a contractive mapping have measure approaching zero at sufficiently long word lengths |w|.

Appendix B: Minimality of U(M )
The minimality of infinite-state mixed-state presentations U(M ) is an open question. As demonstrated in Appendix C 1, it is possible to construct MSPs with an uncountably infinite number of states for a process that requires only one state.
A proposed solution to this problem is a short and simple check on the mergeability of mixed states, which here refers to any two distinct mixed states that have the same conditional probability distribution over future strings; i.e., any two mixed states η_0 and ζ_0 for which Pr(X_{0:L} | η_0) = Pr(X_{0:L} | ζ_0) for all L ∈ N^+.
Although minimality does not impact the entropy-rate calculation, one benefit of the IFS formalization of the MSP is the ability to directly check for duplicated states and therefore determine whether the MSP is nonminimal. We check this by considering, for an (N+1)-state machine M with alphabet A = {0, 1, . . . , k}, the dynamic not only over mixed states but also over probability distributions over symbols. Let:
$$\rho(\eta) = P(\eta) = \left( \langle \eta | T^{(0)} | \mathbf{1} \rangle,\ \langle \eta | T^{(1)} | \mathbf{1} \rangle,\ \ldots,\ \langle \eta | T^{(k)} | \mathbf{1} \rangle \right) \qquad (S2)$$
and consider Fig. S1. For each mixed state η ∈ ∆_N, Eq. (S2) gives the corresponding probability distribution ρ(η) ∈ ∆_k over the symbols x ∈ A. Let M emit symbol x; then the dynamic from one such probability distribution ρ ∈ ∆_k to the next is given by:
$$g^{(x)}(\rho) = P\!\left( f^{(x)}\!\left( P^{-1}(\rho) \right) \right) .$$
From this, we see that if Eq. (S2) is invertible, g^(x) : ∆_k → ∆_k is well defined and has the same functional properties as f^(x). In other words, in this case, it is not possible to have two distinct mixed states η, ζ ∈ ∆_N with the same probability distribution over symbols. And the probability distributions can only converge under the action of g^(x) if the mixed states also converge under the action of f^(x). Shortly, we consider several cases where P is not invertible over the entire simplex.
If every mixed state in R corresponds to a unique probability distribution over symbols, we conjecture that the corresponding U(M) is the minimal unifilar representation of the underlying process P. If we then trim the transient states of U(M), leaving the recurrent set R_R, the result is the ε-machine.

Appendix C: Entropy Rates of Example HMCs

a. The Cantor Set

The middle-thirds Cantor set is uncountably infinite and has Hausdorff dimension log 2 / log 3. A parametrized family of Cantor sets is generated by repeating the construction with scaling ratio s > 2: each interval is replaced by two copies of itself scaled by 1/s, removing the middle fraction (s − 2)/s. The resulting Hausdorff dimension is log 2 / log s. Simply stated, the dimension is the logarithm of the number of copies of the original unit interval made at each iteration, divided by the logarithm of the length ratio between the original object and its copy.

b. The Cantor Machine
The Cantor set, due to its familiarity, makes for a useful first object of study for uncountable-mixed-state HMCs. Figure S2 shows an HMC M_C that generates a Cantor set of mixed states. Here, 0 < a < 1 adjusts the statistical bias of the measure over the Cantor set and s > 2 is the scaling ratio for copying the intervals.
From Fig. S2 we read off the transition matrices and, from them, immediately write down the probability functions and mapping functions, recalling that in the two-state case the vectors on the simplex take the form ⟨η| = (⟨η|δ_1⟩, 1 − ⟨η|δ_1⟩). It is easily seen, by considering ⟨η|δ_1⟩ = 0 and ⟨η|δ_1⟩ = 1, that these maps, in fact, map the simplex to the first and second intervals of C_2, respectively.
The Cantor Machine MSP U(M_C) is shown in Fig. S3 (Top). It has an uncountably-infinite number of recurrent states, which correspond exactly to the elements of the Cantor set. Since the probability functions do not depend on ⟨η|, we do not need to invoke the Ergodic Theorem, but instead can calculate the entropy exactly:
$$h_\mu = - a \log_2 a - (1 - a) \log_2 (1 - a) .$$

c. A Biased Coin
However, there is an important caveat here, noted in Appendix B. The MSP may contain states that are probabilistically equivalent. The probability mapping functions are noninvertible and, in fact, every single mixed state corresponds to the same conditional probability distribution over symbols. This means that the uncountably-infinite MSP is not a minimal presentation. There is a markedly simpler unifilar model for the Cantor set machine M C .
In fact, all mixed states in U(M C ) collapse into a single state, giving the minimal unifilar model of the Cantor set machine as the Biased Coin HMC shown in Fig. S3 (Bottom). This HMC generates the same process as the Cantor machine, but requires only a single state.

Countable MSP with 2-state HMC
Now, we explore a different, but related case that introduces a condition for a countable MSP and again highlights the role of minimality. Consider the 3-state nonunifilar HMC M_3S shown in Fig. S4; its transition matrices give the mapping and probability functions of its MSP. Consider the probability functions first. P is not invertible over all of ∆_2, but is partially invertible over a restricted domain. Given a line in the simplex where η_2 and η_3 are functions of η_1, say, (η_1, (1 − η_1)/2, (1 − η_1)/2), we can invert P(η). The question becomes: What is the appropriate restricted domain?
Note that for both mapping functions the image satisfies η_1 = η_2. In the simplex this corresponds to all the mixed states lying along a line in ∆_2, the line (η_1, η_1, 1 − 2η_1). This, then, is the restricted domain over which the states of U(M_3S) correspond to unique probability distributions. The fact that this space is a line implies that the generative machine can be written with only two states.
The constancy of the mapping function for the synchronizing symbol contributes further structure, ensuring that the set of mixed states is countably infinite. We can write the mixed states down in series, in terms of how many nonsynchronizing symbols we have seen since the last synchronizing symbol, where the n = 0 mixed state is η_0 = (0, 0, 1). The transition probabilities for these states follow from Eq. (8). The MSP U(M_3S) is shown in Fig. S5. If the initial condition is η_0 = (0, 0, 1), all mixed states generated by the mapping functions are recurrent and, as we discussed, have unique probability distributions. Therefore, the HMC in Fig. S5 is the process' ε-machine. Since R is countable, we can find µ by hand, by solving the set of equations π_{n+1} = π_n Pr(continue | η_n), with Σ_n π_n = 1, where "continue" denotes emitting the nonsynchronizing symbol. This gives π_n = (2/5)(2^{1−n} − 3^{−n}), from which the entropy rate follows by averaging the branching entropies H[X | η_n] over π_n.
FIG. S5. Unifilar HMC that generates the same process as the nonunifilar machine M_3S in Fig. S4.

b. Actually, A 2-State Machine
As mentioned above, the restricted domain over which P is invertible implies a smaller state set for the process generated by the nonunifilar machine M_3S. For all relevant mixed states, Pr(σ_1) = Pr(σ_2), suggesting that we devise an HMC combining the two states. However, the mapping function for the synchronizing symbol must still project definitively to a single state, to retain the countable infinity of mixed states. In fact, these restrictions ensure that the minimal nonunifilar HMC for the process is that of the Simple Nonunifilar Source, discussed in Section V.
If we declare that the 2-state machine's initial mixed state is η̃_0 = (0, 1), we may calculate the subsequent sequence of mixed states associated with emitting an increasingly long run of the nonsynchronizing symbol, using the mapping functions in Section V. For the underlying process to remain the same, the condition that must be met is that the symbol distributions agree: P(η_n) = P̃(η̃_n) for all n. This determines p and q. For n = 0 the condition is trivially met. The n = 1 condition fixes 1 − q in terms of p. Substituting this into the η̃_2 condition and then into the probability-distribution constraint for n = 2 gives q = 1/2 or q = 1/3, corresponding to two different 2-state nonunifilar HMCs that generate the same process as the 3-state HMC. This further emphasizes the lack of uniqueness of generative models. That said, by examining the underlying IFS, their HMCs can be recovered.

Parametrized HMCs and Their MSPs
Finally, consider an HMC with 3 symbols and 3 states, defined by the transition matrices in Eq. (S1), with β = (1 − α)/2 and y = 1 − 2x. From inspection, we see that α can take on any value from 0 to 1 and x may range from 0 to 1/2.
FIG. S6. Parametrized 3-state HMC defined in Eq. (S1) that generates MSPs in a variety of structures, depending on x and α. However, due to the rotational symmetry in the transition matrices, the attractor is radially symmetric around the simplex center.
Choosing α = 0.6 and sweeping x ∈ [0, 0.5] gives an MSP that first fills nearly the entire simplex, with probability mass concentrated at the corners, then shrinks to a finite machine with 3 states at x = 1/3, and finally grows once again into a fractal measure, as Fig. S6 illustrates. To demonstrate the ease and efficiency of calculating their entropy rates, Fig. S7 plots h_µ as a function of (x, α) ∈ [0, 0.5] × [0, 1]. It is an interesting side note that, despite the wildly different structures on display in Fig. S6, we see a smoothly varying entropy rate that does not appear to be strongly affected by the underlying structure. This case, and the expected impact of structure on the entropy rate more broadly, will be discussed in further detail in the sequel.

Appendix D: Estimation Errors for Finite-State Autocorrelation
Coarse-graining the mixed-state simplex into a set C of boxes of width ε, we may construct a finite-state approximation of the infinite-state MSP. It has been shown that, for any given box c of such an approximation, the difference in the entropy over the symbol distribution between the coarse-grained approximation and any mixed state within that box is bounded by a function of the box width involving the binary entropy function H_b(·), a bound that vanishes as ε → 0 [47]. Our task here is to consider the error in the autocorrelation of the sequence of mixed states since, if we can show that this is bounded, the error in the autocorrelation of the branching entropy must also be bounded.
At time zero, the autocorrelation equals A(L = 0) = ⟨X_0 X_0⟩. For the finite-state approximation, this is a sum over cells weighted by π_C, the stationary distribution over the coarse-grained mixed states, where π_C(i) is the stationary probability of cell i and c_i is the center of cell i. For the true process, the corresponding expression is an integral against dµ(η | i), the distribution over mixed states within cell i. The maximum distance between any two mixed states in a cell i is bounded by ε√|G|, the length of the longest diagonal in a hypercube of dimension |G|, by construction. Since the gradient of the L_2 norm is simply ∇‖x‖_2 = x/‖x‖_2, we obtain a bound on the difference in the autocorrelation at time zero that is proportional to the cell width. For larger lags L, write η = c_i + δ for some mixed state in cell i and expand the coarse-grained and true expressions for A(L), noting that the distance between the images of two points in the same cell contracts at a rate set by λ_w, the leading Lyapunov exponent of the mapping function. Substituting this and eliminating terms of order δ^2 identifies three sources of approximation error: (i) that due to a difference in the probability distribution over symbols, (ii) that in the mapping functions, and (iii) that from approximating the points by the centers of their cells.
For the first, we note that the total variation in the probability distribution over symbols is bounded by the distance between the mixed states at which the distributions are computed. So, for any two mixed states in the same cell, ‖Pr(X = x | η) − Pr(X = x | ζ)‖_TV ≤ ε√|G|. The first term is then the error due to the difference in the expectation value of the next state, given that we have calculated the probability distribution at c_i + δ rather than at c_i. Using Hölder's inequality, for two distributions P(X) and Q(X) and a function f, |E_P[f] − E_Q[f]| ≤ ‖f‖_p ‖P − Q‖_q with 1/p + 1/q = 1. Setting q = 1 bounds the difference by ‖f‖_∞ ‖P − Q‖_1. So, after taking the product with the cell centers c_i, we have that the first error is bounded by N ε√|G| at all lengths. For the second, we note that since the maps are contractions, λ < 0, and the distance between f^(x)(η) and f^(x)(ζ), where η and ζ are in the same cell i, is bounded by ε√|G|. As the length of a word w grows, λ_w → −∞ and the distance ‖f^(w)(η) − f^(w)(ζ)‖ → 0. At large L, this term vanishes, at a rate equal to the average maximal Lyapunov exponent of the IFS.
The final error is that in the autocorrelation from the cell approximation itself which is, likewise, bounded by the cell size; this is the same error as in A(0), viz. N ε√|G|. And so, in combination with the bound on the entropy, we may say, loosely speaking, that the error in the autocorrelation vanishes as ε → 0. Therefore, to find τ and estimate the error in Eq. (18) as a function of sample size, we take finer coarse-grained approximations until convergence in the autocorrelation curve is observed, and then calculate τ directly.