There are many use cases for the methodology described. For example, the failure predictions for a fleet of wind turbines may serve as input to reinforcement learning (POMDP) to optimize the schedule and route of the maintenance crews on a wind farm.
We illustrate our approach by going step by step through a real-world example. We start with historic observations such as multi-sensor time-series data for individual assets. We use expectation maximization (EM) to find the maximum likelihood or maximum a posteriori (MAP) estimates of the parameters of our model from Fig. 1. Typically, we use EM to solve larger problems with a few hundred sensors and more than 20 states. In our experience, EM is computationally more tractable for larger real-world problems than Markov chain Monte Carlo (MCMC), its modern variant Hamiltonian Monte Carlo (HMC), or the even more expensive variational inference approaches.
When using EM, the number of states of the HMM and the number of distinct distributions that make up our degradation processes are treated as hyper-parameters, which we input into our model. Posterior predictive checks (PPC) are then used to find the right set of hyper-parameters.
In Fig. 4 we see an example of a transition matrix of a 6-state model. The off-diagonal elements show the probabilities of transitioning between the HMM states, whereas the diagonal elements represent the probabilities of remaining in each state. The lower triangular part of the matrix is set to zero to enforce an absorbing Markov chain in which states move only from left to right towards the failure state. The figure also shows how the first two states are composed of mixtures of common distributions. The thickness of the lines between the pdfs and the transition matrix indicates how much each common distribution contributes to the pertinent state distribution. For example, state 1 consists mainly of distributions 4 and 6, while distribution 5 contributes mainly to state 2. In other words, the observation distributions of the states are mixtures of a set of common (simpler) distributions shared across all states of the HMM. This approach can be seen as a generalization of the tied-mixture HMM, in which the shared distributions are limited to be Gaussian, whereas we allow for hierarchical mixtures of all distributions of the exponential family. In our approach any topic, or archetype, can a priori transition to any other archetype. Using a sparse Dirichlet prior on the transition distributions, we learn a meaningful dependence between archetypes through posterior inference.
As mentioned before, expert knowledge of the failure mechanism may be incorporated into the model by enforcing constraints on the structure of the transition matrix.
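The two structural ideas in this paragraph, a transition matrix with a zeroed lower triangle and state distributions built as mixtures of shared components, can be illustrated with a minimal sketch. The function names, the use of Gaussian components, and all numeric values below are assumptions made for illustration only; the model described here admits any exponential-family component and learns its parameters with EM.

```python
import numpy as np

def left_to_right_transitions(raw):
    """Zero the lower triangle of a non-negative matrix and renormalize rows.

    The result only allows moves to the same or a later state, and the
    last state becomes absorbing (its row is [0, ..., 0, 1]).
    """
    upper = np.triu(np.asarray(raw, dtype=float))
    return upper / upper.sum(axis=1, keepdims=True)

def state_densities(x, weights, means, stds):
    """Density of a scalar observation x under every hidden state.

    weights[s, k] is the contribution of shared component k to state s
    (each row sums to 1). Components are Gaussian here purely as an
    illustrative choice.
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    comp = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2.0 * np.pi))
    return weights @ comp

# Hypothetical 3-state example with 2 shared components.
T = left_to_right_transitions(np.ones((3, 3)))
W = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
print(T)
print(state_densities(1.0, W, means=[0.0, 4.0], stds=[1.0, 1.0]))
```

Additional expert constraints, such as forbidding a jump from the first state directly to failure, could be expressed in the same way by zeroing further entries before renormalizing.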
Figure 2 shows an example of how we infer the hidden states given the sequence of observations, in our case a time series of sensor data. We see how the prediction of the expected failure time changes as additional sensor data is revealed over time. Using a Bayesian model like the MMMM enables us to calculate a new posterior for each newly observed data point, thus gaining statistical strength and better prediction accuracy. Calculating the posterior is simple and fast; it could even be done on edge devices. Contrast this simple application of Bayes' formula with the approach of discriminative models (e.g. regression models), where one would have to recalculate an improved model on the whole historic data set in order to add newly observed time-series data.
Further, there is no obvious way with an LSTM to arrive at a better prediction based on more real-time data points, since the posterior distribution is hidden in the weights. Similar to regression models, an LSTM typically requires a new round of back-propagation to learn from additional real-time values. Another advantage of an HMM of mixtures over an LSTM is that it captures the natural (hidden) groupings within the data. Each group represents a different asset profile, i.e. a distinct degradation process, and thus a distinct failure curve (top left insert of Fig. 5).
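The per-observation posterior update can be sketched as one step of standard HMM forward filtering: propagate the current belief through the transition matrix, weight it by the likelihood of the new observation, and renormalize. The function name and the numbers are hypothetical, and the likelihood vector is assumed to come from the fitted state distributions.

```python
import numpy as np

def update_belief(belief, T, likelihood):
    """One Bayes-filter step for a discrete hidden state.

    belief:     current posterior over states, shape (n_states,)
    T:          transition matrix, T[i, j] = P(next = j | current = i)
    likelihood: p(new observation | state), shape (n_states,)
    """
    predicted = np.asarray(belief, dtype=float) @ np.asarray(T, dtype=float)
    unnormalized = predicted * np.asarray(likelihood, dtype=float)
    return unnormalized / unnormalized.sum()

# Hypothetical two-state example: healthy -> failed (absorbing).
T = np.array([[0.9, 0.1],
              [0.0, 1.0]])
belief = np.array([1.0, 0.0])
for lik in ([0.8, 0.2], [0.3, 0.7], [0.1, 0.9]):
    belief = update_belief(belief, T, lik)
    print(belief)
```

Each step costs one matrix-vector product, which is consistent with the remark that the update is cheap enough for edge devices.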
Figure 5 shows an example of degradation states evolving over time. The thickness of the lines between the states indicates the probability of transitioning from one state to another. All assets start in state A, the initial state. After a certain amount of time, they end up in the terminal failure state P. See Fig. 2 for an example of how the states evolve over time (in that figure the final state is called F). The other health-states (B to E, L to O, and F to J, for example) represent states of an asset as it progresses towards failure.
The data frequency and the Markov chain evolution are decoupled, which allows for real-time data arriving at different rates.
Once the model is fit, one can calculate the survival curves for the different degradation profiles, which gives a summarized view of how assets fail as a baseline (see the right plot in Fig. 6). The figure also shows how the model can be used to “infer” which degradation profile the asset belongs to as new data arrives (colored graph on the left in the middle). The bottom left plot shows the entropy of the model’s belief for a specific asset as more data is observed. Entropy, as a measure of uncertainty, decreases over time as more and more data points of the time series are revealed. The decreased entropy shows that after about 50 observations we can already be rather sure (entropy about 0.5) to which profile the asset at hand belongs. Thus the prediction of the profile and the life expectancy is rather reliable after observing only a third of the asset's lifetime (for this particular example of a degrading pump).
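Both quantities discussed above can be sketched in a few lines: a survival curve obtained by propagating the initial state distribution through an absorbing transition matrix, and the entropy of a belief over degradation profiles. The function names and example values are assumptions, the last state is assumed to be the failure state, and the natural logarithm is used here although the text does not state the base of the entropy.

```python
import numpy as np

def survival_curve(pi, T, horizon):
    """P(asset has not yet failed) at steps 0 .. horizon-1.

    Assumes the last state is the absorbing failure state.
    """
    dist = np.asarray(pi, dtype=float)
    T = np.asarray(T, dtype=float)
    survival = []
    for _ in range(horizon):
        survival.append(1.0 - dist[-1])
        dist = dist @ T
    return np.array(survival)

def belief_entropy(p):
    """Shannon entropy (in nats) of a discrete belief; 0 * log 0 is taken as 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum())

# Hypothetical 3-state degradation profile.
T = np.array([[0.95, 0.05, 0.00],
              [0.00, 0.90, 0.10],
              [0.00, 0.00, 1.00]])
print(survival_curve([1.0, 0.0, 0.0], T, horizon=5))
print(belief_entropy([1 / 3, 1 / 3, 1 / 3]), belief_entropy([0.9, 0.05, 0.05]))
```

A belief concentrated on one profile yields a lower entropy than a uniform belief, which matches the decreasing curve described for the bottom left plot.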
Having a measure of accuracy for failure prediction is very important for practitioners. Traditional ROC curves are not a good choice, since they do not capture the dynamic nature of our approach, i.e. recalculating new posteriors when new data points arrive. Typically, practitioners face trade-off questions, for example: what is the right point in time to replace a part? Replacing assets too early leads to unnecessary expenses, since on average parts are then replaced before they break. Running assets too long risks unforeseen down-time. Using such trade-offs as a measure of model quality is often more meaningful than ROC-type accuracy curves.
Risk tolerance is a typical constraint for operations managers. Using a trade-off diagram, the manager can choose the model that predicts the longest operating hours at a given risk level. Figure 7 shows the trade-off between risk of failure and operating hours (mean up-time). Typically, the model that produces the fewest false negatives, i.e. the steepest hockey stick failure curve, is the most efficient. To the operator, the onset of the hockey stick indicates the latest point in time for exchanging the asset, given a chosen risk level (12.5% in the pictured example). Flat hockey stick failure curves, i.e. those with a higher number of false positives, typically lead to reduced operating hours, since they prompt the operator to exchange parts before the end of their lifetime. Sometimes, the steepest hockey stick curves come with lower accuracy. The operator could mitigate reduced model accuracy by increasing the spare parts inventory, for example, thus still profiting from longer hours of asset operation. We see that ROC-type accuracy is not always the most important metric. The trade-off between failure rate and operating time can be more meaningful.
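The trade-off can be made concrete with a deliberately simplified sketch. It assumes a fixed replacement time applied to a set of known or simulated failure times; this is an illustrative stand-in, not the model-driven curve of Fig. 7. For each candidate replacement time it computes the fraction of assets that fail beforehand (risk) and the mean up-time, then selects the latest time that stays within the risk budget. All names and numbers are hypothetical.

```python
import numpy as np

def tradeoff(failure_times, tau):
    """Risk of failure and mean up-time when every asset is replaced at time tau."""
    ft = np.asarray(failure_times, dtype=float)
    risk = float(np.mean(ft < tau))               # share failing before replacement
    mean_uptime = float(np.mean(np.minimum(ft, tau)))
    return risk, mean_uptime

def latest_replacement_time(failure_times, candidate_taus, max_risk):
    """Largest candidate replacement time whose risk does not exceed max_risk."""
    feasible = [tau for tau in candidate_taus
                if tradeoff(failure_times, tau)[0] <= max_risk]
    return max(feasible) if feasible else None

# Hypothetical failure times (operating hours) of eight assets.
failures = [100, 200, 300, 400, 500, 600, 700, 800]
tau = latest_replacement_time(failures, range(0, 900, 50), max_risk=0.125)
print(tau, tradeoff(failures, tau))
```

Under these assumptions, raising the risk budget moves the replacement time later and increases mean up-time, which is the movement along the curve that the operator weighs.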
So far, we have shown in our example how we determine asset health evolving over time (Fig. 2). We are able to predict the degradation of individual assets by deriving profiles, which lead to different survival curves (Fig. 6).
Next, we have to derive actionable insights from the predictions by finding an optimal maintenance and resource allocation policy. We support the practitioner by determining the best action to take on an asset at any given moment, and assigning the right repair task to the right resource, i.e. who is to repair what and how.
Before we can make decisions about repairing individual assets, we need to understand which part of the value function (Fig. 3) to use, depending on the given asset health-state. Figure 8 shows how the best action changes over time depending on the transition state probabilities and the pertinent value function (green, yellow or red). The states of assets are not observed directly, but our model can be used to infer the posterior distribution over the hidden states. For example, the asset of Fig. 8 has a low probability of failure around time point 200. According to the pertinent value function (policy), the best action is to “take no action”. Around time point 230 the asset has a high probability of being in a failure state (brown); thus, the value function recommends a “replace” action as the optimal action at this point in time.
More quantitatively, since we do not know the current (hidden) state, we generalize a Markov decision process (MDP) to a partially observable MDP (POMDP). With a POMDP we observe the state only indirectly and relate the observations to the underlying hidden state probabilistically (Fig. 8). Being uncertain about the state of the asset, we introduce rewards R (e.g. costs to repair, to replace, or the cost of down-time), a set of actions A, and transition probabilities T between health-states; for details see Eq. 2. The transition probabilities T and the initial probabilities \(\pi \) of the POMDP are the same as our HMM parameters, since we have used the hidden states S of our HMM to model the degradation. The set of actions A consists of \(a_0\) = “Do Nothing”, \(a_1\) = “Repair”, and \(a_2\) = “Replace”, see Fig. 9.
We do not know the states, which are hidden. We only observe time-dependent sensor data. From the observed sensor data we construct a belief, i.e. a posterior distribution over the states. Given the belief, we use the optimal policy, i.e. the solution of the POMDP, to find the optimal action to take at the current level of uncertainty. The POMDP solution is represented by a piecewise linear and convex value function calculated by the value iteration algorithm. Once the value function is computed, the best action to take for each asset at time t is determined by finding the action with the highest value given the current state probabilities at time t, as shown in Fig. 9.
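A piecewise linear and convex value function is commonly stored as a set of vectors (often called alpha vectors), each tied to an action, with the value of a belief given by the maximum inner product. The sketch below shows only the action lookup for a given belief; it does not perform value iteration, and the vectors, the three-state layout, and the function name are hypothetical values chosen for illustration.

```python
import numpy as np

ACTIONS = ("Do Nothing", "Repair", "Replace")

def best_action(belief, alpha_vectors, alpha_actions):
    """Pick the action of the alpha vector that maximizes alpha . belief.

    alpha_vectors: shape (n_vectors, n_states), assumed to come from a
                   POMDP solver such as value iteration
    alpha_actions: index into ACTIONS for each alpha vector
    """
    values = np.asarray(alpha_vectors, dtype=float) @ np.asarray(belief, dtype=float)
    k = int(np.argmax(values))
    return ACTIONS[alpha_actions[k]], float(values[k])

# Hypothetical value function over states (healthy, degraded, failed).
alphas = np.array([[0.0, -5.0, -100.0],     # Do Nothing
                   [-10.0, -10.0, -60.0],   # Repair
                   [-30.0, -30.0, -30.0]])  # Replace
owners = [0, 1, 2]

for b in ([0.9, 0.1, 0.0], [0.2, 0.6, 0.2], [0.05, 0.15, 0.8]):
    print(b, best_action(b, alphas, owners))
```

With these illustrative numbers the recommended action shifts from “Do Nothing” to “Repair” to “Replace” as the belief mass moves towards the failure state, mirroring the behavior described for Figs. 8 and 9.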