
Hidden Markov Models with Confidence

Conference paper. Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9653).

Abstract

We consider the problem of training a Hidden Markov Model (HMM) from fully observable data and predicting the hidden states of an observed sequence. We focus on applications that require a list of potential sequences as a prediction. We propose a novel method based on Conformal Prediction (CP) that, for an arbitrary confidence level \(1-\varepsilon \), produces a list of candidate sequences that contains the correct sequence of hidden states with probability at least \(1-\varepsilon \). We present experimental results confirming that this holds in practice. We compare our method with the standard approach (i.e., Maximum Likelihood with the List-Viterbi algorithm), which is sensitive to violations of the assumed distribution. We discuss the advantages and limitations of our method, and suggest future directions.


Notes

  1. Throughout this paper, CP implicitly denotes Smooth CP. The difference is that standard CP only guarantees \(\varepsilon \) as an upper bound on the error probability [9].
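To make the distinction in the note concrete, the following is a minimal Python sketch (not from the paper) of the smoothed conformal p-value; `alphas` and `alpha_new` are assumed to be nonconformity scores, and the names are illustrative only.

```python
import numpy as np

def smooth_p_value(alphas, alpha_new, rng=np.random.default_rng()):
    """Smoothed conformal p-value for a test nonconformity score.

    alphas    -- nonconformity scores of the other examples in the bag
    alpha_new -- nonconformity score of the test example
    With the random tie-breaking term tau, Smooth CP errs with probability
    exactly epsilon; dropping tau gives the standard p-value, for which
    epsilon is only an upper bound on the error rate [9].
    """
    alphas = np.asarray(alphas)
    tau = rng.uniform()
    greater = np.sum(alphas > alpha_new)
    equal = np.sum(alphas == alpha_new) + 1  # the test example ties with itself
    return (greater + tau * equal) / (len(alphas) + 1)
```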

References

  1. Forney Jr., G.D.: The Viterbi algorithm. Proc. IEEE 61(3), 268–278 (1973)

  2. Gammerman, A., Vovk, V.: Hedging predictions in machine learning. Comput. J. 50(2), 151–163 (2007)

  3. Melluish, T., Saunders, C., Nouretdinov, I., Vovk, V.: The typicalness framework: a comparison with the Bayesian approach. Royal Holloway, University of London (2001)

  4. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)

  5. Seshadri, N., Sundberg, C.W.: List Viterbi decoding algorithms with applications. IEEE Trans. Commun. 42(234), 313–323 (1994)

  6. Shafer, G., Vovk, V.: A tutorial on conformal prediction. J. Mach. Learn. Res. 9, 371–421 (2008)

  7. Viterbi, A.J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)

  8. Vovk, V., Fedorova, V., Nouretdinov, I., Gammerman, A.: Criteria of efficiency for conformal prediction. In: Gammerman, A., Luo, Z., Vega, J., Vovk, V. (eds.) COPA 2016. LNCS (LNAI), vol. 9653, pp. 23–39. Springer, Heidelberg (2016)

  9. Vovk, V., Gammerman, A., Shafer, G.: Algorithmic Learning in a Random World. Springer, New York (2005)


Acknowledgements

Giovanni Cherubin was supported by the EPSRC and the UK government as part of the Centre for Doctoral Training in Cyber Security at Royal Holloway, University of London (EP/K035584/1). This project has received funding from the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement no. 671555 (ExCAPE). This work was also supported by EPSRC grant EP/K033344/1 (“Mining the Network Behaviour of Bots”); by a Thales grant (“Development of automated methods for detection of anomalous behaviour”); by the National Natural Science Foundation of China grant (No. 61128003); and by the grant “Development of New Venn Prediction Methods for Osteoporosis Risk Assessment” from the Cyprus Research Promotion Foundation.

We are grateful to Alexander Gammerman, Kenneth Paterson, and Vladimir Vovk for useful discussions. We would also like to thank the anonymous reviewers for their insightful comments.

Author information

Correspondence to Giovanni Cherubin.

Appendices

A Validity of the Method

We are given a multiset (training set) of sequences \(\{(x_i, h_i)\}\), for \(i=1, 2, ..., n\). We select a significance level \(\varepsilon \in [0,1]\). Let \(x_{n+1}\) be a test sequence and \(h_{n+1}\) the corresponding sequence of hidden states. Our method outputs a prediction set \(\hat{H}=\{h_1, h_2, ...\}\). We show that the probability that \(\hat{H}\) contains the correct sequence is at least \(1-\varepsilon \).

Let us construct the following multiset:

$$\begin{aligned} Z_{train} = \{(x^{(j)}_i, h^{(j)}_i)\}, \quad j=1,2,\dots ,\ell _i, \quad i=1,2,\dots ,n, \end{aligned}$$

where \(\ell _i = |x_i| = |h_i|\).

Let \(\ell =|x_{n+1}|=|h_{n+1}|\). Let us consider the j-th element of the sequence \(x_{n+1}\). We assume exchangeability on the multiset

$$\begin{aligned} Z_{train} \cup \{(x^{(j)}_{n+1},h^{(j)}_{n+1})\}. \end{aligned}$$

We run:

$$\begin{aligned} \hat{H}_j = CP\left( x^{(j)}_{n+1}, Z_{train}, A, \frac{\varepsilon }{\ell }\right) , \end{aligned}$$

as defined in Algorithm 1. Thanks to the validity property of Smooth CP [9], the following holds:

$$\begin{aligned} P(h^{(j)}_{n+1} \notin \hat{H}_j) = \frac{\varepsilon }{\ell }. \end{aligned}$$

We repeat this for all observations in \(x_{n+1}\). We define \(\hat{H}\) as the set of all sequences of length \(\ell \) that can be generated by taking an element of \(\hat{H}_1\) as the first element, an element of \(\hat{H}_2\) as the second, and so on. The error probability of our method, i.e. the probability that the correct sequence \(h_{n+1}\) is not in the prediction set, is then:

$$\begin{aligned} P(h_{n+1} \notin \hat{H}) &= P\left( h_{n+1}^{(1)}\notin \hat{H}_1 \vee h_{n+1}^{(2)}\notin \hat{H}_2 \vee \dots \vee h_{n+1}^{(\ell )}\notin \hat{H}_\ell \right) \\ &\le \sum _{j=1}^\ell P\left( h_{n+1}^{(j)}\notin \hat{H}_j\right) = \ell \, \frac{\varepsilon }{\ell } = \varepsilon \end{aligned}$$

\(\qquad \blacksquare \)

It follows that \(\varepsilon \) is an upper bound on the error probability of the method, i.e. \(1-\varepsilon \) is a lower bound on the probability that the prediction set contains the correct sequence.
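Algorithm 1 is referenced but not reproduced here; the following Python sketch shows the construction used in the argument above, assuming a generic per-observation conformal predictor `conformal_set(observation, training_data, significance)` that returns a set of candidate hidden states. All names are illustrative, not the authors' implementation.

```python
import itertools

def predict_sequences(x_new, z_train, conformal_set, epsilon):
    """Per-position CP at significance epsilon / ell, combined by a
    Cartesian product. By the union bound, the resulting set of
    sequences misses the true hidden sequence with probability
    at most epsilon."""
    ell = len(x_new)
    per_position_sets = [
        conformal_set(x_new[j], z_train, epsilon / ell) for j in range(ell)
    ]
    # every way of picking one candidate state per position
    return [list(h) for h in itertools.product(*per_position_sets)]
```

Note that the prediction set can grow as the product of the per-position set sizes, which is the efficiency cost of the guarantee.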

Fig. 6. Distribution of the emission probabilities for the three hidden states in HMM-NORM (left-hand figure) and in HMM-GMM (right-hand figure).

B Datasets

B.1 HMM-NORM Dataset

We sampled 2000 sequences of length \(\ell =10\). The sequences were generated by using a continuous HMM with 3 hidden states, \(S=\{s_1,s_2,s_3\}\), start probabilities \(\varPi = \{0.6, 0.3, 0.1\}\), transition probabilities:

$$\begin{aligned} A = \{\alpha _{ij}\} = \begin{pmatrix} 0.7 &{} 0.2 &{} 0.1 \\ 0.3 &{} 0.5 &{} 0.2 \\ 0.3 &{} 0.3 &{} 0.4 \end{pmatrix}, \end{aligned}$$

and emission probabilities: \(b_{o|s_1} \sim \mathcal {N}(-2,0.7)\), \(b_{o|s_2} \sim \mathcal {N}(0,0.7)\), \(b_{o|s_3} \sim \mathcal {N}(2,0.7)\). Figure 6(a) shows the distributions of \(b_{o|s_1}\), \(b_{o|s_2}\), and \(b_{o|s_3}\).
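As a minimal sketch of how such a dataset can be generated (the paper does not specify an implementation; the use of NumPy and all names are assumptions), one could sample the 2000 sequences as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters as stated in B.1
pi = np.array([0.6, 0.3, 0.1])                # start probabilities
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.3, 0.3, 0.4]])               # transition probabilities
means, std = np.array([-2.0, 0.0, 2.0]), 0.7  # Gaussian emission per state

def sample_sequence(length=10):
    """Sample one (observations, hidden states) pair from the HMM."""
    states, obs = [], []
    s = rng.choice(3, p=pi)
    for _ in range(length):
        states.append(s)
        obs.append(rng.normal(means[s], std))
        s = rng.choice(3, p=A[s])
    return np.array(obs), np.array(states)

dataset = [sample_sequence() for _ in range(2000)]
```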

B.2 HMM-GMM Dataset

We sampled 2000 sequences of length \(\ell =10\). The sequences were generated by using a continuous HMM with 3 hidden states, \(S=\{s_1,s_2,s_3\}\), start probabilities \(\varPi = \{0.6, 0.3, 0.1\}\), transition probabilities:

$$\begin{aligned} A = \{\alpha _{ij}\} = \begin{pmatrix} 0.7 &{} 0.2 &{} 0.1 \\ 0.3 &{} 0.5 &{} 0.2 \\ 0.3 &{} 0.3 &{} 0.4 \end{pmatrix}. \end{aligned}$$

Emission probabilities were given by a mixture of two Normal distributions. Let \(\mathcal {G}(\mu ,\sigma ,w)\) be a mixture of two Normal distributions with means \(\mu =(\mu _1, \mu _2)\), standard deviations \(\sigma =(\sigma _1, \sigma _2)\), and weights \(w=(w_1, w_2)\). That is:

$$\begin{aligned} \mathcal {G}(\mu ,\sigma ,w) = \sum ^2_{i=1} w_i \mathcal {N}(\mu _i, \sigma _i). \end{aligned}$$

The model we used had emission probabilities: \(b_{o|s_1} \sim \mathcal {G}((0,2),(0.7,0.7),(0.7,0.3))\), \(b_{o|s_2} \sim \mathcal {G}((-2,-1),(0.25,0.25),(0.5,0.5))\), \(b_{o|s_3} \sim \mathcal {G}((2,3),(0.5,0.3),(0.7,0.3))\). Figure 6(b) shows the distributions of \(b_{o|s_1}\), \(b_{o|s_2}\), and \(b_{o|s_3}\).
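Only the emission step differs from B.1: each observation is drawn from the two-component mixture of the current hidden state. A minimal sketch of that step, under the same assumptions as the B.1 snippet, could look like this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-component Gaussian mixtures per state, as stated in B.2:
# (means, standard deviations, weights)
gmm_params = {
    0: ((0.0, 2.0),   (0.7, 0.7),   (0.7, 0.3)),  # state s1
    1: ((-2.0, -1.0), (0.25, 0.25), (0.5, 0.5)),  # state s2
    2: ((2.0, 3.0),   (0.5, 0.3),   (0.7, 0.3)),  # state s3
}

def sample_emission(state):
    """Draw one observation from the mixture of the given hidden state."""
    mu, sigma, w = gmm_params[state]
    k = rng.choice(2, p=w)          # pick a mixture component
    return rng.normal(mu[k], sigma[k])
```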


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Cherubin, G., Nouretdinov, I. (2016). Hidden Markov Models with Confidence. In: Gammerman, A., Luo, Z., Vega, J., Vovk, V. (eds.) Conformal and Probabilistic Prediction with Applications. COPA 2016. Lecture Notes in Computer Science, vol. 9653. Springer, Cham. https://doi.org/10.1007/978-3-319-33395-3_10


  • DOI: https://doi.org/10.1007/978-3-319-33395-3_10


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-33394-6

  • Online ISBN: 978-3-319-33395-3

