
Hidden Markov Models with Confidence

Conference paper. Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9653).

Abstract

We consider the problem of training a Hidden Markov Model (HMM) from fully observable data and predicting the hidden states of an observed sequence. We focus on applications that require a list of potential sequences as a prediction. We propose a novel method based on Conformal Prediction (CP) that, for an arbitrary confidence level \(1-\varepsilon \), produces a list of candidate sequences that contains the correct sequence of hidden states with probability at least \(1-\varepsilon \). We present experimental results confirming that this holds in practice. We compare our method with the standard approach (i.e., Maximum Likelihood with the List-Viterbi algorithm), which is sensitive to violations of the assumed distribution. We discuss the advantages and limitations of our method, and suggest future directions.


Notes

  1. Throughout this paper, CP implicitly denotes Smooth CP. The difference is that standard CP only guarantees \(\varepsilon \) as an upper bound on the error probability [9].
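To make the distinction in the note concrete, the following is a minimal Python sketch (not from the paper) of the smoothed conformal p-value; `alphas` and `alpha_new` are assumed to be nonconformity scores, and the names are illustrative only.

```python
import numpy as np

def smooth_p_value(alphas, alpha_new, rng=np.random.default_rng()):
    """Smoothed conformal p-value for a test nonconformity score.

    alphas    -- nonconformity scores of the other examples in the bag
    alpha_new -- nonconformity score of the test example
    With the random tie-breaking term tau, Smooth CP errs with probability
    exactly epsilon; dropping tau gives the standard p-value, for which
    epsilon is only an upper bound on the error rate [9].
    """
    alphas = np.asarray(alphas)
    tau = rng.uniform()
    greater = np.sum(alphas > alpha_new)
    equal = np.sum(alphas == alpha_new) + 1  # the test example ties with itself
    return (greater + tau * equal) / (len(alphas) + 1)
```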

References

  1. Forney Jr., G.D.: The Viterbi algorithm. Proc. IEEE 61(3), 268–278 (1973)

  2. Gammerman, A., Vovk, V.: Hedging predictions in machine learning. Comput. J. 50(2), 151–163 (2007)

  3. Melluish, T., Saunders, C., Nouretdinov, I., Vovk, V.: The typicalness framework: a comparison with the Bayesian approach. Royal Holloway, University of London (2001)

  4. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)

  5. Seshadri, N., Sundberg, C.W.: List Viterbi decoding algorithms with applications. IEEE Trans. Commun. 42(234), 313–323 (1994)

  6. Shafer, G., Vovk, V.: A tutorial on conformal prediction. J. Mach. Learn. Res. 9, 371–421 (2008)

  7. Viterbi, A.J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)

  8. Vovk, V., Fedorova, V., Nouretdinov, I., Gammerman, A.: Criteria of efficiency for conformal prediction. In: Gammerman, A., Luo, Z., Vega, J., Vovk, V. (eds.) COPA 2016. LNCS (LNAI), vol. 9653, pp. 23–39. Springer, Heidelberg (2016)

  9. Vovk, V., Gammerman, A., Shafer, G.: Algorithmic Learning in a Random World. Springer, New York (2005)


Acknowledgements

Giovanni Cherubin was supported by the EPSRC and the UK government as part of the Centre for Doctoral Training in Cyber Security at Royal Holloway, University of London (EP/K035584/1). This project has received funding from the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement no. 671555 (ExCAPE). This work was also supported by EPSRC grant EP/K033344/1 (“Mining the Network Behaviour of Bots”); by a Thales grant (“Development of automated methods for detection of anomalous behaviour”); by the National Natural Science Foundation of China grant (No. 61128003); and by the grant “Development of New Venn Prediction Methods for Osteoporosis Risk Assessment” from the Cyprus Research Promotion Foundation.

We are grateful to Alexander Gammerman, Kenneth Paterson, and Vladimir Vovk for useful discussions. We would also like to thank the anonymous reviewers for their insightful comments.

Author information

Correspondence to Giovanni Cherubin.

Appendices

A Validity of the Method

We are given a multiset (training set) of sequences \(\{(x_i, h_i)\}\), for \(i=1, 2, ..., n\). We select a significance level \(\varepsilon \in [0,1]\). Let \(x_{n+1}\) be a test sequence and \(h_{n+1}\) the corresponding sequence of hidden states. Our method outputs a prediction set \(\hat{H}=\{h_1, h_2, ...\}\). We show that the probability that \(\hat{H}\) contains the correct sequence is at least \(1-\varepsilon \).

Let us construct the following multiset:

$$\begin{aligned} Z_{train} = \{(x^{(j)}_i, h^{(j)}_i)\}, \quad j=1,2,\dots ,\ell _i, \quad i=1,2,\dots ,n, \end{aligned}$$

where \(\ell _i = |x_i| = |h_i|\).

Let \(\ell =|x_{n+1}|=|h_{n+1}|\). Let us consider the j-th element of the sequence \(x_{n+1}\). We assume exchangeability on the multiset

$$\begin{aligned} Z_{train} \cup \{(x^{(j)}_{n+1},h^{(j)}_{n+1})\}. \end{aligned}$$

We run:

$$\begin{aligned} \hat{H}_j = CP\left( x^{(j)}_{n+1}, Z_{train}, A, \frac{\varepsilon }{\ell }\right) , \end{aligned}$$

as defined in Algorithm 1. Thanks to the validity property of Smooth CP [9], the following holds:

$$\begin{aligned} P(h^{(j)}_{n+1} \notin \hat{H}_j) = \frac{\varepsilon }{\ell }. \end{aligned}$$

We repeat this for all observations in \(x_{n+1}\). We define \(\hat{H}\) as the set of all sequences of length \(\ell \) that can be generated by taking an element of \(\hat{H}_1\) as the first element, an element of \(\hat{H}_2\) as the second, and so on. The error probability of our method, i.e. the probability that the correct sequence \(h_{n+1}\) is not in the prediction set, is then:

$$\begin{aligned} P(h_{n+1} \notin \hat{H}) &= P\left( h_{n+1}^{(1)}\notin \hat{H}_1 \vee h_{n+1}^{(2)}\notin \hat{H}_2 \vee \dots \vee h_{n+1}^{(\ell )}\notin \hat{H}_\ell \right) \\ &\le \sum _{j=1}^\ell P\left( h_{n+1}^{(j)}\notin \hat{H}_j\right) = \ell \, \frac{\varepsilon }{\ell } = \varepsilon \end{aligned}$$

\(\qquad \blacksquare \)

It follows that \(\varepsilon \) is an upper bound on the error probability of the method, i.e. \(1-\varepsilon \) is a lower bound on the probability that the prediction set contains the correct sequence.
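Algorithm 1 is referenced but not reproduced here; the following Python sketch shows the construction used in the argument above, assuming a generic per-observation conformal predictor `conformal_set(observation, training_data, significance)` that returns a set of candidate hidden states. All names are illustrative, not the authors' implementation.

```python
import itertools

def predict_sequences(x_new, z_train, conformal_set, epsilon):
    """Per-position CP at significance epsilon / ell, combined by a
    Cartesian product. By the union bound, the resulting set of
    sequences misses the true hidden sequence with probability
    at most epsilon."""
    ell = len(x_new)
    per_position_sets = [
        conformal_set(x_new[j], z_train, epsilon / ell) for j in range(ell)
    ]
    # every way of picking one candidate state per position
    return [list(h) for h in itertools.product(*per_position_sets)]
```

Note that the prediction set can grow as the product of the per-position set sizes, which is the efficiency cost of the guarantee.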

Fig. 6. Distribution of the emission probabilities for the three hidden states in HMM-NORM (left-hand figure) and in HMM-GMM (right-hand figure).

B Datasets

B.1 HMM-NORM Dataset

We sampled 2000 sequences of length \(\ell =10\). The sequences were generated by using a continuous HMM with 3 hidden states, \(S=\{s_1,s_2,s_3\}\), start probabilities \(\varPi = \{0.6, 0.3, 0.1\}\), transition probabilities:

$$\begin{aligned} A = \{\alpha _{ij}\} = \begin{pmatrix} 0.7 &{} 0.2 &{} 0.1 \\ 0.3 &{} 0.5 &{} 0.2 \\ 0.3 &{} 0.3 &{} 0.4 \end{pmatrix}, \end{aligned}$$

and emission probabilities: \(b_{o|s_1} \sim \mathcal {N}(-2,0.7)\), \(b_{o|s_2} \sim \mathcal {N}(0,0.7)\), \(b_{o|s_3} \sim \mathcal {N}(2,0.7)\). Figure 6(a) shows the distributions of \(b_{o|s_1}\), \(b_{o|s_2}\), and \(b_{o|s_3}\).
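As a minimal sketch of how such a dataset can be generated (the paper does not specify an implementation; the use of NumPy and all names are assumptions), one could sample the 2000 sequences as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters as stated in B.1
pi = np.array([0.6, 0.3, 0.1])                # start probabilities
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.3, 0.3, 0.4]])               # transition probabilities
means, std = np.array([-2.0, 0.0, 2.0]), 0.7  # Gaussian emission per state

def sample_sequence(length=10):
    """Sample one (observations, hidden states) pair from the HMM."""
    states, obs = [], []
    s = rng.choice(3, p=pi)
    for _ in range(length):
        states.append(s)
        obs.append(rng.normal(means[s], std))
        s = rng.choice(3, p=A[s])
    return np.array(obs), np.array(states)

dataset = [sample_sequence() for _ in range(2000)]
```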

B.2 HMM-GMM Dataset

We sampled 2000 sequences of length \(\ell =10\). The sequences were generated by using a continuous HMM with 3 hidden states, \(S=\{s_1,s_2,s_3\}\), start probabilities \(\varPi = \{0.6, 0.3, 0.1\}\), transition probabilities:

$$\begin{aligned} A = \{\alpha _{ij}\} = \begin{pmatrix} 0.7 &{} 0.2 &{} 0.1 \\ 0.3 &{} 0.5 &{} 0.2 \\ 0.3 &{} 0.3 &{} 0.4 \end{pmatrix}. \end{aligned}$$

Emission probabilities were given by a mixture of two Normal distributions. Let \(\mathcal {G}(\mu ,\sigma ,w)\) be a mixture of two Normal distributions with means \(\mu =(\mu _1, \mu _2)\), standard deviations \(\sigma =(\sigma _1, \sigma _2)\), and weights \(w=(w_1, w_2)\). That is:

$$\begin{aligned} \mathcal {G}(\mu ,\sigma ,w) = \sum ^2_{i=1} w_i \mathcal {N}(\mu _i, \sigma _i). \end{aligned}$$

The model we used had emission probabilities: \(b_{o|s_1} \sim \mathcal {G}((0,2),(0.7,0.7),(0.7,0.3))\), \(b_{o|s_2} \sim \mathcal {G}((-2,-1),(0.25,0.25),(0.5,0.5))\), \(b_{o|s_3} \sim \mathcal {G}((2,3),(0.5,0.3),(0.7,0.3))\). Figure 6(b) shows the distributions of \(b_{o|s_1}\), \(b_{o|s_2}\), and \(b_{o|s_3}\).
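Only the emission step differs from B.1: each observation is drawn from the two-component mixture of the current hidden state. A minimal sketch of that step, under the same assumptions as the B.1 snippet, could look like this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-component Gaussian mixtures per state, as stated in B.2:
# (means, standard deviations, weights)
gmm_params = {
    0: ((0.0, 2.0),   (0.7, 0.7),   (0.7, 0.3)),  # state s1
    1: ((-2.0, -1.0), (0.25, 0.25), (0.5, 0.5)),  # state s2
    2: ((2.0, 3.0),   (0.5, 0.3),   (0.7, 0.3)),  # state s3
}

def sample_emission(state):
    """Draw one observation from the mixture of the given hidden state."""
    mu, sigma, w = gmm_params[state]
    k = rng.choice(2, p=w)          # pick a mixture component
    return rng.normal(mu[k], sigma[k])
```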


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Cherubin, G., Nouretdinov, I. (2016). Hidden Markov Models with Confidence. In: Gammerman, A., Luo, Z., Vega, J., Vovk, V. (eds.) Conformal and Probabilistic Prediction with Applications. COPA 2016. Lecture Notes in Computer Science, vol. 9653. Springer, Cham. https://doi.org/10.1007/978-3-319-33395-3_10


  • DOI: https://doi.org/10.1007/978-3-319-33395-3_10


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-33394-6

  • Online ISBN: 978-3-319-33395-3

