A Distance for HMMs Based on Aggregated Wasserstein Metric and State Registration

  • Yukun Chen
  • Jianbo Ye
  • Jia Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9910)


We propose a framework, named Aggregated Wasserstein, for computing a dissimilarity measure or distance between two Hidden Markov Models whose state conditional distributions are Gaussian. For such HMMs, the marginal distribution at any time position follows a Gaussian mixture distribution, a fact exploited to softly match, i.e., register, the states of the two HMMs. We refer to such HMMs as Gaussian mixture model-HMMs (GMM-HMMs). The registration of states is inspired by the intrinsic relationship between optimal transport and the Wasserstein metric between distributions. Specifically, the components of the marginal GMMs are matched by solving an optimal transport problem where the cost between components is the Wasserstein metric for Gaussian distributions. The solution of this optimization problem is a fast approximation to the Wasserstein metric between two GMMs. The new Aggregated Wasserstein distance is a semi-metric and can be computed without generating Monte Carlo samples. It is invariant to relabeling or permutation of the states. This distance quantifies the dissimilarity of GMM-HMMs by measuring both the difference between the two marginal GMMs and the difference between the two transition matrices. Our new distance is tested on the tasks of retrieval and classification of time series. Experiments on both synthetic and real data demonstrate its advantages in accuracy as well as efficiency in comparison with existing distances based on the Kullback-Leibler divergence.
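As a rough illustration of the registration step described above (a sketch, not the authors' code), the pairwise ground cost is the closed-form 2-Wasserstein distance between Gaussian components, and the component matching is a small transportation linear program over the mixture weights. The helper names `gaussian_w2` and `aggregated_w2` are illustrative; NumPy and SciPy are assumed.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linprog

def gaussian_w2(mu1, cov1, mu2, cov2):
    """Squared 2-Wasserstein distance between two Gaussians (closed form):
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    s2_half = sqrtm(cov2)
    cross = sqrtm(s2_half @ cov1 @ s2_half)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1 + cov2) - 2 * np.trace(np.real(cross)))

def aggregated_w2(weights1, gaussians1, weights2, gaussians2):
    """Approximate Wasserstein distance between two GMMs: solve an optimal
    transport LP over components, with pairwise Gaussian W2 as ground cost.
    Each gaussiansK entry is a (mean, covariance) pair."""
    m, n = len(weights1), len(weights2)
    C = np.array([[gaussian_w2(mu_i, S_i, mu_j, S_j)
                   for (mu_j, S_j) in gaussians2]
                  for (mu_i, S_i) in gaussians1])
    # Transportation constraints: row sums = weights1, column sums = weights2.
    A_eq = []
    for i in range(m):
        row = np.zeros((m, n)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(n):
        col = np.zeros((m, n)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    b_eq = np.concatenate([weights1, weights2])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return np.sqrt(res.fun)  # soft registration: res.x.reshape(m, n)
```

For two single-component mixtures this reduces to the exact Gaussian Wasserstein distance; the optimal coupling `res.x.reshape(m, n)` plays the role of the soft state registration.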


Keywords: Hidden Markov Model · Gaussian Mixture Model · Wasserstein distance



This research is supported by the National Science Foundation under grant number ECCS-1462230.

Supplementary material

Supplementary material 1: 419981_1_En_27_MOESM1_ESM.pdf (PDF, 2.8 MB)



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. College of Information Sciences and Technology, Pennsylvania State University, University Park, USA
  2. Department of Statistics, Pennsylvania State University, University Park, USA
