A hybrid neural network hidden Markov model approach for automatic story segmentation

Original Research


We propose a hybrid neural network hidden Markov model (NN-HMM) approach for automatic story segmentation. A story is treated as an instance of an underlying topic (a hidden state), and words are generated from the distribution of that topic. A transition from one topic to another indicates a story boundary. Unlike the traditional HMM approach, in which the emission probability of each state is calculated from a topic-dependent language model, we use a deep neural network (DNN) to directly map the word distribution into topic posterior probabilities. DNNs are known to learn meaningful continuous features for words and hence have better discriminative and generalization capability than n-gram models. Specifically, we investigate three neural network structures: a feed-forward neural network, a recurrent neural network with long short-term memory cells (LSTM-RNN), and a modified LSTM-RNN with multi-task learning ability. Experimental results on the TDT2 corpus show that the proposed NN-HMM approach significantly outperforms the traditional HMM approach and achieves state-of-the-art performance in story segmentation.
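The decoding scheme described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes the DNN has already produced per-sentence topic posteriors (here passed in as a plain array), converts them to scaled likelihoods by dividing by the topic priors (the standard hybrid NN-HMM trick), runs Viterbi decoding over topic states, and reports a story boundary wherever the decoded topic changes. The function name, the transition matrix, and the toy inputs are all illustrative assumptions.

```python
import numpy as np

def viterbi_boundaries(posteriors, priors, trans, init):
    """Viterbi decoding over topic states for story segmentation.

    posteriors: (T, K) DNN topic posteriors for T sentences, K topics
                (assumed to come from a trained network; hypothetical here).
    priors:     (K,) topic prior probabilities.
    trans:      (K, K) topic transition matrix.
    init:       (K,) initial topic distribution.
    Returns the decoded topic path and the boundary positions
    (sentence indices where the decoded topic changes).
    """
    T, K = posteriors.shape
    # Hybrid NN-HMM emission score: scaled likelihood = posterior / prior.
    log_emit = np.log(posteriors + 1e-12) - np.log(priors + 1e-12)
    log_trans = np.log(trans + 1e-12)

    delta = np.log(init + 1e-12) + log_emit[0]   # best score ending in each state
    psi = np.zeros((T, K), dtype=int)            # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # (K, K): from-state x to-state
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]

    # Backtrace the best topic path.
    path = np.zeros(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]

    # A topic change between adjacent sentences marks a story boundary.
    boundaries = [t for t in range(1, T) if path[t] != path[t - 1]]
    return path, boundaries
```

With sticky self-transitions (e.g. 0.9 on the diagonal) and confident posteriors, the decoder segments a toy 6-sentence stream whose posteriors switch topics halfway into two stories with a single boundary at sentence 3.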


Neural network · Long short-term memory · Hidden Markov model · Multi-task learning · Story segmentation · Topic modeling



This work was supported by the National Natural Science Foundation of China (61571363), the Aeronautical Science Foundation of China (20155553038 and 20155553040), and the Science and Technology on Avionics Integration Laboratory.



Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  1. Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi’an, China
  2. School of Computer and Information Engineering, Luoyang Institute of Science and Technology, Luoyang, China
  3. Temasek Laboratories@NTU, Nanyang Technological University, Singapore