Sequential Embedding Induced Text Clustering, a Non-parametric Bayesian Approach

  • Tiehang DuanEmail author
  • Qi Lou
  • Sargur N. Srihari
  • Xiaohui Xie
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11441)


Current state-of-the-art nonparametric Bayesian text clustering methods model documents through multinomial distribution on bags of words. Although these methods can effectively utilize the word burstiness representation of documents and achieve decent performance, they do not explore the sequential information of text and relationships among synonyms. In this paper, the documents are modeled as the joint of bags of words, sequential features and word embeddings. We proposed Sequential Embedding induced Dirichlet Process Mixture Model (SiDPMM) to effectively exploit this joint document representation in text clustering. The sequential features are extracted by the encoder-decoder component. Word embeddings produced by the continuous-bag-of-words (CBOW) model are introduced to handle synonyms. Experimental results demonstrate the benefits of our model in two major aspects: (1) improved performance across multiple diverse text datasets in terms of the normalized mutual information (NMI); (2) more accurate inference of ground truth cluster numbers with regularization effect on tiny outlier clusters.


  1. 1.
    Berkhin, P.: A survey of clustering data mining techniques. In: Kogan, J., Nicholas, C., Teboulle, M. (eds.) Grouping Multidimensional Data, pp. 25–71. Springer, Heidelberg (2006). Scholar
  2. 2.
    Blei, D.M., Jordan, M.I.: Variational inference for Dirichlet process mixtures. Bayesian Anal. 1(1), 121–143 (2006). Scholar
  3. 3.
    Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)zbMATHGoogle Scholar
  4. 4.
    Cai, D., He, X., Han, J.: SRDA: an efficient algorithm for large-scale discriminant analysis. IEEE Trans. Knowl. Data Eng. 20(1), 1–12 (2008)CrossRefGoogle Scholar
  5. 5.
    Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP 2014, pp. 1724–1734. Association for Computational Linguistics, Doha, Qatar, October 2014.
  6. 6.
    Duan, T., Pinto, J.P., Xie, X.: Parallel clustering of single cell transcriptomic data with split-merge sampling on Dirichlet process mixtures. Bioinformatics p. bty702 (2018).
  7. 7.
    Duan, T., Srihari, S.N.: Pseudo boosted deep belief network. In: Villa, A.E.P., Masulli, P., Pons Rivero, A.J. (eds.) ICANN 2016. LNCS, vol. 9887, pp. 105–112. Springer, Cham (2016). Scholar
  8. 8.
    Duan, T., Srihari, S.N.: Layerwise interweaving convolutional LSTM. In: Mouhoub, M., Langlais, P. (eds.) AI 2017. LNCS, vol. 10233, pp. 272–277. Springer, Heidelberg (2017). Scholar
  9. 9.
    Gomez, J.C., Moens, M.F.: PCA document reconstruction for email classification. Comput. Stat. Data Anal. 56(3), 741–751 (2012)MathSciNetCrossRefGoogle Scholar
  10. 10.
    Gu, Y., Chen, S., Marsic, I.: Deep multimodal learning for emotion recognition in spoken language. CoRR abs/1802.08332 (2018)Google Scholar
  11. 11.
    Gu, Y., Li, X., Chen, S., Zhang, J., Marsic, I.: Speech intention classification with multimodal deep learning. In: Mouhoub, M., Langlais, P. (eds.) AI 2017. LNCS (LNAI), vol. 10233, pp. 260–271. Springer, Cham (2017). Scholar
  12. 12.
    Hori, C., Hori, T., Lee, T., Sumi, K., Hershey, J.R., Marks, T.K.: Attention-based multimodal fusion for video description. CoRR abs/1701.03126 (2017)Google Scholar
  13. 13.
    Hotho, A., Staab, S., Maedche, A.: Ontology-based text clustering. In: Proceedings of the IJCAI 2001 Workshop Text Learning: Beyond Supervision (2001)Google Scholar
  14. 14.
    Huang, R., Yu, G., Wang, Z.: Dirichlet process mixture model for document clustering with feature partition. IEEE Trans. Knowl. Data Eng. 25(8), 1748–1759 (2013)CrossRefGoogle Scholar
  15. 15.
    Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014).
  16. 16.
    Li, Y., et al.: Towards differentially private truth discovery for crowd sensing systems. CoRR abs/1810.04760 (2018)Google Scholar
  17. 17.
    Liu, M., Chen, L., Liu, B., Wang, X.: VRCA: a clustering algorithm for massive amount of texts. In: IJCAI 2015, pp. 2355–2361. AAAI Press (2015).
  18. 18.
    Luong, M., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. CoRR abs/1508.04025 (2015)Google Scholar
  19. 19.
    Mikolov, T., et al.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) NIPS, pp. 3111–3119. Curran Associates, Inc. (2013)Google Scholar
  20. 20.
    Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: HLT-NAACL, pp. 746–751 (2013)Google Scholar
  21. 21.
    Neal, R.M.: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 9(2), 249–265 (2000)MathSciNetGoogle Scholar
  22. 22.
    Nie, Y., Han, Y., Huang, J., Jiao, B., Li, A.: Attention-based encoder-decoder model for answer selection in question answering. Front. Inf. Technol. Electron. Eng. 18(4), 535–544 (2017)CrossRefGoogle Scholar
  23. 23.
    Rangrej, A., Kulkarni, S., Tendulkar, A.V.: Comparative study of clustering techniques for short text documents. In: Proceedings of the 20th International Conference Companion on World Wide Web, WWW 2011, pp. 111–112. ACM, New York (2011)Google Scholar
  24. 24.
    Shafiei, M.M., Milios, E.E.: Latent Dirichlet co-clustering. In: Sixth International Conference on Data Mining (ICDM 2006), pp. 542–551, December 2006Google Scholar
  25. 25.
    Wang, F., Zhang, C., Li, T.: Regularized clustering for documents. In: SIGIR 2007, pp. 95–102. ACM, New York (2007)Google Scholar
  26. 26.
    Xun, G., Li, Y., Zhao, W.X., Gao, J., Zhang, A.: A correlated topic model using word embeddings. In: IJCAI 2017, pp. 4207–4213 (2017)Google Scholar
  27. 27.
    Yin, J., Wang, J.: A model-based approach for text clustering with outlier detection. In: 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 625–636, May 2016Google Scholar
  28. 28.
    Yin, J., Wang, J.: A Dirichlet multinomial mixture model-based approach for short text clustering. In: KDD 2014, pp. 233–242. ACM, New York (2014)Google Scholar
  29. 29.
    Yu, G., Huang, R., Wang, Z.: Document clustering via Dirichlet process mixture model with feature selection. In: KDD 2010, pp. 763–772. ACM, New York (2010)Google Scholar
  30. 30.
    Zhang, H., Li, Y., Ma, F., Gao, J., Su, L.: Texttruth: an unsupervised approach to discover trustworthy information from multi-sourced text data. In: KDD 2018, pp. 2729–2737. ACM, New York (2018).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Tiehang Duan
    • 1
    Email author
  • Qi Lou
    • 2
  • Sargur N. Srihari
    • 1
  • Xiaohui Xie
    • 2
  1. 1.Department of Computer Science and EngineeringState University of New York at BuffaloBuffaloUSA
  2. 2.Department of Computer ScienceUniversity of California, IrvineIrvineUSA

Personalised recommendations