Abstract
The Dirichlet Compound Multinomial (DCM), the composition of the Dirichlet and the multinomial distributions, is a widely accepted generative model for text documents that accounts for burstiness. However, recent research has shown that the Dirichlet is not the best choice of prior for the multinomial. In this paper, we propose a novel model, the Multinomial Scaled Dirichlet (MSD) distribution, which is the composition of the scaled Dirichlet distribution and the multinomial. Moreover, we investigate Expectation Maximization (EM) with an MSD mixture model as a new clustering algorithm for documents. Experiments show that the new model is competitive with the best state-of-the-art methods on different text data sets.
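To make the baseline concrete, the following is a minimal sketch of the log-probability of a word-count vector under the standard DCM (also known as the Polya distribution), obtained by integrating the multinomial parameter out against a Dirichlet prior. It is not the MSD model proposed in the paper, and the function name `dcm_log_pmf` and its arguments are hypothetical names chosen for illustration.

```python
import math


def dcm_log_pmf(counts, alpha):
    """Log-probability of a count vector under the standard DCM (Polya) model.

    counts: non-negative integer word counts for one document.
    alpha:  positive Dirichlet parameters, one per vocabulary word.
    Both names are illustrative assumptions, not taken from the paper.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)        # document length
    a = sum(alpha)         # total Dirichlet concentration
    # Multinomial coefficient: n! / prod(x_w!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Ratio of Dirichlet normalisers: Gamma(a) / Gamma(n + a)
    log_norm = math.lgamma(a) - math.lgamma(n + a)
    # Per-word terms: Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_terms = sum(
        math.lgamma(x + al) - math.lgamma(al) for x, al in zip(counts, alpha)
    )
    return log_coef + log_norm + log_terms


# With small alpha, repeating one word is more likely than using two
# different words once each, which is the burstiness effect.
bursty = math.exp(dcm_log_pmf([2, 0], [0.1, 0.1]))
spread = math.exp(dcm_log_pmf([1, 1], [0.1, 0.1]))
```

Working in log space with `math.lgamma` avoids overflow for realistic document lengths. In this two-word example, `bursty` exceeds `spread`, which illustrates why a compound model of this kind can capture repeated occurrences of the same word.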
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Zamzami, N., Bouguila, N. (2018). Text Modeling Using Multinomial Scaled Dirichlet Distributions. In: Mouhoub, M., Sadaoui, S., Ait Mohamed, O., Ali, M. (eds) Recent Trends and Future Technology in Applied Intelligence. IEA/AIE 2018. Lecture Notes in Computer Science, vol 10868. Springer, Cham. https://doi.org/10.1007/978-3-319-92058-0_7
DOI: https://doi.org/10.1007/978-3-319-92058-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92057-3
Online ISBN: 978-3-319-92058-0
eBook Packages: Computer Science (R0)