Clustering with Probabilistic Topic Models on Arabic Texts

Part of the Studies in Computational Intelligence book series (SCI, volume 488)


Probabilistic topic models such as LDA (Latent Dirichlet Allocation) have recently been widely applied to text mining tasks such as retrieval, summarization, and clustering in many languages. In this paper we present a first comparative study of LDA and K-means, two well-known methods for topic identification and clustering respectively, applied to Arabic texts. Our aim is to compare how the morpho-syntactic characteristics of the Arabic language influence the performance of the first method relative to the second. To study different aspects of these methods, the experiments are conducted on a benchmark document collection, and clustering quality is measured with two well-known evaluation measures, F-measure and Entropy. The results show that LDA outperforms K-means in most cases.
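The two evaluation measures named above can be illustrated with a minimal sketch. The abstract does not give the formulas, so this assumes the definitions commonly used in the document clustering literature: entropy as the size-weighted average of the class entropy inside each cluster (lower is better), and F-measure as the class-size-weighted best F-score of each class over all clusters (higher is better). The function names and the toy labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def clustering_entropy(labels, clusters):
    """Size-weighted average entropy (in bits) of the class mix in each cluster.

    Assumed definition: E = sum_j (n_j / n) * (-sum_i p_ij * log2 p_ij),
    where p_ij is the fraction of cluster j that belongs to class i.
    """
    n = len(labels)
    cluster_sizes = Counter(clusters)
    pair_counts = Counter(zip(clusters, labels))
    total = 0.0
    for (cluster, _label), n_ij in pair_counts.items():
        p = n_ij / cluster_sizes[cluster]
        total -= (cluster_sizes[cluster] / n) * p * math.log2(p)
    return total


def clustering_f_measure(labels, clusters):
    """Class-size-weighted best F-score of each class over all clusters.

    Assumed definition: F = sum_i (n_i / n) * max_j F(i, j), with
    precision = n_ij / n_j and recall = n_ij / n_i.
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    pair_counts = Counter(zip(clusters, labels))
    best = {}  # best F-score found so far for each class
    for (cluster, label), n_ij in pair_counts.items():
        precision = n_ij / cluster_sizes[cluster]
        recall = n_ij / class_sizes[label]
        f_score = 2 * precision * recall / (precision + recall)
        best[label] = max(best.get(label, 0.0), f_score)
    return sum(class_sizes[c] / n * best[c] for c in class_sizes)


# Hypothetical toy data: four documents, two true categories, two clusters.
true_labels = ["sport", "sport", "economy", "economy"]
assigned_clusters = [0, 0, 1, 1]
print(clustering_entropy(true_labels, assigned_clusters))
print(clustering_f_measure(true_labels, assigned_clusters))
```

Both functions take only the true category of each document and the cluster it was assigned to, so under these assumptions the same scoring could be applied to K-means output and to an LDA-based assignment (for example, each document's most probable topic).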


Keywords: clustering, topic identification, Arabic text, LDA, K-means, preprocessing, stemming





Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  1. Computer Sciences Department, University of May 08, 1945, Guelma, Algeria
  2. LRI Laboratory, Computer Sciences Department, University of Badji Mokhtar, Annaba, Algeria
