Abstract
Topic modeling automatically infers the hidden themes in a collection of documents. Existing techniques fall broadly into three categories: algebraic, probabilistic, and neural. In this paper, we compare six models on an Arabic dataset: LDA, NMF, CTM, ETM, and two BERTopic variants. We evaluate them on topic coherence, topic diversity, and computational cost. The results show that, among the presented models, the neural BERTopic model with a RoBERTa-based sentence transformer achieved the highest coherence score (0.1147), which is 36% above BERTopic with AraBERT (the second best in coherence). At the same time, its topic diversity is 6% lower than that of the CTM model (the second best in diversity), and this comes at the cost of doubling the computation time.
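The abstract names topic diversity as a metric without defining it. As a minimal sketch, assuming one common definition (the fraction of unique words among the top-k words of all topics), it could be computed as follows. The function name `topic_diversity`, the parameter `top_k`, and the toy topics are illustrative assumptions, not taken from the paper.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    Assumed definition: 1.0 means no word is shared between topics;
    values near 0 mean the topics largely repeat the same words.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("no topic words supplied")
    return len(set(top_words)) / len(top_words)


# Hypothetical toy topics (English placeholders, not the paper's data).
toy_topics = [
    ["economy", "bank", "market", "price"],
    ["match", "team", "league", "goal"],
    ["market", "price", "oil", "export"],
]

diversity = topic_diversity(toy_topics, top_k=4)
print(f"topic diversity: {diversity:.3f}")
```

Here two of the twelve listed words ("market" and "price") are repeats, so ten unique words remain and the score falls below 1.0. Other definitions of diversity exist, and the paper may use a different one.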
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Abdelrazek, A., Medhat, W., Gawish, E., Hassan, A. (2022). Topic Modeling on Arabic Language Dataset: Comparative Study. In: Fournier-Viger, P., et al. Advances in Model and Data Engineering in the Digitalization Era. MEDI 2022. Communications in Computer and Information Science, vol 1751. Springer, Cham. https://doi.org/10.1007/978-3-031-23119-3_5
DOI: https://doi.org/10.1007/978-3-031-23119-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23118-6
Online ISBN: 978-3-031-23119-3
eBook Packages: Computer Science, Computer Science (R0)