LDA and LSI as a Dimensionality Reduction Method in Arabic Document Classification

  • Conference paper
Information and Software Technologies (ICIST 2015)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 538)

Abstract

In this work, we present an experimental study comparing two dimensionality reduction approaches and verifying their effectiveness in Arabic document classification. First, we apply latent Dirichlet allocation (LDA) and latent semantic indexing (LSI) to model our document set, OATC (Open Arabic Tunisian Corpus), which contains 20,000 documents collected from Tunisian newspapers. This yields two matrices: LDA (documents/topics) and LSI (documents/topics). We then classify the documents with the SVM algorithm, which is known as an efficient method for text mining. Classification results are evaluated by precision, recall, and F-measure on the OATC corpus (70% training set and 30% testing set). Our experiments show that dimensionality reduction via LDA outperforms LSI in Arabic topic classification.
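The evaluation protocol named in the abstract (a 70/30 split, then precision, recall, and F-measure) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the fixed random seed, the per-class scoring, and the category labels in the usage note are all assumptions made for illustration, and the abstract does not say how the split was drawn or how scores were averaged.

```python
import random
from collections import Counter


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of `items` and cut it into (train, test) lists.

    The seed and the random shuffle are assumptions; the abstract only
    states the 70 % / 30 % proportions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f_measure)} for each class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            true_pos[g] += 1
        else:
            false_pos[p] += 1  # predicted label p, but it was wrong
            false_neg[g] += 1  # gold label g was missed

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = true_pos[label]
        predicted_count = tp + false_pos[label]
        gold_count = tp + false_neg[label]
        precision = tp / predicted_count if predicted_count else 0.0
        recall = tp / gold_count if gold_count else 0.0
        denom = precision + recall
        # F-measure here is the harmonic mean of precision and recall (F1).
        f_measure = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f_measure)
    return scores
```

As a usage example with hypothetical labels, calling `per_class_scores(["sport", "sport", "politics"], ["sport", "politics", "politics"])` scores each category separately, and `train_test_split(range(20000))` returns 14,000 training indices and 6,000 test indices for a corpus of the size reported above.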

Notes

  1. http://www.attounissia.com.tn/

  2. http://www.alchourouk.com/

  3. http://www.assabahnews.tn/

  4. http://jomhouria.com/

Author information

Corresponding author

Correspondence to Rami Ayadi.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Ayadi, R., Maraoui, M., Zrigui, M. (2015). LDA and LSI as a Dimensionality Reduction Method in Arabic Document Classification. In: Dregvaite, G., Damasevicius, R. (eds) Information and Software Technologies. ICIST 2015. Communications in Computer and Information Science, vol 538. Springer, Cham. https://doi.org/10.1007/978-3-319-24770-0_42

  • DOI: https://doi.org/10.1007/978-3-319-24770-0_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24769-4

  • Online ISBN: 978-3-319-24770-0

  • eBook Packages: Computer Science, Computer Science (R0)