Text Classification Based on Topic Modeling and Chi-square

  • Yujia Sun
  • Jan Platoš
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1107)


This paper compares two topic modeling algorithms, Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI), and one feature selection algorithm, chi-square, for extracting feature words from news text. After feature extraction, three classifiers (Logistic Regression, Naive Bayes, and SVM) are compared on news classification. In the tests, the combination of LSI and Logistic Regression gives the best results among the evaluated combinations, with a precision of 96% and a recall of 95%.
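As an illustration of the chi-square feature selection step named above, the following minimal sketch ranks words by the standard chi-square statistic of a 2x2 contingency table (word present or absent, document inside or outside a target class). The function name `chi_square_scores`, the whitespace tokenization, and the toy corpus are assumptions made for this example; the paper's own preprocessing, data set, and exact formulation are not given in this text.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each word by the chi-square statistic of its 2x2 table:
    (word present/absent) x (document in `target` class or not)."""
    n = len(docs)
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    token_sets = [set(doc.lower().split()) for doc in docs]
    n_pos = sum(1 for label in labels if label == target)
    n_neg = n - n_pos

    df_pos = Counter()  # documents of the target class containing the word
    df_all = Counter()  # documents of any class containing the word
    for tokens, label in zip(token_sets, labels):
        df_all.update(tokens)
        if label == target:
            df_pos.update(tokens)

    scores = {}
    for word, df in df_all.items():
        a = df_pos[word]  # in class, contains word
        b = df - a        # not in class, contains word
        c = n_pos - a     # in class, lacks word
        d = n_neg - b     # not in class, lacks word
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # chi2 = N * (AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)); 0 if a margin is empty
        scores[word] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Hypothetical toy corpus, not the paper's news data set.
docs = [
    "team wins the match",
    "team loses the final match",
    "market rises on the news",
    "market falls as shares drop",
]
labels = ["sport", "sport", "finance", "finance"]

scores = chi_square_scores(docs, labels, target="sport")
# Highest score first; ties are broken alphabetically for a stable order.
top_words = sorted(scores, key=lambda w: (-scores[w], w))[:3]
```

Because the statistic measures dependence in either direction, a word that appears only outside the target class (here `market`) scores as high as one that appears only inside it (`team`, `match`), while a word spread across classes (`the`) scores low.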


Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  1. Technical University of Ostrava, Ostrava-Poruba, Czech Republic
  2. Hebei GEO University, Shijiazhuang, China
