Feature Extraction in Subject Classification of Text Documents in Polish

  • Tomasz Walkowiak
  • Szymon Datko
  • Henryk Maciejewski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10842)

Abstract

In this work we evaluate two methods for deriving features for subject classification of text documents. The first method uses the standard Bag-of-Words (BoW) approach, which represents each document as a vector of frequencies of selected terms appearing in it. This method relies heavily on natural language processing (NLP) tools to preprocess the text in a grammar- and inflection-aware way. The second approach is based on the word-embedding technique recently proposed by Mikolov and does not require any NLP preprocessing. In this method words are represented as vectors in a continuous space, and this representation is used to construct the feature vectors of the documents. We evaluate these fundamentally different approaches on the task of classifying Polish-language Wikipedia articles into 34 subject areas. Our study suggests that the word-embedding based features seem to outperform the standard NLP-based features, provided that a sufficiently large training dataset is available.
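The two feature types can be illustrated with a minimal Python sketch. It is not the pipeline used in the paper: the function names, the toy tokens, and the two-dimensional word vectors are invented for illustration, and averaging word vectors is assumed here as one common way to build a document vector from word embeddings. The BoW part also assumes the tokens have already been normalized (for example lemmatized) by external NLP tools.

```python
from collections import Counter


def bow_features(tokens, vocabulary):
    """Relative frequency of each vocabulary term in one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for an empty document
    return [counts[term] / total for term in vocabulary]


def embedding_features(tokens, word_vectors, dim):
    """Average of the vectors of those tokens that have an embedding.

    Averaging is an assumption of this sketch, not a detail taken from the paper.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy data, invented for illustration only.
doc = ["kot", "pies", "kot", "dom"]
vocabulary = ["kot", "pies", "ryba"]
word_vectors = {"kot": [1.0, 0.0], "pies": [0.0, 1.0]}

bow = bow_features(doc, vocabulary)
emb = embedding_features(doc, word_vectors, dim=2)
print(bow)
print(emb)
```

Either vector could then be passed to any standard classifier. The BoW vector has one dimension per selected term, while the embedding vector has a fixed length that does not depend on the vocabulary size.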

Keywords

Text mining, Subject classification, Bag of words, Word embedding, fastText

Notes

Acknowledgement

This work was sponsored by the National Science Centre, Poland (grant 2016/21/B/ST6/02159).

References

  1. Eder, M., Piasecki, M., Walkowiak, T.: An open stylometric system based on multilevel text analysis. Cogn. Stud. | Etudes Cogn. (17) (2017). https://doi.org/10.11649/cs.1430
  2. Goodman, J.: Classes for fast maximum entropy training. In: Proceedings of 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.01CH37221), vol. 1, pp. 561–564 (2001). https://doi.org/10.1109/ICASSP.2001.940893
  3. Harris, Z.: Distributional structure. Word (1954)
  4. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Short Papers, vol. 2, pp. 427–431. Association for Computational Linguistics (2017). http://aclweb.org/anthology/E17-2068
  5. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2009)
  6. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013). http://arxiv.org/abs/1301.3781
  7. Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751. Association for Computational Linguistics, Atlanta, June 2013. http://www.aclweb.org/anthology/N13-1090
  8. Młynarczyk, K., Piasecki, M.: Wiki test - 34 categories (2015). http://hdl.handle.net/11321/217. CLARIN-PL digital repository
  9. Młynarczyk, K., Piasecki, M.: Wiki train - 34 categories (2015). http://hdl.handle.net/11321/222. CLARIN-PL digital repository
  10. Radziszewski, A.: A tiered CRF tagger for Polish. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds.) Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence, vol. 467, pp. 215–230. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35647-6_16
  11. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5), 513–523 (1988)
  12. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1986)
  13. Torkkola, K.: Discriminative features for text document classification. Formal Pattern Anal. Appl. 6(4), 301–308 (2004). https://doi.org/10.1007/s10044-003-0196-8
  14. Walkowiak, T.: Language processing modelling notation - orchestration of NLP microservices. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds.) DepCoS-RELCOMEX 2017. AISC, pp. 464–473. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-59415-6_44
  15. Walkowiak, T., Malak, P.: Polish texts topic classification evaluation. In: Proceedings of the 10th International Conference on Agents and Artificial Intelligence, ICAART 2018, vol. 2, pp. 515–522. INSTICC, SciTePress (2018)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Tomasz Walkowiak (1)
  • Szymon Datko (1)
  • Henryk Maciejewski (1)
  1. Faculty of Electronics, Wrocław University of Science and Technology, Wrocław, Poland