The Benefit of Document Embedding in Unsupervised Document Classification

  • Jaromír NovotnýEmail author
  • Pavel Ircing
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11096)


The aim of this article is to show that the document embedding using the doc2vec algorithm can substantially improve the performance of the standard method for unsupervised document classification – the K-means clustering. We have performed rather extensive set of experiments on one English and two Czech datasets and the results suggest that representing the documents using vectors generated by the doc2vec algorithm brings a consistent improvement across languages and datasets. The English dataset – 20NewsGroups – was processed in a way that allows direct comparison with the results of both supervised and unsupervised algorithms published previously. Such comparison is provided in the paper, together with the results of supervised classification achieved by the state-of-the-art SVM classifier.


Document embedding Doc2vec Classification K-means SVM 



This research was supported by the Ministry of Education, Youth and Sports of the Czech Republic project No. LO1506.


  1. 1.
    Chinniyan, K., Gangadharan, S., Sabanaikam, K.: Semantic similarity based web document classification using support vector machine. Int. Arab J. Inf. Technol. (IAJIT) 14(3), 285–292 (2017)Google Scholar
  2. 2.
    Hamdi, A., Voerman, J., Coustaty, M., Joseph, A., d’Andecy, V.P., Ogier, J.M.: Machine learning vs deterministic rule-based system for document stream segmentation. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 5, pp. 77–82. IEEE (2017)Google Scholar
  3. 3.
    Jiang, M., et al.: Text classification based on deep belief network and softmax regression. Neural Comput. Appl. 29(1), 61–70 (2018)CrossRefGoogle Scholar
  4. 4.
    Lau, J.H., Baldwin, T.: An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368 (2016)
  5. 5.
    Liu, Y., Liu, Z., Chua, T.S., Sun, M.: Topical word embeddings. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2418–2424 (2015)Google Scholar
  6. 6.
    MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: 5-th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)Google Scholar
  7. 7.
    Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguist. 3, 299–313 (2015)Google Scholar
  8. 8.
    Novotný, J., Ircing, P.: Unsupervised document classification and topic detection. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 748–756. Springer, Cham (2017). Scholar
  9. 9.
    Pedregosa, F.: Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011). http://scikit-learn.orgMathSciNetzbMATHGoogle Scholar
  10. 10.
    Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50 (2010).
  11. 11.
    Siolas, G., d’Alche Buc, F.: Support vector machines based on a semantic kernel for text categorization. In: IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN), vol. 5, pp. 205–209 (2000)Google Scholar
  12. 12.
    Slonim, N., Friedman, N., Tishby, N.: Unsupervised document classification using sequential information maximization. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 129–136 (2002)Google Scholar
  13. 13.
    Straková, J., Straka, M., Hajič, J.: Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 13–18 (2014)Google Scholar
  14. 14.
    Švec, J., et al.: General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes. Lang. Resour. Eval. 48(2), 227–248 (2014). Scholar
  15. 15.
    Trieu, L.Q., Tran, H.Q., Tran, M.T.: News classification from social media using twitter-based doc2vec model and automatic query expansion. In: Proceedings of the Eighth International Symposium on Information and Communication Technology, pp. 460–467. ACM (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.Faculty of Applied Sciences, Department of CyberneticsThe University of West BohemiaPlzeňCzech Republic

Personalised recommendations