Experimental Comparison of Unsupervised Approaches in the Task of Separating Specializations Within Professions in Job Vacancies

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1119)


This article presents an unsupervised approach to analyzing labor market requirements that addresses the problem of discovering latent specializations within broadly defined professions. For instance, for the profession of “programmer”, such specializations could be “CNC programmer”, “mobile developer”, “frontend developer”, and so on. Several statistical methods of text vector representation were evaluated experimentally: TF-IDF, probabilistic topic modeling, neural language models based on distributional semantics (word2vec, fastText), and deep contextualized word representations (ELMo and multilingual BERT). Both pre-trained models and models trained on texts of job vacancies in Russian were investigated. The experiments were conducted on a dataset provided by online recruitment platforms. Several clustering methods were tested: k-means, Affinity Propagation, BIRCH, agglomerative clustering, and HDBSCAN. When the number of clusters was predetermined (k-means, agglomerative clustering), the best result was achieved by ARTM. However, when the number of clusters was not specified ahead of time, word2vec trained on our job vacancy dataset outperformed the other models. Overall, the models trained on our corpus performed much better than pre-trained models, even those with large multilingual vocabularies.
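The simplest configuration of such a pipeline, TF-IDF vectors clustered with k-means, can be sketched as follows. This is an illustrative sketch only: the toy English vacancy snippets, the pure-Python TF-IDF weighting, and the deterministic spherical k-means seeding are assumptions for the example, not the paper's actual data or implementation (which works on Russian vacancy texts and also evaluates word2vec, fastText, ELMo, BERT, and topic-model representations).

```python
import math
from collections import Counter

# Toy vacancy snippets (hypothetical; the paper's real dataset comes from
# online recruitment platforms and is in Russian).
docs = [
    "develop android mobile applications kotlin",
    "mobile developer ios swift applications",
    "frontend developer javascript html css",
    "javascript frontend react html",
    "cnc machine programmer gcode milling",
    "cnc programmer lathe gcode",
]

def tfidf(corpus):
    """Represent each document as a sparse dict: term -> TF-IDF weight."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc.split()))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc.split()).items()}
        for doc in corpus
    ]

def cos(a, b):
    """Cosine similarity between two sparse vectors."""
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return num / (na * nb) if na and nb else 0.0

def kmeans(vecs, k, iters=10):
    """Spherical k-means, deterministically seeded with evenly spaced docs."""
    step = max(1, len(vecs) // k)
    centroids = [dict(v) for v in vecs[::step][:k]]
    labels = [0] * len(vecs)
    for _ in range(iters):
        # Assign each document to the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda c: cos(v, centroids[c])) for v in vecs]
        # Recompute each centroid as the mean of its members' term weights.
        for c in range(k):
            members = [v for v, lab in zip(vecs, labels) if lab == c]
            if members:  # keep the old centroid if the cluster is empty
                merged = Counter()
                for v in members:
                    merged.update(v)
                centroids[c] = {t: w / len(members) for t, w in merged.items()}
    return labels

labels = kmeans(tfidf(docs), k=3)
# The mobile, frontend, and CNC snippets end up in three separate clusters,
# mirroring the latent specializations within the "programmer" profession.
```

Swapping the vectorization step for word2vec or BERT embeddings, or the clustering step for HDBSCAN when the number of specializations is unknown, reproduces the other configurations compared in the paper.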


Keywords: Natural language processing · Vector space model · Word embedding · Topic models · Clustering methods · Neural language model



The reported study was partially funded by RFBR according to the research project No. 18-47-860013 “Intelligent system for the formation of educational programs based on neural network models of natural language to meet the requirements of the digital economy”.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Chelyabinsk State University, Chelyabinsk, Russia
