Experimental Comparison of Unsupervised Approaches in the Task of Separating Specializations Within Professions in Job Vacancies
This article presents an unsupervised approach to analyzing labor market requirements that addresses the problem of discovering latent specializations within broadly defined professions. For instance, for the profession of “programmer”, such specializations could be “CNC programmer”, “mobile developer”, “frontend developer”, and so on. Several methods of representing texts as vectors were experimentally evaluated: TF-IDF, probabilistic topic modeling, neural language models based on distributional semantics (word2vec, fastText), and deep contextualized word representations (ELMo and multilingual BERT). Both pre-trained models and models trained on Russian-language job vacancy texts were investigated. The experiments were conducted on a dataset provided by online recruitment platforms. Several clustering methods were tested: K-means, Affinity Propagation, BIRCH, agglomerative clustering, and HDBSCAN. When the number of clusters was fixed in advance (K-means, agglomerative clustering), the best result was achieved by ARTM (additive regularization of topic models). When the number of clusters was not specified ahead of time, word2vec trained on our job vacancy dataset outperformed the other models. Models trained on our corpora performed much better than pre-trained models, even those with large multilingual vocabularies.
Keywords: Natural language processing, Vector space model, Word embedding, Topic models, Clustering methods, Neural language model
The reported study was partially funded by RFBR under research project No. 18-47-860013, “Intelligent system for the formation of educational programs based on neural network models of natural language to meet the requirements of the digital economy”.
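The abstract describes a two-stage pipeline: vacancy texts are first turned into vectors and then clustered. It does not give implementation details, so the following is only a minimal sketch of the simplest combination it names, TF-IDF vectors followed by K-means with a predetermined number of clusters. The toy vacancy texts, the function names, the smoothed IDF formula, and the farthest-point initialization are all assumptions made for illustration; the study itself worked with Russian-language vacancies and also evaluated topic models and neural embeddings, which are not shown here.

```python
import math
from collections import Counter

# Hypothetical toy vacancies, invented for illustration (not from the study's dataset).
vacancies = [
    "frontend developer javascript react css html",
    "frontend developer javascript vue css layout",
    "cnc programmer machine tools milling gcode",
    "cnc programmer lathe machine gcode operator",
    "mobile developer android kotlin java apps",
    "mobile developer ios swift android apps",
]


def tfidf_vectors(texts):
    """Return L2-normalized TF-IDF vectors as dense lists of floats."""
    docs = [Counter(text.lower().split()) for text in texts]
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in docs))) + 1
        for term in vocab
    }
    vectors = []
    for doc in docs:
        vec = [doc[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=100):
    """Plain K-means with a deterministic farthest-point initialization."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Pick the vector farthest from its nearest already-chosen centroid.
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


labels = kmeans(tfidf_vectors(vacancies), k=3)
for text, label in zip(vacancies, labels):
    print(label, text)
```

On this toy input the vacancies that share most of their vocabulary end up in the same cluster. For methods that do not take the number of clusters as input, such as Affinity Propagation or HDBSCAN, only the clustering stage would need to be swapped.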