WoLMIS: a labor market intelligence system for classifying web job vacancies


Over the last decades, a growing number of employers and job seekers have relied on Web resources to get in touch and to find a job. If appropriately retrieved and analyzed, the huge number of job vacancies now available on online job portals can provide detailed and valuable information about Web Labor Market dynamics and trends. This information can be useful to all actors, public and private, who play a role in the European Labor Market. This paper presents WoLMIS, a system that collects multilingual Web job vacancies and automatically classifies them with respect to a standard taxonomy of occupations. The system was developed for Cedefop, the European agency that supports the development of European Vocational Education and Training (VET) policies and contributes to their implementation. WoLMIS allows analysts and Labor Market specialists to make sense of Labor Market dynamics and trends in several European countries by overcoming linguistic boundaries across national borders. A detailed experimental evaluation is also provided on a set of about 2 million job vacancies collected from UK and Irish Web job sites between June and September 2015.


  1. The Commission Communication “New Skills for New Jobs” (COM(2008) 868, 16.12.2008).

  2. The Commission Communication “An Agenda for new skills and jobs: A European contribution toward full employment” (COM(2010) 682, 23.11.2010).

  4. The Commission Communication “A New Skills Agenda for Europe” COM(2016) 381/2, available at

  6. Real-time Labor Market information on skill requirements: feasibility study and working prototype. Cedefop Reference number AO/RPA/VKVET-NSOFRO/Real-time LMI/010/14. Contract notice 2014/S 141-252026 of 15/07/2014.

  8. For more information on SOC2000, see SOC2000 (2016).

  19. The previously cited extension of the Standard Occupational Classification (SOC) system developed by the U.S. Bureau of Labor Statistics.

  20. As Table 4 in Section 5.2 will show, the most representative 10% of title words are enough to achieve 80% classification accuracy. Nevertheless, the table shows that the best performance is achieved using all the title words.

  21. The market in which workers find employment, employers find available workers, and wage rates are determined.

  23. The European Network on Regional Labor Market Monitoring (ENRLMM 2016).

  24. The European classification system for economic sectors, see

  25. Generally speaking, an n-gram is a sequence of n consecutive words.

  26. The visiting frequency was tuned for each Web site, taking into account the publishing rate, the average time an advertisement is kept online, and suggestions from the webmasters who agreed to collaborate with the project.

  27. Actually, there are some vacancies, mostly looking for language teachers.

  28. According to ISCO (2012), “Water and firewood collectors” gather water and firewood and transport them on foot or using hand or animal carts.

  29. sklearn.svm.LinearSVC is a wrapper around the liblinear library (Fan et al. 2008), while sklearn.svm.SVC is a wrapper around the libsvm library (Chang & Lin 2011).

  30. Also known as weighted averaging.

  31. A 3-layer neural network (with one hidden layer) can properly address linear classification problems (Jain et al. 1996; Lippmann 1987).

  32. The lower quartile is the 25th percentile, while the upper quartile is the 75th percentile.

  36. For an updated list, see
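
The n-gram notion defined in the notes above (n consecutive words) can be sketched in a few lines of Python. This is only an illustration, not code from WoLMIS: the whitespace tokenizer, the helper names `word_ngrams` and `ngram_counts`, and the sample vacancy title are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Tuple


def word_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive words in text, in reading order."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Naive lowercasing and whitespace tokenization (an assumption for this sketch).
    words = text.lower().split()
    # A text with fewer than n words yields an empty list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(text: str, max_n: int = 2) -> Counter:
    """Count all n-grams of length 1..max_n, giving a simple bag-of-n-grams."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


title = "Senior Java Software Developer"  # hypothetical vacancy title
print(word_ngrams(title, 2))
print(ngram_counts(title, max_n=2).most_common(3))
```

Counts of this kind are one common way to turn a short text such as a vacancy title into features for a linear classifier; whether WoLMIS builds its features exactly this way is not stated here.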


  1. Amato, F., Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M., Moscato, V., Persia, F., & Picariello, A. (2015). Challenge: processing web texts for classifying job offers. In 2015 IEEE international conference on semantic computing (ICSC) (pp. 460–463).

  2. Amato, F., Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M., Moscato, V., Persia, F., & Picariello, A. (2015). Classification of job advertisements: a case study. In 23rd Italian symposium on advanced database systems, SEBD 2015, Gaeta, Italy, June 14-17, 2015 (pp. 144–151).

  3. Andrews, S., Gibson, H., Domdouzis, K., & Akhgar, B. (2016). Creating corroborated crisis reports from social media data through formal concept analysis. Journal of Intelligent Information Systems, 47(2), 287–312.

  4. Beblavý, M., Fabo, B., & Lenaerts, K. (2016). Skills requirements for the 30 most-frequently advertised occupations in the United States: an analysis based on online vacancy data. Tech. Rep. 132, Centre for European Policy Studies (CEPS).

  5. Bifet, A., & Frank, E. (2010). Sentiment knowledge discovery in Twitter streaming data. In International conference on discovery science (pp. 1–15). Springer.

  6. Boselli, R., Cesarini, M., Mercorio, F., & Mezzanzanica, M. (2014). Planning meets data cleansing. In The 24th international conference on automated planning and scheduling (ICAPS) (pp. 439–443).

  7. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

  8. Califf, M.E. (1998). Relational learning techniques for natural language information extraction. Ph.D. thesis University of Texas at Austin.

  9. Califf, M.E., & Mooney, R.J. (1999). Relational learning of pattern-match rules for information extraction. In AAAI/IAAI (pp. 328–334).

  10. Carnevale, A.P., Jayasundera, T., & Repnikov, D. (2014). Understanding online job ads data: a technical report. Tech. rep., Georgetown University, McCourt School on Public Policy, Center on Education and the Workforce.

  11. Ceci, M., & Malerba, D. (2007). Classifying web documents in a hierarchy of categories: a comprehensive study. Journal of Intelligent Information Systems, 28(1), 37–78.

  12. Cesarini, M., Mezzanzanica, M., & Fugini, M. (2007). Analysis-sensitive conversion of administrative data into statistical information systems. Journal of Cases on Information Technology, 9(4), 57–81.

  13. Chang, C.C., & Lin, C.J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.

  14. Crowther, P.S., & Cox, R.J. (2005). A method for optimal division of data sets for use in neural networks. In Khosla, R., Howlett, R.J., & Jain, L.C. (Eds.) 9th International conference on knowledge-based intelligent information and engineering systems, KES 2005, Melbourne, Australia, September 14-16, 2005, Proceedings, Part IV (pp. 1–7). Berlin: Springer.

  15. Elias, P., & Purcell, K. (2004). SOC (HE): a classification of occupations for studying the graduate labour market. Tech. rep., Institute for Employment Research, University of Warwick, Coventry, UK.

  16. ENRLMM (2016). The European network on regional labour market monitoring. Visited on 2016-11-11.

  17. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., & Lin, C.J. (2008). LIBLINEAR: a library for large linear classification. The Journal of Machine Learning Research, 9(Aug), 1871–1874.

  18. Freitag, D., & Kushmerick, N. (2000). Boosted wrapper induction. In AAAI/IAAI (pp. 577–583).

  19. Haykin, S. (1999). A comprehensive foundation of neural networks. Upper Saddle River: Prentice Hall.

  20. Hong, W., Zheng, S., & Wang, H. (2013). Dynamic user profile-based job recommender system. In 2013 8th international conference on computer science & education (ICCSE) (pp. 1499–1503). IEEE.

  21. Hsu, C.W., Chang, C.C., & Lin, C.J. (2003). A practical guide to support vector classification. Tech. rep., Department of Computer Science and Information Engineering, National Taiwan University.

  22. ISCO (2012). International standard classification of occupations. Visited on 2016-11-11.

  23. Jain, A.K., Mao, J., & Mohiuddin, K.M. (1996). Artificial neural networks: a tutorial. IEEE Computer, 29(3), 31–44.

  24. Javed, F., McNair, M., Jacob, F., & Zhao, M. (2016). Towards a job title classification system. arXiv:1606.00917.

  25. Jindal, N., & Liu, B. (2008). Opinion spam and analysis. In Proceedings of the 2008 international conference on web search and data mining (pp. 219–230). ACM.

  26. Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. In Nédellec, C., & Rouveirol, C. (Eds.) Machine Learning: ECML-98, Lecture Notes in Computer Science (Vol. 1398 pp. 137–142). Berlin: Springer.

  27. Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv:1607.01759.

  28. Kanan, T., & Fox, E.A. (2016). Automated Arabic text classification with p-stemmer, machine learning, and a tailored news article taxonomy. JASIST, 67(11), 2667–2683.

  29. Kessler, R., Torres-Moreno, J.M., & El-Bèze, M. (2007). E-gen: automatic job offer processing system for human resources. In Mexican international conference on artificial intelligence (pp. 985–995). Springer.

  30. Koperwas, J., Skonieczny, Ł., Kozłowski, M., Andruszkiewicz, P., Rybiński, H., & Struk, W. (2016). Intelligent information processing for building university knowledge base. Journal of Intelligent Information Systems, 48, 141–163.

  31. Kureková, L. M., Beblavý, M., & Thum-Thysen, A. (2015). Using online vacancies and web surveys to analyse the labour market: a methodological inquiry. IZA Journal of Labor Economics, 4(1), 1–20.

  32. Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the eighteenth international conference on machine learning, ICML (Vol. 1 pp. 282–289).

  33. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

  34. Lee, I. (2011). Modeling the benefit of e-recruiting process integration. Decision Support Systems, 51(1), 230–239.

  35. Lembo, D., Torlone, R., & Marella, A. (Eds.) (2015). In 23rd Italian symposium on advanced database systems, SEBD 2015, Gaeta, Italy, June 14-17, 2015. Curran Associates, Inc. ISBN: 978-1-5108-1087-7.

  36. LFS (2016). Labour force survey. Visited on 2016-11-11.

  37. Lippmann, R. (1987). An introduction to computing with neural nets. IEEE Assp Magazine, 4(2), 4–22.

  38. Marrara, S., Pasi, G., Viviani, M., Cesarini, M., Mercorio, F., Mezzanzanica, M., & Pappagallo, M. (2017). A language modelling approach for discovering novel labour market occupations from the web. In Proceedings of the international conference on web intelligence, Leipzig, Germany, August 23–26, 2017 (pp. 1026–1034).

  39. Mezzanzanica, M., Boselli, R., Cesarini, M., & Mercorio, F. (2012). Data quality sensitivity analysis on aggregate indicators. In Helfert, M., Francalanci, C., & Filipe, J. (Eds.) Proceedings of the international conference on data technologies and applications, data 2012 (pp. 97–108). INSTICC.

  40. Mezzanzanica, M., Boselli, R., Cesarini, M., & Mercorio, F. (2015). A model-based evaluation of data quality activities in KDD. Information Processing & Management, 51(2), 144–166.

  41. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

  42. Mooney, R.J., & Bunescu, R. (2005). Mining knowledge from text using information extraction. SIGKDD Explorations Newsletter, 7(1), 3–10.

  43. Müller, K. R., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to Kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201.

  44. Nahm, U.Y., & Mooney, R.J. (2001). Mining soft-matching rules from textual data. In Proceedings of the 17th international joint conference on artificial intelligence (Vol. 2 pp. 979–984). Morgan Kaufmann Publishers Inc.

  45. Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on empirical methods in natural language processing (Vol. 10 pp. 79–86). Association for Computational Linguistics.

  46. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

  47. Perea-Ortega, J.M., Martín-Valdivia, M.T., López, L.A.U., & Martínez-Cámara, E. (2013). Improving polarity classification of bilingual parallel corpora combining machine learning and semantic orientation approaches. JASIST, 64(9), 1864–1877.

  48. Poch, M., Bel, N., Espeja, S., & Navío, F. (2014). Ranking job offers for candidates: learning hidden knowledge from big data. In Language resources and evaluation conference.

  49. Samuelson, P.A. (1974). Remembrances of Frisch. European Economic Review, 5(1), 7–23.

  50. Sayfullina, L., Malmi, E., Liao, Y., & Jung, A. (2017). Domain adaptation for resume classification using convolutional neural networks. arXiv:1707.05576.

  51. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), 1–47.

  52. Segel, E., & Heer, J. (2010). Narrative visualization: telling stories with data. IEEE Transactions on Visualization and Computer Graphics, 16(6), 1139–1148.

  53. Sheth, A.P., Ngonga, A., Wang, Y., Chang, E., Slezak, D., Franczyk, B., Alt, R., Tao, X., & Unland, R. (Eds.) (2017). In Proceedings of the international conference on web intelligence, Leipzig, Germany, August 23–26, 2017. ACM. ISBN: 978-1-4503-4951-2.

  54. Singh, A., Rose, C., Visweswariah, K., Chenthamarakshan, V., & Kambhatla, N. (2010). Prospect: a system for screening candidates for recruitment. In Proceedings of the 19th ACM international conference on information and knowledge management (pp. 659–668). ACM.

  55. SOC2000 (2016). Visited on 2016-11-11.

  56. Sun, Q., Amin, M., Yan, B., Martell, C., Markman, V., Bhasin, A., & Ye, J. (2015). Transfer learning for bilingual content classification. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 2147–2156). ACM.

  57. Tang, D., Qin, B., & Liu, T. (2015). Document modeling with gated recurrent neural network for sentiment classification. In EMNLP (pp. 1422–1432).

  58. Turian, J., Ratinov, L., & Bengio, Y. (2010). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics (pp. 384–394). Association for Computational Linguistics.

  59. Vilares, D., Alonso, M.A., & Gómez-Rodríguez, C. (2015). On the usefulness of lexical and syntactic processing in polarity classification of Twitter messages. JASIST, 66(9), 1799–1816.

  60. Viviani, M., & Pasi, G. (2017). Credibility in social media: opinions, news, and health information - a survey. WIREs Data Mining and Knowledge Discovery.

  61. Xu, H., Gu, C., Zhou, H., & Zhang, J. (2017). arXiv:1705.06123.

  62. Yang, Y., & Pedersen, J.O. (1997). A comparative study on feature selection in text categorization. In ICML, (Vol. 97 pp. 412–420).

  63. Yi, X., Allan, J., & Croft, W.B. (2007). Matching resumes and jobs based on relevance models. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (pp. 809–810). ACM.

  64. Zhu, C., Zhu, H., Xiong, H., Ding, P., & Xie, F. (2016). Recruitment market trend analysis with sequential latent variable models. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’16. (pp. 383–392). New York: ACM.

  65. Zubiaga, A., Spina, D., Martínez-Unanue, R., & Fresno, V. (2015). Real-time classification of Twitter trends. JASIST, 66(3), 462–473.



This work was supported by the Cedefop agency as part of the project “Real-time Labor Market information on skill requirements: feasibility study and working prototype”. Cedefop Reference number AO/RPA/VKVET-NSOFRO/Real-time LMI/010/14. Contract notice 2014/S 141-252026 of 15/07/2014.

Correspondence to Fabio Mercorio or Gabriella Pasi.

Boselli, R., Cesarini, M., Marrara, S. et al. WoLMIS: a labor market intelligence system for classifying web job vacancies. J Intell Inf Syst 51, 477–502 (2018).


  • Labor market intelligence
  • Text classification
  • Machine learning
  • Knowledge discovery
  • Information systems