In recent decades, a growing number of employers and job seekers have relied on Web resources to advertise and find jobs. If appropriately retrieved and analyzed, the huge number of job vacancies now available on online job portals can provide detailed and valuable information about Web Labor Market dynamics and trends. In particular, this information can be useful to all actors, public and private, who play a role in the European Labor Market. This paper presents WoLMIS, a system that collects multilingual Web job vacancies and automatically classifies them with respect to a standard taxonomy of occupations. The proposed system has been developed for Cedefop, the European agency that supports the development of European Vocational Education and Training (VET) policies and contributes to their implementation. WoLMIS allows analysts and Labor Market specialists to make sense of Labor Market dynamics and trends in several European countries by overcoming linguistic boundaries across national borders. A detailed experimental evaluation is also provided on a set of about 2 million job vacancies, collected from UK and Irish Web job sites from June to September 2015.
The Commission Communication “New Skills for New Jobs” (COM(2008) 868, 16.12.2008)
The Commission Communication “An Agenda for new skills and jobs: A European contribution toward full employment” (COM(2010) 682, 23.11.2010)
The Commission Communication “A New Skills Agenda for Europe” COM(2016) 381/2, available at https://goo.gl/Shw7bI
Real-time Labor Market information on skill requirements: feasibility study and working prototype. Cedefop Reference number AO/RPA/VKVET-NSOFRO/Real-time LMI/010/14. Contract notice 2014/S 141-252026 of 15/07/2014
For more information on SOC2000, the interested reader can refer to SOC2000 (2016).
The previously cited extension of the Standard Occupational Classification (SOC) system developed by the U.S. Bureau of Labor Statistics.
The market in which workers find employment, employers find available workers, and wage rates are determined.
The European Network on Regional Labor Market Monitoring (ENRLMM 2016).
The European classification system for economic sectors, see http://ec.europa.eu/eurostat/statistics-explained/index.php/Glossary:Statistical_classification_of_economic_activities_in_the_European_Community_(NACE)
Generally speaking, an n-gram is a sequence of n consecutive words.
The visiting frequency was tuned for each Web site taking into account the publishing rate, the average time an advertisement is kept online, and suggestions from the webmasters who agreed to collaborate with the project.
Actually, there are some vacancies, mostly looking for language teachers.
According to (ISCO 2012), “Water and firewood collectors” gather water and firewood and transport them on foot or using hand or animal carts.
Also known as weighted averaging.
The lower quartile is the 25th percentile while the upper quartile is the 75th percentile.
For an updated list, see https://ec.europa.eu/esco/portal/escopedia/ESCO_languages
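To illustrate the note on n-grams above, the following is a minimal Python sketch of extracting word n-grams from a job title. The function name word_ngrams, the lowercasing, and the whitespace tokenization are assumptions made for illustration only; they are not taken from the WoLMIS implementation.

```python
from typing import List, Tuple


def word_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Return all sequences of n consecutive words in text.

    Hypothetical helper: tokenization is a plain whitespace split
    on the lowercased text, chosen only to keep the sketch short.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    tokens = text.lower().split()
    # Slide a window of width n over the token list; if the text has
    # fewer than n tokens the range is empty and no n-gram is produced.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Example: bigrams (n = 2) of an illustrative job title.
bigrams = word_ngrams("Senior Java software developer", 2)
```

With this sketch, a four-word title yields three bigrams, and a title shorter than n yields an empty list.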
Amato, F., Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M., Moscato, V., Persia, F., & Picariello, A. (2015). Challenge: processing web texts for classifying job offers. In 2015 IEEE international conference on semantic computing (ICSC) (pp. 460–463). https://doi.org/10.1109/ICOSC.2015.7050852.
Amato, F., Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M., Moscato, V., Persia, F., & Picariello, A. (2015). Classification of job advertisements: a case study. In 23rd Italian symposium on advanced database systems, SEBD 2015, Gaeta, Italy, June 14-17, 2015 (pp. 144–151). http://dblp.uni-trier.de/rec/bib/conf/sebd/AmatoBCMMMPP15.
Andrews, S., Gibson, H., Domdouzis, K., & Akhgar, B. (2016). Creating corroborated crisis reports from social media data through formal concept analysis. Journal of Intelligent Information Systems, 47(2), 287–312. https://doi.org/10.1007/s10844-016-0404-9.
Beblavý, M., Fabo, B., & Lenaerts, K. (2016). Skills requirements for the 30 most-frequently advertised occupations in the United States: an analysis based on online vacancy data. Tech. Rep. 132, Centre for European Policy Studies (CEPS). http://ssrn.com/abstract=2749549.
Bifet, A., & Frank, E. (2010). Sentiment knowledge discovery in Twitter streaming data. In International conference on discovery science (pp. 1–15). Springer.
Boselli, R., Cesarini, M., Mercorio, F., & Mezzanzanica, M. (2014). Planning meets data cleansing. In The 24th international conference on automated planning and scheduling (ICAPS) (pp. 439–443). http://www.aaai.org/ocs/index.php/ICAPS/ICAPS14/paper/view/7898.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324.
Califf, M.E. (1998). Relational learning techniques for natural language information extraction. Ph.D. thesis University of Texas at Austin.
Califf, M.E., & Mooney, R.J. (1999). Relational learning of pattern-match rules for information extraction. In AAAI/IAAI (pp. 328–334).
Carnevale, A.P., Jayasundera, T., & Repnikov, D. (2014). Understanding online job ads data: a technical report. Tech. rep., Georgetown University, McCourt School on Public Policy, Center on Education and the Workforce. https://cew.georgetown.edu/wp-content/uploads/2014/11/OCLM.Tech_.Web_.pdf.
Ceci, M., & Malerba, D. (2007). Classifying web documents in a hierarchy of categories: a comprehensive study. Journal of Intelligent Information Systems, 28(1), 37–78.
Cesarini, M., Mezzanzanica, M., & Fugini, M. (2007). Analysis-sensitive conversion of administrative data into statistical information systems. Journal of Cases on Information Technology, 9(4), 57–81.
Chang, C.C., & Lin, C.J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.
Crowther, P.S., & Cox, R.J. (2005). A method for optimal division of data sets for use in neural networks. In Khosla, R., Howlett, R.J., & Jain, L.C. (Eds.) 9th International conference on knowledge-based intelligent information and engineering systems, KES 2005, Melbourne, Australia, September 14-16, 2005, Proceedings, Part IV (pp. 1–7). Berlin: Springer. https://doi.org/10.1007/11554028_1
Elias, P., & Purcell, K. (2004). SOC (HE): a classification of occupations for studying the graduate labour market. Tech. rep., Institute for Employment Research, University of Warwick, Coventry, UK. http://www2.warwick.ac.uk/fac/soc/ier/research/completed/7yrs2/rp6.pdf.
ENRLMM (2016). The European network on regional labour market monitoring. http://www.regionallabourmarketmonitoring.net/. Visited on 2016-11-11.
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., & Lin, C.J. (2008). LIBLINEAR: a library for large linear classification. The Journal of Machine Learning Research, 9(Aug), 1871–1874.
Freitag, D., & Kushmerick, N. (2000). Boosted wrapper induction. In AAAI/IAAI (pp. 577–583).
Haykin, S. (1999). Neural networks: a comprehensive foundation. Upper Saddle River: Prentice Hall.
Hong, W., Zheng, S., & Wang, H. (2013). Dynamic user profile-based job recommender system. In 2013 8th international conference on computer science & education (ICCSE) (pp. 1499–1503). IEEE.
Hsu, C.W., Chang, C.C., & Lin, C.J. (2003). A practical guide to support vector classification. Tech. rep., Department of Computer Science and Information Engineering, National Taiwan University. https://www.cs.sfu.ca/people/Faculty/teaching/726/spring11/svmguide.pdf.
ISCO (2012). International standard classification of Occupations. Visited on 2016-11-11.
Jain, A.K., Mao, J., & Mohiuddin, K.M. (1996). Artificial neural networks: a tutorial. IEEE Computer, 29(3), 31–44.
Javed, F., McNair, M., Jacob, F., & Zhao, M. (2016). Towards a job title classification system. arXiv:1606.00917.
Jindal, N., & Liu, B. (2008). Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining (pp. 219–230): ACM.
Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. In Nédellec, C., & Rouveirol, C. (Eds.) Machine Learning: ECML-98, Lecture Notes in Computer Science (Vol. 1398 pp. 137–142). Berlin: Springer. https://doi.org/10.1007/BFb0026683.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv:1607.01759.
Kanan, T., & Fox, E.A. (2016). Automated Arabic text classification with p-stemmer, machine learning, and a tailored news article taxonomy. JASIST, 67(11), 2667–2683. https://doi.org/10.1002/asi.23609.
Kessler, R., Torres-Moreno, J.M., & El-Bèze, M. (2007). E-gen: automatic job offer processing system for human resources. In Mexican international conference on artificial intelligence (pp. 985–995). Springer.
Koperwas, J., Skonieczny, Ł., Kozłowski, M., Andruszkiewicz, P., Rybiński, H., & Struk, W. (2016). Intelligent information processing for building university knowledge base. Journal of Intelligent Information Systems, 48, 141–163.
Kureková, L.M., Beblavý, M., & Thum-Thysen, A. (2015). Using online vacancies and web surveys to analyse the labour market: a methodological inquiry. IZA Journal of Labor Economics, 4(1), 1–20. https://doi.org/10.1186/s40172-015-0034-4.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the eighteenth international conference on machine learning, ICML (Vol. 1 pp. 282–289).
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Lee, I. (2011). Modeling the benefit of e-recruiting process integration. Decision Support Systems, 51(1), 230–239.
Lembo, D., Torlone, R., & Marella, A. (Eds.) (2015). In 23rd Italian symposium on advanced database systems, SEBD 2015, Gaeta, Italy, June 14-17, 2015. Curran Associates, Inc. ISBN: 978-1-5108-1087-7. http://dblp.uni-trier.de/rec/bib/conf/sebd/2015.
LFS (2016). Labour force survey. http://ec.europa.eu/eurostat/web/microdata/european-union-labour-force-survey Visited on 2016-11-11.
Lippmann, R. (1987). An introduction to computing with neural nets. IEEE Assp Magazine, 4(2), 4–22.
Marrara, S., Pasi, G., Viviani, M., Cesarini, M., Mercorio, F., Mezzanzanica, M., & Pappagallo, M. (2017). A language modelling approach for discovering novel labour market occupations from the web. In Proceedings of the international conference on web intelligence, Leipzig, Germany, August 23–26, 2017 (pp. 1026–1034). http://dblp.uni-trier.de/rec/bib/conf/webi/MarraraPVCMMP17, http://doi.acm.org/10.1145/3106426.3109035.
Mezzanzanica, M., Boselli, R., Cesarini, M., & Mercorio, F. (2012). Data quality sensitivity analysis on aggregate indicators. In Helfert, M., Francalanci, C., & Filipe, J. (Eds.) Proceedings of the international conference on data technologies and applications, data 2012 (pp. 97–108). INSTICC. https://doi.org/10.5220/0004040300970108.
Mezzanzanica, M., Boselli, R., Cesarini, M., & Mercorio, F. (2015). A model-based evaluation of data quality activities in KDD. Information Processing & Management, 51(2), 144–166. https://doi.org/10.1016/j.ipm.2014.07.007 http://www.sciencedirect.com/science/article/pii/S0306457314000673.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).
Mooney, R.J., & Bunescu, R. (2005). Mining knowledge from text using information extraction. SIGKDD Explorations Newsletter, 7(1), 3–10. https://doi.org/10.1145/1089815.1089817.
Müller, K. R., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to Kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201.
Nahm, U.Y., & Mooney, R.J. (2001). Mining soft-matching rules from textual data. In Proceedings of the 17th international joint conference on artificial intelligence (Vol. 2 pp. 979–984). Morgan Kaufmann Publishers Inc.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on empirical methods in natural language processing (Vol. 10 pp. 79–86). Association for Computational Linguistics.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Perea-Ortega, J.M., Martín-Valdivia, M.T., López, L.A.U., & Martínez-Cámara, E. (2013). Improving polarity classification of bilingual parallel corpora combining machine learning and semantic orientation approaches. JASIST, 64(9), 1864–1877. https://doi.org/10.1002/asi.22884.
Poch, M., Bel, N., Espeja, S., & Navío, F. (2014). Ranking job offers for candidates: learning hidden knowledge from big data. In Language resources and evaluation conference.
Samuelson, P.A. (1974). Remembrances of Frisch. European Economic Review, 5(1), 7–23.
Sayfullina, L., Malmi, E., Liao, Y., & Jung, A. (2017). Domain adaptation for resume classification using convolutional neural networks. arXiv:1707.05576.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), 1–47.
Segel, E., & Heer, J. (2010). Narrative visualization: telling stories with data. IEEE Transactions on Visualization and Computer Graphics, 16(6), 1139–1148.
Sheth, A.P., Ngonga, A., Wang, Y., Chang, E., Slezak, D., Franczyk, B., Alt, R., Tao, X., & Unland, R. (Eds.) (2017). In Proceedings of the international conference on web intelligence, Leipzig, Germany, August 23–26, 2017. ACM. ISBN: 978-1-4503-4951-2.
Singh, A., Rose, C., Visweswariah, K., Chenthamarakshan, V., & Kambhatla, N. (2010). Prospect: a system for screening candidates for recruitment. In Proceedings of the 19th ACM international conference on information and knowledge management (pp. 659–668). ACM.
SOC2000 (2016). http://www.ons.gov.uk/ons/guide-method/classifications/archived-standard-classifications/standard-occupational-classification-2000/index.html. Visited on 2016-11-11.
Sun, Q., Amin, M., Yan, B., Martell, C., Markman, V., Bhasin, A., & Ye, J. (2015). Transfer learning for bilingual content classification. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 2147–2156). ACM.
Tang, D., Qin, B., & Liu, T. (2015). Document modeling with gated recurrent neural network for sentiment classification. In EMNLP (pp. 1422–1432).
Turian, J., Ratinov, L., & Bengio, Y. (2010). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics (pp. 384–394). Association for Computational Linguistics.
Vilares, D., Alonso, M.A., & Gómez-Rodríguez, C. (2015). On the usefulness of lexical and syntactic processing in polarity classification of Twitter messages. JASIST, 66(9), 1799–1816. https://doi.org/10.1002/asi.23284.
Viviani, M., & Pasi, G. (2017). Credibility in social media: opinions, news, and health information - a survey. WIREs Data Mining and Knowledge Discovery. https://doi.org/10.1002/widm.1209.
Xu, H., Gu, C., Zhou, H., & Zhang, J. (2017). arXiv:1705.06123.
Yang, Y., & Pedersen, J.O. (1997). A comparative study on feature selection in text categorization. In ICML, (Vol. 97 pp. 412–420).
Yi, X., Allan, J., & Croft, W.B. (2007). Matching resumes and jobs based on relevance models. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (pp. 809–810). ACM.
Zhu, C., Zhu, H., Xiong, H., Ding, P., & Xie, F. (2016). Recruitment market trend analysis with sequential latent variable models. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’16. (pp. 383–392). New York: ACM. https://doi.org/10.1145/2939672.2939689
Zubiaga, A., Spina, D., Martínez-Unanue, R., & Fresno, V. (2015). Real-time classification of Twitter trends. JASIST, 66(3), 462–473. https://doi.org/10.1002/asi.23186.
This work was supported by the Cedefop agency as part of the project “Real-time Labor Market information on skill requirements: feasibility study and working prototype”. Cedefop Reference number AO/RPA/VKVET-NSOFRO/Real-time LMI/010/14. Contract notice 2014/S 141-252026 of 15/07/2014.
Cite this article
Boselli, R., Cesarini, M., Marrara, S. et al. WoLMIS: a labor market intelligence system for classifying web job vacancies. J Intell Inf Syst 51, 477–502 (2018). https://doi.org/10.1007/s10844-017-0488-x
- Labor market intelligence
- Text classification
- Machine learning
- Knowledge discovery
- Information systems