Abstract
Workers' healthcare has gained considerable attention recently, as many countries are increasingly concerned about welfare. This paper addresses the problem of predicting occupational disease risks by means of computational intelligence and pattern recognition techniques. Specifically, three machine learning approaches are compared. The first is based on the k-means algorithm, which determines a set of meaningful labelled clusters that serve as the final model. The other two are based on fully supervised techniques, namely Support Vector Machines and K-Nearest Neighbours. Real data describing both the worker and the workplace, mixing numerical and categorical attributes, have been used for testing. The three approaches are automatically tuned by means of genetic algorithms, which simultaneously search for the optimal hyperparameters of the classification systems and the optimal weights of an ad-hoc dissimilarity measure, so as to maximize classification performance. Computational results show that the three approaches are rather comparable in terms of performance, but the clustering-based approach allows a deeper knowledge discovery phase, helpful for further risk assessment and forecasting.
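To make the idea of a weighted dissimilarity over mixed attributes concrete, the following is a minimal sketch, not the measure used in the paper. It assumes numerical attributes are pre-scaled to [0, 1] and compared by absolute difference, while categorical attributes are compared by simple mismatch. The feature layout (age, years of exposure, job code), the toy records, and the function names `mixed_dissimilarity` and `knn_predict` are hypothetical. The weights are fixed by hand here, whereas the abstract states that they are tuned by a genetic algorithm.

```python
from collections import Counter


def mixed_dissimilarity(a, b, weights, is_categorical):
    """Weighted dissimilarity between two records with mixed attributes.

    Assumed form (not taken from the paper): numerical attributes are
    pre-scaled to [0, 1] and contribute their absolute difference;
    categorical attributes contribute 0 on a match and 1 otherwise.
    Each contribution is multiplied by its attribute weight.
    """
    total = 0.0
    for x, y, w, cat in zip(a, b, weights, is_categorical):
        d = (0.0 if x == y else 1.0) if cat else abs(x - y)
        total += w * d
    return total


def knn_predict(query, train_x, train_y, weights, is_categorical, k=3):
    """Majority vote among the k training records closest to `query`."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: mixed_dissimilarity(
            query, pair[0], weights, is_categorical
        ),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy records: (scaled age, scaled years of exposure, job code)
train_x = [
    (0.2, 0.1, "clerk"),
    (0.3, 0.2, "clerk"),
    (0.7, 0.8, "assembler"),
    (0.8, 0.9, "assembler"),
]
train_y = ["healthy", "healthy", "sick", "sick"]
is_categorical = (False, False, True)

# Hand-picked weights for illustration; a genetic algorithm would search
# over these values together with k.
weights = (0.5, 1.0, 1.0)

prediction = knn_predict(
    (0.75, 0.85, "assembler"), train_x, train_y, weights, is_categorical, k=3
)
```

Setting a weight to zero removes the corresponding attribute from the comparison, which is why optimizing the weights can also act as a form of feature selection.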
Notes
Any pre-screening system able to minimize false positives and/or false negatives might have a major impact on (public) costs.
An Italian version of the classification system ISCO (International Standard Classification of Occupations).
An Italian version of the classification system NACE (Nomenclature des Activités Économiques dans la Communauté Européenne).
Recall that the clustering approach also works in a One-Against-All fashion. Accordingly, the most popular cluster amongst those labelled as ‘sick’ (i.e. having carpal tunnel syndrome) and the most popular cluster amongst those labelled as ‘healthy’ (i.e. not having carpal tunnel syndrome) have been selected.
Given the classification speed, any constraint on prediction running time can be considered automatically satisfied.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical standard
This work does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Di Noia, A., Martino, A., Montanari, P. et al. Supervised machine learning techniques and genetic optimization for occupational diseases risk prediction. Soft Comput 24, 4393–4406 (2020). https://doi.org/10.1007/s00500-019-04200-2
DOI: https://doi.org/10.1007/s00500-019-04200-2