Supervised machine learning techniques and genetic optimization for occupational diseases risk prediction

  • Methodologies and Application
  • Soft Computing

Abstract

Workers' healthcare has gained considerable attention recently, as many countries are increasingly concerned about welfare. This paper addresses the problem of predicting occupational disease risk by means of computational intelligence and pattern recognition techniques. Specifically, three machine learning approaches are compared. The first is based on the k-means algorithm, which determines a set of meaningful labelled clusters that serve as the final model. The other two are fully supervised techniques, namely Support Vector Machines and K-Nearest Neighbours. Real data describing both the worker and the workplace, mixing numerical and categorical attributes, have been used for testing. The three approaches are automatically tuned by genetic algorithms, which simultaneously search for the hyperparameters of the classification systems and the weights of an ad-hoc dissimilarity measure that maximize classification performance. Computational results show that the three approaches are broadly comparable in terms of performance, but the clustering-based approach allows a deeper knowledge discovery phase, helpful for further risk assessment and forecasting.
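
To illustrate the joint tuning the abstract describes, the following is a minimal sketch, not the authors' implementation. It assumes a simple weighted dissimilarity (absolute difference for numeric attributes pre-scaled to [0, 1], 0/1 mismatch for categorical ones), a K-Nearest Neighbours classifier, plain validation accuracy as the fitness, and a toy genetic algorithm whose chromosome holds the hyperparameter k followed by one weight per attribute. The actual dissimilarity measure, fitness function and genetic operators are not given in this excerpt; all function names, parameter values and the toy records below are hypothetical.

```python
import random

def dissimilarity(a, b, weights, numeric_mask):
    """Weighted sum of per-attribute dissimilarities for mixed records.

    Numeric attributes (assumed pre-scaled to [0, 1]) contribute |a - b|;
    categorical attributes contribute 0 on a match and 1 on a mismatch.
    """
    total = 0.0
    for x, y, w, is_num in zip(a, b, weights, numeric_mask):
        total += w * (abs(x - y) if is_num else float(x != y))
    return total

def knn_predict(query, train_X, train_y, k, weights, numeric_mask):
    """Majority vote among the k training records closest to `query`."""
    ranked = sorted(range(len(train_X)),
                    key=lambda i: dissimilarity(query, train_X[i], weights, numeric_mask))
    votes = [train_y[i] for i in ranked[:k]]
    # Sorting the candidate labels makes tie-breaking deterministic.
    return max(sorted(set(votes)), key=votes.count)

def fitness(chrom, train, valid, numeric_mask):
    """Validation accuracy of k-NN under the chromosome [k, w_1, ..., w_n]."""
    k, weights = chrom[0], chrom[1:]
    (train_X, train_y), (valid_X, valid_y) = train, valid
    hits = sum(knn_predict(x, train_X, train_y, k, weights, numeric_mask) == y
               for x, y in zip(valid_X, valid_y))
    return hits / len(valid_y)

def evolve(train, valid, numeric_mask, k_choices=(1, 3, 5),
           pop_size=20, generations=15, seed=0):
    """Toy GA: elitism, size-3 tournaments, uniform crossover, mutation."""
    rng = random.Random(seed)
    n = len(numeric_mask)
    pop = [[rng.choice(k_choices)] + [rng.random() for _ in range(n)]
           for _ in range(pop_size)]

    def rank(population):
        scored = [(fitness(c, train, valid, numeric_mask), c) for c in population]
        return sorted(scored, key=lambda t: t[0], reverse=True)

    for _ in range(generations):
        scored = rank(pop)
        nxt = [c for _, c in scored[:2]]  # elitism: keep the two best
        while len(nxt) < pop_size:
            p1 = max(rng.sample(scored, 3), key=lambda t: t[0])[1]
            p2 = max(rng.sample(scored, 3), key=lambda t: t[0])[1]
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
            if rng.random() < 0.2:  # mutate the hyperparameter gene
                child[0] = rng.choice(k_choices)
            for j in range(1, n + 1):  # mutate weight genes, clipped to [0, 1]
                if rng.random() < 0.2:
                    child[j] = min(1.0, max(0.0, child[j] + rng.gauss(0.0, 0.2)))
            nxt.append(child)
        pop = nxt
    best_score, best = rank(pop)[0]
    return best, best_score

# Hypothetical toy records: [scaled numeric attribute, categorical attribute].
NUMERIC_MASK = [True, False]
TRAIN = ([[0.1, "job_a"], [0.8, "job_a"], [0.5, "job_a"],
          [0.2, "job_b"], [0.9, "job_b"], [0.4, "job_b"]],
         ["healthy", "healthy", "healthy", "sick", "sick", "sick"])
VALID = ([[0.7, "job_a"], [0.3, "job_b"]], ["healthy", "sick"])

best, accuracy = evolve(TRAIN, VALID, NUMERIC_MASK)
print("k =", best[0], "weights =", [round(w, 2) for w in best[1:]], "accuracy =", accuracy)
```

Encoding the hyperparameter and the weights in a single chromosome lets one fitness evaluation score both at once, which is the sense in which the tuning is simultaneous. The evolved weights can also be read as a rough indication of which attributes drive the classification.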

Notes

  1. Any pre-screening system able to minimize false positives and/or false negatives might have a major impact on (public) costs.

  2. An Italian version of the classification system ISCO (International Standard Classification of Occupations).

  3. An Italian version of the classification system NACE (Nomenclature des Activités Économiques dans la Communauté Européenne).

  4. Recall that the clustering approach also works in a One-Against-All fashion; thus, the most popular cluster amongst those labelled as ‘sick’ (i.e. having carpal tunnel syndrome) and the most popular cluster amongst those labelled as ‘healthy’ (i.e. not having carpal tunnel syndrome) have been selected.

  5. Given the classification speed, any constraint on prediction running time can be regarded as automatically satisfied.
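
The cluster selection described in note 4 could be sketched as follows. This is only an illustration: it assumes that "most popular" means the cluster containing the most patterns, and the function name and input layout (one cluster id per pattern, plus a mapping from cluster id to its label) are hypothetical.

```python
from collections import Counter

def most_populated_cluster_per_label(assignments, cluster_labels):
    """Return, for each label, the (cluster id, size) of its largest cluster.

    assignments: cluster id assigned to each pattern.
    cluster_labels: mapping from cluster id to that cluster's label.
    Ties are resolved in favour of the cluster seen first in `assignments`.
    """
    sizes = Counter(assignments)
    best = {}
    for cluster_id, size in sizes.items():
        label = cluster_labels[cluster_id]
        if label not in best or size > best[label][1]:
            best[label] = (cluster_id, size)
    return best

# Toy example: clusters 0 and 1 are labelled 'sick', clusters 2 and 3 'healthy'.
selected = most_populated_cluster_per_label(
    [0, 0, 0, 1, 2, 2, 3, 3, 3, 3],
    {0: "sick", 1: "sick", 2: "healthy", 3: "healthy"},
)
print(selected)  # {'sick': (0, 3), 'healthy': (3, 4)}
```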

Author information

Corresponding author

Correspondence to Alessio Martino.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical standard

This work does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Di Noia, A., Martino, A., Montanari, P. et al. Supervised machine learning techniques and genetic optimization for occupational diseases risk prediction. Soft Comput 24, 4393–4406 (2020). https://doi.org/10.1007/s00500-019-04200-2

  • DOI: https://doi.org/10.1007/s00500-019-04200-2
