1 Introduction

According to a report by the International Labour Organization (ILO), about 2.3 million people die globally every year because of occupational incidents and diseases, including roughly 0.36 million fatalities caused by occupational incidents alone [1]. Nearly 4% of total gross domestic product is lost to occupational incidents [2]. In Europe, the European Statistical Office (EUROSTAT) reports that about 3.2% of workers in the European Union face an incident at work [3]. Behind each incident, there is a chain of multiple factors interacting with each other in a specific pattern. If the pattern is identified, the incident outcomes can be predicted, and once the outcomes are predicted, the occurrence of incidents can be minimized. A predictive model plays a key role in such circumstances by identifying the inherent patterns and subsequently predicting the outcomes. Therefore, the use of a predictive model is of utmost importance in incident analysis and prevention.

In practice, once an incident has taken place, safety professionals narrate the incident in their own words and log it into an electronic database. Therefore, the exactness of the incident record mostly depends on the experience and writing quality of the personnel who log the information into the system. It is often found that incident narratives, which are in the form of unstructured text, remain underutilized or sometimes unutilized, since properly exploiting these unstructured text data for information retrieval demands extensive human effort. Reviewing incident narratives during an investigation is extremely time-consuming. In addition, narratives are sometimes written in such a way that useful information can hardly be retrieved, and thus analyzed, effectively. In fact, analysis of this kind of data is so difficult that the information inherent in incident texts is mostly ignored, which may result in biased decision-making. On top of that, the huge amount of unstructured text collected at the industry level, along with categorical, numerical, and other forms of data, makes decision-making challenging, particularly in terms of prediction. To resolve this issue, a number of practitioners and academic researchers have put considerable effort into employing machine learning (ML) algorithms for the prediction of incidents. In ML, two kinds of approaches are predominantly used for classification: (i) non-tree-based approaches, such as support vector machine (SVM) [4], k-nearest neighbor (k-NN) [5], artificial neural network (ANN) [6, 7], and Bayesian network (BN) [8], and (ii) tree-based approaches, for example, decision tree (DT).

To exemplify, Sorock et al. [9] used 3,686 insurance claims for the analysis of crash incidents. In their experiment, keywords from the accident narratives were used to identify the types of pre-crash vehicle activities and the types of crash incidents. Lehto & Sorock [10] used a Bayesian model to perform a similar kind of work by identifying pre-crash activities and crash types from incident narratives. The experiment showed that the model could learn from a computer search for 63 key terms pertaining to incident categories. Wellman et al. [11] used a fuzzy Bayesian model to classify injury narratives into 13 external causes of injury and poisoning categories. In a similar vein, Noorinaeini & Lehto [12] used two singular value decomposition (SVD)-Bayesian models and one SVD-regression model to classify injury narratives into external causes of injury and poisoning categories. Their experiments showed that all three models were capable of learning from human knowledge for classification. In 2007, a notable study by Pons-Porrata et al. [13] presented a topic discovery system based on a new incremental hierarchical clustering algorithm and Testor Theory to extract and classify the implicit knowledge in news streams. Experimental results showed its usefulness and effectiveness not only in topic detection, but also as a classification and summarization tool. Brooks [14] used SAS Text Miner to mine the free text of workers' compensation claims and classify it into two categories. The experimental results suggested that text mining can be used as a stand-alone tool for free-text analysis. Fan & Li [15] used text mining to retrieve historical cases from a case library. They showed that natural language-based case document retrieval is superior to case-based reasoning and more practical for implementation on construction sites. Abdat et al. [16] used a Bayesian network-based model to extract recurrent occupational accident with movement disturbance (OAMD) scenarios from narratives. Using this approach, a total of eight scenarios were extracted to describe 143 OAMDs in the construction and metallurgy sectors. In 2014, Sanchez-Pi et al. [17] applied ontology-based automatic text classification to unstructured texts in the oil and gas industry. Their proposed approach included text analysis, recognition, and classification of failed occupational health controls. Later, in 2016, they extended their ontological concept and made it more domain-dependent for the oil and gas industry [18]. Goh & Ubeynarayana [19] used text mining classification techniques to classify a total of 1000 publicly available construction accident narratives into two categories, accident and near-miss cases. They employed six machine learning algorithms, namely k-nearest neighbor (KNN), support vector machine (SVM), linear regression (LR), decision tree (DT), random forest (RF), and Naive Bayes (NB). Experimental results showed that SVM was the best algorithm in the classification of 251 cases, and that unigram tokenization with linear SVM performed the best. Zhang et al. [20] used Deep Belief Network (DBN) and Long Short-Term Memory (LSTM) methods on three million accident-related tweets to classify traffic accidents. The experiments showed that DBN outperforms SVM and supervised latent Dirichlet allocation (sLDA).
Song & Suh [21] performed patent analysis using latent Dirichlet allocation (LDA) to extract the latent topics and main keywords contained in documents, together with network analysis for monitoring change patterns and relations, in order to identify trends in technology development that prevent the risks of various industrial systems. Apart from these, LDA has been used in different areas, including pattern extraction from OSHA databases [22], construction reports [23], and investigation reports generated from manufacturing plants [24, 25]. All such reports are prepared in natural language. Therefore, natural language processing (NLP) is an essential task in accident analysis for the extraction of useful information hidden in texts. Brown [26] used NLP to identify the contributors to rail accidents from accident narratives and implemented random forest (RF) to check the predictive power of the contributors toward accident occurrences. In addition, Nenonen [27] also mentioned in his study that useful information may be obtained if injury narratives or incident reports are analyzed properly. Moreover, there are a few more interesting applications of DT in the refinery industry [28], the petrochemical industry [29], railways [30, 31], roads [32], and the aviation industry [33]. In summary, unstructured texts are a very important source of information; however, they often remain under-utilized or sometimes unutilized.

These algorithms have been used effectively in different application domains, including shipbuilding [6], mining [34], construction [34], and service [35]. Of them, ANN is found to be very effective due to its inherent features, for example, its ability to learn from data, parallel operation, distributed memory, and fault tolerance. It is used in a wide spectrum of application domains. For example, ANN was successfully used with the backward algorithm (BA-ANN) in the prediction of outbursts of coal and gas by He et al. [36]. It was also used to develop an advanced detection system for predicting the rating of workers' health in hot and humid conditions on construction sites [37]. Approaches like ANN and SVM have attained widespread popularity since they hold a strong theoretical underpinning, enabling them to deal with the complexity of the problem, learn from historical information, and, more importantly, adapt within a nonparametric framework. However, the interpretation of SVM and ANN models is rather difficult.

Although the aforementioned ML algorithms, including ANN [25], DT [38], and SVM [39], have been used frequently in accident analysis, these algorithms are only capable of effectively utilizing the existing attributes of the dataset for the prediction of accident occurrences. Nevertheless, there exist a number of hidden attributes or factors within the dataset, in different forms of data, which can hardly be retrieved and used by conventional ML approaches [40]. For instance, hidden attributes can be characterized as ones that require linear or nonlinear transformation. Deep learning, in this case, can be a better choice for researchers to obtain information from the data, including attributes present in either explicit or hidden form. The approach utilizes a number of hidden layers to extract the hidden factors underlying the data to better predict incident outcomes [40]. A deep neural network (DNN) works extremely well in this case (in particular for NLP-based analysis) [41, 42], as it offers the joint advantages of efficient generation of new features from raw data and accurate classification of feature vectors [43].

DNN, proposed in the early 1980s [44] and revamped in 2002 [45], initially faced training difficulties in deep architectures. Later, it was used in a broad spectrum of application areas, including fraud detection [40], dynamic planning of public bicycle-sharing systems [46], time series prediction [47], Spark-based computation [48, 49], pattern recognition [50], speech recognition [51], classification [52, 53], image processing [54, 55], and video processing [41]. The main characteristic of this approach is that it can show better classification performance when analyzing complex and large amounts of data [56,57,58]. More importantly, due to its basic structure, it can perform superior classification tasks. Typically, it consists of a pre-defined number of layers of cascaded auto-encoders (AE) and a softmax classifier [59], which gives DNN the joint advantages of efficient attribute generation and accurate classification. These characteristics of DNN provide more advantages than other conventional classification algorithms available in the literature.

There are a large number of optimization-based approaches available in the literature. Most of them are not found useful in training a DNN structure due to the aforementioned problems. However, a number of optimization techniques have been proposed recently to deal with the complexity inherent in machine learning approaches, including the training of a DNN structure. Of them, optimization algorithms such as gradient descent (GD) [60, 61], stochastic gradient descent (SGD) [62], and conjugate gradient [63] are found to be very useful. The GD algorithm can easily be used for linear systems. However, it is not usually recommended when the optimization task has a high-dimensional search space in which a number of local minima exist. Hinton and Salakhutdinov [60] suggested that the GD approach can be useful for training a DNN architecture when the optimization parameters, such as the weights, are initialized with values close to an optimum solution. In fact, this condition is very difficult to fulfill and, consequently, the algorithm gets trapped in local minima. Moreover, the convergence speed of this algorithm is found to be very slow when dealing with a large dataset. In the case of high-dimensional optimization problems, the stochastic GD algorithm is frequently used due to its faster rate of convergence and easy implementation. Another optimization algorithm, root mean square propagation (RMSProp), produces better performance in this case since it uses the average of the second moments of the gradients (the uncentered variance) [64]. Moreover, a comprehensive review by Ruder suggested that adaptive moment estimation (ADAM) is comparatively superior to RMSProp due to its better bias-correction procedure [65].

Based on the review, some issues are identified in occupational incident prediction and analysis domain, which are summarized below.

  1. (i)

    Hidden information in unstructured text (i.e., brief incident descriptions) is not used for the extraction of incident patterns.

  2. (ii)

    In the domain of incident prediction and analysis, unstructured data (i.e., text) have rarely been used along with categorical data.

  3. (iii)

    Deep neural networks (DNN) have seen very little use for incident prediction using both structured and unstructured data.

  4. (iv)

    Parameter optimization of DNN using adaptive moment estimation (ADAM) remains unexplored in this domain.

Therefore, to address these issues, our present study endeavors to contribute as follows:

  1. (i)

    Unstructured text (e.g., incident narratives) and structured data (i.e., categorical predictor attributes) have been used together to extract a feature vector from each sample corresponding to each occupational incident. Together, all feature vectors constitute the feature map.

  2. (ii)

    The k-modes algorithm, missing value handling, and class imbalance handling are adopted to obtain a noise-free feature map.

  3. (iii)

    An ADAM-based DNN is developed for the prediction of occupational incident outcomes.

In this paper, we propose an adaptive moment estimation (ADAM)-based DNN (i.e., ADNN) classifier for the prediction of incident outcomes using incident data collected from a steel manufacturing plant. The motivation behind using this optimization algorithm with DNN lies in a few of its advantages; for example, it demands little memory and can be implemented easily and efficiently. In our study, it has been demonstrated that the proposed approach is better than the other optimized DNN classifiers, namely RMSProp- and SGD-based DNN. In addition, the classification performance of the proposed approach has been tested on five different benchmark datasets from the literature. Finally, a comparative study has been made between the performance of the ADNN classifier and that of other state-of-the-art classifiers, namely SVM, k-NN, and RF, to explore the efficiency of our proposed strategy in incident prediction.

The remainder of the paper is structured in the following way: In Sect. 2, the proposed methodology is discussed in brief. In Sect. 3, a case study is provided with data collection, data description, and data preprocessing steps. Results and discussions of the analyses are presented with the statistical tests in Sect. 4, and finally, in Sect. 5, the conclusion is drawn along with the scope for future studies.

2 Methodology

The proposed methodology comprises four phases, which are displayed in Fig. 1.

Fig. 1 Proposed research methodological flowchart

In Phase-I, common data pre-processing tasks, namely topic modeling, missing data handling, class imbalance handling, outlier handling, and data transformation, are performed. In the next phase, i.e., Phase-II, three optimizers, namely SGD, RMSProp, and ADAM, are used to tune the parameters of the DNN algorithm. In Phase-III, the best classifier is selected through a comparative study, and in the final phase, i.e., Phase-IV, prediction of incident outcomes using the best classifier is performed. Initially, our models over-fitted the training data: the training accuracies were found to be much higher than the testing accuracies. To overcome this shortcoming, hyper-parameter tuning of the models has been done using 10-fold cross-validation on the training dataset, following some earlier studies [66,67,68]. The testing dataset has been used only for model evaluation. In this cross-validation process, first, the dataset is shuffled randomly and then split into 10 groups. One of the 10 groups is taken as the validation set and the rest are taken as the training set. The model or classifier is then fitted on the training set and evaluated on the validation set based on the performance score. This procedure is repeated 10 times on different validation sets, and the model performance is obtained by averaging the 10 performance scores. Based on this average performance score, the final model is evaluated, and the algorithm with the highest accuracy is considered the best one. The methods of data pre-processing (i.e., text data handling using topic modeling, missing data handling, class imbalance handling, outlier handling, and data transformation from categorical to continuous form), classification (i.e., DNN), and optimization (i.e., SGD, RMSProp, and ADAM) used in this study are briefly discussed in the following section. Some important notations used in this study are mentioned in "Appendix A".
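For clarity, the sketch below illustrates this 10-fold cross-validation tuning protocol with scikit-learn. It is a minimal illustration on synthetic data with a random forest as a stand-in model; the actual study tunes the hyper-parameters of optimized DNNs, and the parameter grid and dataset names here are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the pre-processed training feature map (3 outcome classes)
X_train, y_train = make_classification(n_samples=400, n_features=12, n_classes=3,
                                        n_informative=6, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)   # shuffle, then 10 folds
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
                      cv=cv, scoring="accuracy")
search.fit(X_train, y_train)            # scores are averaged over the 10 held-out folds
print(search.best_params_, round(search.best_score_, 3))
# The untouched test set is used only once, for final evaluation of the selected model.
```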

2.1 Data pre-processing

In this section, the five data pre-processing tasks: (i) text data handling, (ii) missing data handling, (iii) class imbalance handling, (iv) outlier handling, and (v) data transformation are discussed below.

2.1.1 Text data handling

The latent Dirichlet allocation (LDA)-based topic modeling used on the unstructured text data is discussed here as a data pre-processing tool. To determine the optimal number of topics, four metrics have been used in this study. The first metric (say, 'Metric1') was developed by Griffiths and Steyvers [69]; it helps to determine the optimal number of topics by calculating the maximum log-likelihood of the data. Cao et al. [70] developed another metric (say, 'Metric2') to compute the stability of a topic structure using the mean cosine distance between every pair of topics; the stability increases as the mean distance decreases. Likewise, Arun et al. [71] developed a Kullback–Leibler (KL) divergence-based metric (say, 'Metric3'), for which the choice of the optimal number of topics depends on the minimum divergence. Of late, Deveaud et al. [72] developed a heuristic search-based metric (say, 'Metric4') to determine the number of latent concepts within the user's query, which is done by maximizing the intra-topic information divergence. Metric1 and Metric4 are based on word coherence, whereas Metric2 and Metric3 are based on word log-perplexity. Therefore, considering the aforementioned metrics simultaneously, two metrics, i.e., Metric2 and Metric3, are to be minimized, whereas the other two, i.e., Metric1 and Metric4, are to be maximized.
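As a rough illustration of this selection step, the sketch below scans candidate topic counts with gensim and reports a coherence score and log-perplexity as stand-ins for the four metrics above; the toy documents, the candidate range, and the use of u_mass coherence are assumptions, not the exact metrics of [69,70,71,72].

```python
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel

# Toy tokenized incident narratives (stand-ins for the real pre-processed text)
docs = [["worker", "fell", "road", "shift", "injury"],
        ["oil", "spill", "floor", "slip", "near", "miss"],
        ["crane", "cable", "load", "snapped", "damage"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

for k in range(2, 12):                                   # candidate numbers of topics
    lda = models.LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
    coherence = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                               coherence="u_mass").get_coherence()   # to be maximized
    perplexity = lda.log_perplexity(corpus)                           # likelihood-based
    print(f"k={k:2d}  coherence={coherence:7.3f}  log-perplexity={perplexity:7.3f}")
```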

After obtaining the optimal number of topics, LDA-based topic modeling is applied to the unstructured text. In this process, it is assumed that both documents and words are obtained from a generative probabilistic model [73]. Each document is generated by the model given below.

  1. (i)

    \(N_{i} \sim\) Poisson distribution (where \(N_{i}\) is a random variable representing the number of words in i-th document)

  2. (ii)

    \(\theta _{i} \sim\) Dirichlet distribution \((\alpha )\) (where \(\theta _{i}\) is a random variable denoting per document-topic proportion, \(\alpha\) is a proportion parameter, \(\alpha <1\))

  3. (iii)

    \(\beta _{T_{j}} \sim\) Dirichlet \((\eta )\) (where \(\beta _{T_{j}}\) is a per-topic (say, \(T_{j}\)) word proportion parameter, such that \(T_{j} \in \left\{ T \right\}\))

  4. (iv)

    \(z_{i,w_{l}} \sim\) Multinomial \((\theta _{i})\) (where \(z_{i,w_{l}}\) implies the topic of each word \(w_{l} \in \left\{ w \right\}\) in the i-th document, \(D_{i}\))

  5. (v)

    \(w_{i,w_{l}} \sim\) Multinomial \(p (w_{i,w_{l}} \mid z_{i, w_{l}}, \beta _{T_{j}})\) (where \(w_{i,w_{l}}\) is the observed word in \(z_{i,w_{l}}\) topic)

According to the LDA theory proposed by [73], the words are generated from the distribution of the topic. Different topics are able to produce similar words. Based on the words generated within each topic, an intuitive meaning can be ascribed to the topic. To estimate the parameters, a joint distribution of observed and latent random variables is used, which can be expressed in the following Eq. (1):

$$\begin{aligned} p(\beta , \theta , z, w \mid \alpha , \eta )&= \prod _{j=1}^{m}p(\beta _{T_{j}} \mid \eta ) \prod _{i=1}^{n}p(\theta _{D_{i}} \mid \alpha ) \\&\quad \left( \prod _{l=1}^{p}p(z_{i,w_{l}} \mid \theta _{D_{i}}) \, p(w_{D_{i}, w_{l}} \mid \beta _{1:m}, z_{i, w_{l}}) \right) \end{aligned}$$
(1)

Using LDA topic modeling, the optimal number of topics is generated. Here, each topic consists of eight key words, which are used as predictors and added to the other categorical predictors. The algorithm for the extraction of topics in terms of key words is defined in Algorithm 1.

Algorithm 1 Extraction of topics in terms of key words (pseudo-code figure)
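A minimal sketch of this extraction step with gensim is given below; the toy narratives, the hyper-parameter settings, and the assignment of the dominant topic as a new categorical attribute are assumptions, and only the choice of nine topics and eight key words per topic follows the study.

```python
from gensim import corpora, models

# 'docs' stands for the tokenized, cleaned incident narratives (stop words removed)
docs = [["worker", "fell", "near", "road", "shift"],
        ["oil", "spill", "floor", "slip", "injury"],
        ["crane", "cable", "load", "snapped", "damage"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(corpus, num_topics=9, id2word=dictionary,
                      alpha="auto", eta="auto", passes=20, random_state=0)
for t in range(lda.num_topics):
    top_terms = lda.show_topic(t, topn=8)            # eight key words per topic
    print(f"Topic{t + 1}:", [(w, round(p, 3)) for w, p in top_terms])

# Assign the dominant topic of each narrative as a new categorical attribute
dominant = [max(lda.get_document_topics(bow, minimum_probability=0.0),
                key=lambda tp: tp[1])[0] for bow in corpus]
```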

2.1.2 Missing data handling

For missing data handling, a random forest classifier is used. It relies on bootstrap sampling, and observations with missing values in a given attribute are imputed using the other (independent) attributes. The algorithm fits an individual tree to each bootstrapped sample and imputes or predicts each missing value as the prediction of a randomly selected decision tree. Consider a universe \(S = <U, A \cup C>\), where U is the information table, A is the set of predictor attributes, and C is the response attribute. Here, U consists of samples: \(U= \left\{ x_{1}, x_{2}, ..., x_{Q} \right\}\), where Q represents the total number of instances in U, and \(A= \left\{ a_{1}, a_{2}, ..., a_{p} \right\}\), where p denotes the number of predictors in U. For an arbitrary attribute \(a_{s} (s \in \left\{ 1, 2, ..., p \right\} )\) having missing values at entries \(i_{mis}^{(s)} \subseteq \left\{ 1, 2, ..., Q \right\}\), the dataset is separated into four parts, which are as follows:

  1. (i)

    Observed values of attribute \(a_{s}\) are denoted as \(y_{obs}^{(s)}\).

  2. (ii)

    Missing values of attribute \(a_{s}\) are denoted as \(y_{mis}^{(s)}\).

  3. (iii)

    The attributes other than \(a_{s}\) with observations \(i_{obs}^{(s)} = \left\{ 1, 2, ..., Q \right\} \setminus i_{mis}^{(s)}\) are denoted as \(x_{obs}^{(s)}\).

  4. (iv)

    The attributes other than \(a_{s}\) with observations \(i_{mis}^{(s)}\) are denoted as \(x_{mis}^{(s)}\).

Now, under such condition, missing value imputation is performed using the random forest algorithm. The steps of this process are given below.

  • Step 1: First, find out the percentage of missing values for each \(a_{s}\) in the dataset.

  • Step 2: Sort \(a_{s}\) according to the ascending order of the percentage of missing values.

  • Step 3: For each \(a_{s}\), the random forest algorithm is trained with predictors \(x_{obs}^{(s)}\) and the response attribute \(y_{obs}^{(s)}\). Then, the missing values \(y_{mis}^{(s)}\) are predicted or imputed using this algorithm, which is tested on \(x_{mis}^{(s)}\). This imputation process is repeated until a termination criterion is satisfied. In this study, the user-defined maximum number of iterations is considered as the stopping criterion.
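The sketch below illustrates Steps 1–3 with pandas and scikit-learn as a simplified, missForest-style loop; the mode-based initial fill, the number of trees, and the fixed iteration count are assumptions rather than the exact procedure used in the study.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rf_impute(df, max_iter=5, random_state=0):
    """Simplified random forest imputation for categorical attributes."""
    miss_mask = df.isna()
    # Steps 1-2: sort attributes by ascending fraction of missing values
    order = miss_mask.mean().sort_values().index
    # Initial fill with column modes so the predictors are complete
    filled = df.apply(lambda c: c.fillna(c.mode()[0]))
    for _ in range(max_iter):                       # Step 3: iterate until max_iter
        for col in order:
            if not miss_mask[col].any():
                continue
            obs, mis = ~miss_mask[col], miss_mask[col]
            X = pd.get_dummies(filled.drop(columns=col))   # one-hot encode predictors
            rf = RandomForestClassifier(n_estimators=100, random_state=random_state)
            rf.fit(X[obs], filled.loc[obs, col])           # train on observed rows
            filled.loc[mis, col] = rf.predict(X[mis])      # impute the missing rows
    return filled

# Usage: imputed = rf_impute(raw_categorical_dataframe)
```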

2.1.3 Class imbalance handling

To handle the class imbalance issue in the data, the Synthetic Minority Over-sampling Technique (SMOTE), proposed by Chawla et al. [74], is used. It is an oversampling technique that oversamples the minority class by generating synthetic samples in the imbalanced dataset. Each sample of the minority class is considered in turn, and new samples are generated along the line segments connecting this sample with its minority-class nearest neighbors. In practice, it is often combined with undersampling of the majority class, which is why it can produce better classification performance than undersampling alone. The steps of this algorithm as used in this study are displayed in Fig. 2.

Fig. 2 Algorithmic flowchart of the SMOTE algorithm
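A minimal sketch of this balancing step with imbalanced-learn is shown below; the synthetic three-class data and the parameter values are placeholders for the encoded incident feature map.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced stand-in for the encoded incident feature map
X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=6, weights=[0.7, 0.2, 0.1], random_state=0)
print("before:", Counter(y))

smote = SMOTE(k_neighbors=5, random_state=0)   # synthetic samples along nearest-neighbor segments
X_res, y_res = smote.fit_resample(X, y)
print("after :", Counter(y_res))
# Note: purely categorical predictors would need SMOTENC or prior numeric encoding.
```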

2.1.4 Outlier handling

The word ‘outlier’ indicates a data object that deviates significantly from the rest of the data. Handling such data is very important, as their existence may negatively impact the classifier’s performance. To handle this issue, the k-modes clustering algorithm has been used in this study [75]. It extends the k-means algorithm to the categorical domain by using a suitable dissimilarity measure defined over categorical attributes. The pseudo-code for outlier detection using the k-modes algorithm is presented in Algorithm 2.

Algorithm 2 Outlier detection using the k-modes algorithm (pseudo-code figure)
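The sketch below shows the general idea with the Python kmodes package: cluster the categorical records, then flag records whose matching dissimilarity to their cluster mode exceeds a cut-off. The toy data, the number of clusters, and the dissimilarity threshold are assumptions; Algorithm 2 of the study is not reproduced here.

```python
import numpy as np
from kmodes.kmodes import KModes   # assumes the 'kmodes' package is installed

# Toy categorical dataset (rows = incident records, columns = categorical attributes)
X = np.array([["day", "machine_on", "operator"],
              ["night", "machine_off", "contractor"],
              ["day", "machine_on", "operator"],
              ["night", "machine_on", "contractor"],
              ["day", "maintenance", "visitor"]])        # last row: a potential outlier

km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=0)
labels = km.fit_predict(X)
modes = km.cluster_centroids_

# Matching dissimilarity of each record to its cluster mode: count of mismatched attributes
diss = np.array([(X[i] != modes[labels[i]]).sum() for i in range(len(X))])
outliers = np.where(diss >= 2)[0]      # a threshold of 2 mismatches is an assumed cut-off
print("flagged outliers:", outliers)
```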

2.1.5 Data transformation

After the outlier detection and reduction, the reduced dataset is transformed from its categorical form to the continuous form using a gravity factor (GF)-based normalization technique [6]. In this data processing stage, uncertainty arises as the values of the decision classes are user-defined. GF is a normalized value, which is calculated from the frequency of each of the categories in each attribute. The following Eq. (2) is used to estimate the GF values of the data:

$$\begin{aligned} GF = \frac{\sum _{i=1}^{n}x_{i}y_{i}}{n \times \sum _{i=1}^{n}x_{i}} \end{aligned}$$
(2)

where \(x_{i}\) and \(y_{i}\) denote the percentage of each category of each predictor and the corresponding normalization factor for risk, respectively. The value of n is taken as three since the response attribute ‘Incident outcomes’ has three classes (i.e., injury, near-miss, and property damage). The normalization scales the GF values between 0 and 1. The pseudo-code for computing GF values is shown in Algorithm 3.

Algorithm 3 Computation of GF values (pseudo-code figure)
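A small worked sketch of Eq. (2) is given below; the interpretation of \(x_i\) as the class-wise percentages of a single predictor category and the specific risk factors \(y_i\) are assumptions for illustration.

```python
def gravity_factor(x, y, n=3):
    """Eq. (2): GF = sum(x_i * y_i) / (n * sum(x_i)) for one category of a predictor.

    x : percentages x_i of the category's records falling in each of the n outcome
        classes (injury, near-miss, property damage) -- assumed interpretation
    y : user-supplied risk normalization factors y_i for the n classes (assumed values)
    """
    return sum(xi * yi for xi, yi in zip(x, y)) / (n * sum(x))

# Hypothetical example: one category of a predictor with class-wise percentages
x = [55.0, 30.0, 15.0]        # % injury, % near-miss, % property damage
y = [1.0, 0.6, 0.3]           # assumed risk factors for the three classes
print(round(gravity_factor(x, y), 3))
```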

2.2 Classification and optimization algorithms

The classification algorithm used in this study is DNN, which is basically a stacked auto-encoder (SAE) consisting of cascaded auto-encoders and a softmax classifier [43]. A brief description of the auto-encoder, the SAE, and the softmax classifier is given below. In addition, the three optimization algorithms, namely SGD, RMSProp, and ADAM, are also discussed in this section.

2.2.1 The auto-encoder

The auto-encoder (AE) is basically a feed-forward ANN. It consists of three layers: one input layer, one output layer, and a hidden layer in between. The AE is constructed in such a manner that the number of nodes at the output layer equals that of the input layer, so as to map the input space to the feature space. In Fig. 3, a network structure with a single hidden layer is displayed. The numbers of inputs and outputs are equal, both being M, and the number of hidden nodes is N, where both M and N are non-negative integers. The left and right portions of an AE network are called ‘encoder’ and ‘decoder,’ respectively. The inputs of the encoder are the inputs of the AE, and its outputs are the inputs of the decoder; the output of the decoder is the output of the AE. If there are outputs \(c = [c_{1} \; c_{2} \; c_{3} \; ... \; c_{N}]^{T}\), activation function (usually sigmoidal) f, inputs \(x=[x_{1} \; x_{2} \; x_{3} \;...\; x_{M}]^{T}\), biases \(b = [b_{1} \; b_{2} \; b_{3} \; ... \; b_{N}]^{T}\), and weights \(W = [w_{1} \; w_{2} \; w_{3} \; ... \; w_{N}]^{T}\), the input–output relationship in the encoder can be denoted by \(c = g_{E} (b+W^{T}x)\) and expressed as the following Eq. (3) [43]:

$$\begin{aligned} c= f(b+W^{T}x) \end{aligned}$$
(3)

Similarly, for the decoder, if there are outputs \({\widehat{x}} = [{\widehat{x}}_{1} \; {\widehat{x}}_{2} \; {\widehat{x}}_{3} \; ... \; {\widehat{x}}_{M}]^{T}\), activation function (usually sigmoidal) \({\widehat{f}}\) at the output layer, biases \({\widehat{b}} = [{\widehat{b}}_{1} \; {\widehat{b}}_{2} \; {\widehat{b}}_{3} \; ... \; {\widehat{b}}_{M}]^{T}\), and weights \({\widehat{W}} = [{\widehat{w}}_{1} \; {\widehat{w}}_{2} \; {\widehat{w}}_{3} \; ... \; {\widehat{w}}_{M}]^{T}\), the relationship between the input and output of the decoder is denoted as \({\widehat{x}} = g_{D}({\widehat{b}}+ {\widehat{W}}^{T}c)\) and expressed as the following Eq. (4):

$$\begin{aligned} {\widehat{x}} = {\widehat{f}}({\widehat{b}}+ {\widehat{W}}^{T}c) \end{aligned}$$
(4)
Fig. 3 Auto-encoder network

As stated earlier, an auto-encoder consists of two parts: encoder and decoder. Therefore, the input–output relationship of an auto-encoder is denoted as Eq. (5):

$$\begin{aligned} g_{AE}^{1} = g_{E} \circ g_{D} \end{aligned}$$
(5)

where \(g_{E}\) and \(g_{D}\) denote the encoder and decoder functions, respectively, and \(\circ\) denotes composition, i.e., the output of the encoder is fed to the decoder as input.
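To make Eqs. (3)–(5) concrete, the NumPy sketch below computes one forward pass of a single auto-encoder; the sigmoid activation, the random weights, and the dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
M, N = 6, 3                      # M inputs/outputs, N hidden nodes
x = rng.random(M)                # one input sample

# Encoder parameters: W (M x N), b (N,)  ->  Eq. (3): c = f(b + W^T x)
W, b = rng.normal(size=(M, N)), np.zeros(N)
c = sigmoid(b + W.T @ x)

# Decoder parameters: W_hat (N x M), b_hat (M,)  ->  Eq. (4): x_hat = f_hat(b_hat + W_hat^T c)
W_hat, b_hat = rng.normal(size=(N, M)), np.zeros(M)
x_hat = sigmoid(b_hat + W_hat.T @ c)

reconstruction_error = np.mean((x - x_hat) ** 2)   # training minimizes this error
print(round(reconstruction_error, 4))
```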

2.2.2 The structure of a stacked auto-encoder (SAE)

The structure of an SAE is developed by cascading a number of AEs (refer to Fig. 4). Let L be the number of AEs that are cascaded to form the stacked auto-encoder (SAE), and let \(g_{AE}^{1}, g_{AE}^{2}, ..., g_{AE}^{L}\) be the functions of the aforesaid L auto-encoders. Then, the operation of the SAE can be expressed as the following Eq. (6):

$$\begin{aligned} g_{SAE} = g_{AE}^{1} \circ g_{AE}^{2} \circ g_{AE}^{3} \circ ... \circ g_{AE}^{L} \end{aligned}$$
(6)
Fig. 4 Cascading-based auto-encoder network

2.2.3 The softmax classifier

The softmax classifier is a linear classifier that can handle multiple classes; its working principle is a generalization of that of logistic regression [76]. After the development of the DNN classifier, training is performed using the optimization algorithms, namely the SGD, RMSProp, and ADAM algorithms, which are described below.
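Putting the pieces together, the Keras sketch below stacks the encoder layers of an SAE and attaches a softmax output layer, compiled with the ADAM optimizer. The layer sizes, the sigmoid activations, and the omission of layer-wise AE pre-training are simplifying assumptions; only the learning rate 0.061 is taken from the ADNN tuning reported in Sect. 4.3.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

n_features, n_classes = 20, 3          # hypothetical feature-map width and outcome classes

# Stacked encoders (the SAE part) followed by a softmax output layer
dnn = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(12, activation="sigmoid"),   # encoder of AE 1
    layers.Dense(8, activation="sigmoid"),    # encoder of AE 2
    layers.Dense(5, activation="sigmoid"),    # encoder of AE 3
    layers.Dense(n_classes, activation="softmax"),
])
dnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.061),
            loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# dnn.fit(X_train, y_train, epochs=100, validation_split=0.1)
```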

2.2.4 SGD algorithm

SGD is a popular optimization algorithm used to minimize an objective function \(J(\theta )\) with model parameters \(\theta\). The parameters are updated in the direction opposite to the gradient \(\bigtriangledown _{\theta } J(\theta )\) [65]. The size of the steps is determined by the learning rate \(\eta\). Let \(\theta ^{t-1}\) be the value of the parameters at the \((t-1)\)-th iteration. Then, the updated value of the parameters at the t-th iteration (where the input is x and the output is y) is defined as:

$$\begin{aligned} \theta ^{t} = \theta ^{t-1} - \eta \times \bigtriangledown _{\theta ^{t-1}}J (\theta ^{t-1};x;y) \end{aligned}$$
(7)

where \(t = 1,2,...,T\) and T represents the total number of iterations. At each iteration, the values of the parameters are updated for every sample present in the dataset. Because SGD updates the parameters one sample at a time, it works very fast and can be used in online settings. Since it performs frequent updates with high variance, the objective function fluctuates heavily, which may help the algorithm reach better local minima; however, this characteristic also prevents it from converging to an exact minimum. Slowly decreasing the learning rate may help the algorithm converge to a local or global minimum. The pseudo-code of the SGD algorithm is given in Algorithm 4.

Algorithm 4 SGD algorithm (pseudo-code figure)
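A minimal NumPy sketch of the per-sample update in Eq. (7) follows; the toy least-squares objective and the learning-rate value are illustrative assumptions.

```python
import numpy as np

def sgd(grad_fn, theta0, data, eta=0.01, epochs=10):
    """Plain SGD: one parameter update per sample, following Eq. (7)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        np.random.shuffle(data)                  # visit the samples in random order
        for x, y in data:
            theta -= eta * grad_fn(theta, x, y)  # theta^t = theta^{t-1} - eta * grad J
    return theta

# Toy usage: fit y = w*x by least squares; gradient of 0.5*(w*x - y)^2 w.r.t. w
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
w = sgd(lambda th, x, y: np.array([(th[0] * x - y) * x]), [0.0], data, eta=0.05, epochs=50)
print(w)
```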

2.2.5 RMSProp algorithm

RMSProp is a GD-based optimization algorithm in which the learning rate is adapted for each parameter. To resolve the issues of the ‘vanishing gradient’ and entrapment of the solution in a local optimum, the RMSProp algorithm is used for training a DNN [77]. It uses a moving average of squared gradients. It can balance the step size by decreasing the step for large gradients to avoid ‘exploding’ and increasing the step for small gradients to avoid ‘vanishing.’ The algorithm weighs the recent past more heavily than the distant past, which makes it an effective optimizer for DNNs. The pseudo-code of the RMSProp algorithm is given in Algorithm 5.

Algorithm 5 RMSProp algorithm (pseudo-code figure)
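The core of a single RMSProp update can be sketched in a few lines of NumPy, as below; the decay rate, learning rate, and epsilon values are common defaults assumed for illustration.

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update: a moving average of squared gradients scales the step."""
    cache = rho * cache + (1.0 - rho) * grad ** 2        # running E[g^2]
    theta = theta - lr * grad / (np.sqrt(cache) + eps)   # per-parameter adaptive step
    return theta, cache
```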

2.2.6 ADAM algorithm

ADAM is a gradient-based, first-order stochastic optimizer [64]. From the first and second moment estimates of the gradients, it calculates adaptive learning rates (ALR) for the different parameters. The main advantage of this method is its capability of handling sparse gradients. There are a few more advantages; for example, it works well with non-stationary objectives, its parameter updates are invariant to rescaling of the gradient, and it has little memory requirement. The algorithm starts by initializing the moving averages (MAs) at zero. Let \(g_{t}\) be the gradient of a stochastic objective function J at the t-th iteration, and let \(m_{t}\) be the first moment (i.e., the mean of the gradients) and \(v_{t}\) be the second moment (i.e., the uncentered variance of the gradients) at the t-th iteration. \(m_{t}\) and \(v_{t}\) are defined as follows:

$$\begin{aligned} g_{t}= & {} \bigtriangledown _{\theta ^{t}} J(\theta ^{t}) \end{aligned}$$
(8)
$$\begin{aligned} m_{t}= \,& {} \beta _{1}m_{t-1} + (1-\beta _{1})g_{t} \end{aligned}$$
(9)
$$\begin{aligned} v_{t}= \,& {} \beta _{2}v_{t-1} + (1-\beta _{2})g_{t}^{2} \end{aligned}$$
(10)

where \(m_{t-1}\) and \(v_{t-1}\) denote the first and second moments of the gradients at the \((t-1)\)-th iteration, respectively. The hyper-parameters \(\beta _{1}\) and \(\beta _{2}\) indicate the first and second exponential decay rates for the moment estimates, respectively, with \(\beta _{1},\beta _{2} \in [0,1)\). These parameters basically control the decay rates of the MAs. Since the MAs are initialized at zero, the moment estimates become biased toward zero. In order to counteract this initialization bias, bias-corrected first and second moment estimates are computed using the following Eqs. (11) and (12), respectively:

$$\begin{aligned} {\widehat{m}}_{t}= & {} \frac{m_{t}}{1-\beta _{1}^{t}} \end{aligned}$$
(11)
$$\begin{aligned} {\widehat{v}}_{t}= & {} \frac{v_{t}}{1-\beta _{2}^{t}} \end{aligned}$$
(12)

Later, the parameters are updated using the update rule in ADAM, as given in Eq. (13):

$$\begin{aligned} \theta ^{t+1} = \theta ^{t} - \frac{\alpha }{ \sqrt{{\widehat{v}}_{t}} + \epsilon } {\widehat{m}}_{t} \end{aligned}$$
(13)

where \(\epsilon\) is a very small constant that prevents division by zero. The pseudo-code of the ADAM algorithm is provided in Algorithm 6, and the pseudo-code of the ADAM-based DNN, i.e., ADNN, is provided in Algorithm 7.

Algorithm 6 ADAM algorithm (pseudo-code figure)
Algorithm 7 ADNN algorithm (pseudo-code figure)
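For reference, the NumPy sketch below implements the update rule of Eqs. (8)–(13) on a toy quadratic objective; the default hyper-parameter values and the fixed number of iterations are assumptions.

```python
import numpy as np

def adam(grad_fn, theta0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, T=1000):
    """ADAM update rule, Eqs. (8)-(13)."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)            # first moment estimate
    v = np.zeros_like(theta)            # second (uncentered) moment estimate
    for t in range(1, T + 1):
        g = grad_fn(theta)                          # Eq. (8)
        m = beta1 * m + (1 - beta1) * g             # Eq. (9)
        v = beta2 * v + (1 - beta2) * g ** 2        # Eq. (10)
        m_hat = m / (1 - beta1 ** t)                # Eq. (11), bias correction
        v_hat = v / (1 - beta2 ** t)                # Eq. (12), bias correction
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Eq. (13)
    return theta

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2*theta
print(adam(lambda th: 2 * th, [5.0, -3.0], lr=0.1, T=500))
```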

3 Case study

A total of 9473 incident records have been retrieved from a steel manufacturing plant covering the period from 2010 to 2013. The dataset comprises a total of thirteen attributes (eleven categorical and two free unstructured text attributes), of which ‘Incident outcome’ is deemed the response attribute. A short description of the attributes, with their percentage of occurrence in the dataset, is provided in Table 1. After data collection, the records have been preprocessed. In pre-processing, some basic tasks have been carried out, for example, manual removal of inconsistencies and imputation of missing data using the random forest algorithm. Then, the unstructured text attributes have been converted into a categorical attribute using topic modeling. Thereafter, the class imbalance problem has been handled using the SMOTE algorithm.

Table 1 Attributes with percentage of occurrence in the dataset

4 Results and discussions

In this section, the generation of a new categorical attribute using topic modeling, the evaluation of the importance of attributes using the chi-square approach, the hyper-parameter study, and the prediction of incident outcomes are discussed in detail.

4.1 Generation of a categorical attribute using topic modeling

There are two text attributes, called ‘BD’ and ‘EL,’ within the dataset, which comprise the descriptions of incidents. LDA topic modeling has been used to create a categorical attribute from these texts. The four metrics have been used simultaneously to determine the optimal number of topics from the unstructured text attributes. From the topic modeling, a total of nine topics are extracted (refer to Fig. 5). For each topic, an exhaustive list of terms, ranked by probability of occurrence, is generated, and a meaningful event is derived from this list. For the purpose of visualization, the top eight terms per topic with their corresponding probabilities of occurrence are shown in Table 2. For instance, in Topic1, the top eight terms are ‘road,’ ‘shift,’ ‘near,’ ‘injury,’ ‘come,’ ‘sudden,’ ‘duty,’ and ‘fell.’ From this list of extracted terms, it can be inferred that Topic1 describes ‘Falling’ as a meaningful event, which has later been validated by five domain experts.

Fig. 5 Optimal number of topics from incident narratives

Table 2 Eight top terms with probabilities for each of the nine topics

4.2 Evaluation of feature importance using Chi-square approach

Once the dataset has been pre-processed, a chi-square test is conducted to evaluate the importance of the attributes. The higher chi-square values in Fig. 6 suggest that attributes such as ‘Employee types,’ ‘Topic,’ ‘Incident events,’ and ‘Machine condition’ are significant predictors for the prediction of incident outcomes.

Fig. 6 Importance of attributes in prediction

4.3 Hyper-parametric study

A tree-based regression model is used to find the optimal values of the hyper-parameters of DNN that produce the best accuracy. The hyper-parameters of a DNN include the learning rate, the activation function, the number of hidden layers, and the number of neurons in each hidden layer. First, the model is evaluated based on parameter values that are initially set at random. The model is then improved by sequentially evaluating the cost function for a number of evaluations, which is set equal to 10 (i.e., ‘n_calls=10’). This is performed for all three optimized DNNs. For ADNN, the best results are obtained with a setting of six hidden layers with 5, 7, 7, 6, 4, and 5 neurons, respectively, a learning rate of 0.061, and the ‘rectified linear unit (ReLU)’ activation function. For RMSProp-DNN, the best results are achieved with a setting of five hidden layers with 3, 3, 5, 4, and 4 neurons, respectively, a learning rate of 0.0041, and the ‘ReLU’ activation function. Similarly, for SGD-DNN, the best results are obtained with five hidden layers with 6, 4, 3, 4, and 3 neurons, respectively, a learning rate of 0.00001, and the ‘tanh’ activation function. The ranges and the optimal values for the optimized classifiers are listed in Table 3. The convergence plots for all three cases are also recorded and depicted in Fig. 7a–c.

Table 3 Hyper-parameters of the optimized classifiers

4.4 Prediction

This section demonstrates the classification performances of the three optimized DNNs, namely ADNN, RMSProp-DNN, and SGD-DNN. Evaluation of the performances is done based on the incident data and five other benchmark datasets, namely ‘Breast cancer,’ ‘Iris,’ ‘PID,’ ‘Hungarian,’ and ‘Cleveland,’ retrieved from the UCI Machine Learning Repository. Besides these, three other state-of-the-art classifiers, namely k-NN, SVM, and RF, are also employed using 10-fold cross-validation for further checking of the performance. From Table 4, it can be seen that the maximum accuracy of 78.8% is generated by ADNN, whereas RMSProp-DNN and SGD-DNN produce accuracies of 78.1% and 76.1%, respectively. The other algorithms, RF, k-NN, and SVM, produce best accuracies of 73.28%, 70.76%, and 65.11%, respectively. The corresponding graphs of accuracy versus epochs are also depicted for the three optimized algorithms in Fig. 7d–f. In addition, for the comparative study, loss versus epochs is plotted for the three models (refer to Fig. 7g). The trend of accuracies obtained by the six models, namely ADNN, RMSProp-DNN, SGD-DNN, SVM, k-NN, and RF, is somewhat similar in nature in terms of the order of best accuracies on the five benchmark test datasets. Hence, from the experimental results reported in Table 4, it is evident that the ADNN classifier produces the highest accuracy for all the datasets. Although k-NN, RF, and SVM are also compared with the proposed ADNN, the DNN parameters are tuned using only the SGD, RMSProp, and ADAM optimizers, since these optimizers are best suited for tuning deep learning model parameters; therefore, convergence plots are exhibited for these three optimizers only.

It is to be noted that, in all the datasets, information related to the attributes and their respective classes is given. Using this information, important attributes are extracted and used in the analyses for better classification performance. The extraction of attributes is done using a stacked auto-encoder. For example, the stacked auto-encoder is applied to the ‘Iris’ data, which have four attributes, namely ‘sepal length,’ ‘sepal width,’ ‘petal length,’ and ‘petal width,’ and three decision or response classes, namely ‘setosa,’ ‘virginica,’ and ‘versicolour.’ The dataset is divided into two sets: a training set and a test set. The number of inputs used in an auto-encoder is the same as the number of attributes in the data. The weights are multiplied with the attribute values of each input sample to extract its feature values; this feature extraction is done in the hidden layer of the auto-encoder. For all the samples in the input data, a feature map is generated, constituted by all the feature values generated from the input data. In the stacked auto-encoder, the generated feature map is fed to the next auto-encoder to obtain a reduced feature map. In this way, the feature map is passed through all the auto-encoders, finally generating a further reduced feature map. The input data can thus be represented concisely by their reduced feature map. Instead of taking the entire input data, the corresponding reduced feature map is used for softmax classification, thereby increasing the classification accuracy as well as reducing the computational time. Therefore, extracting a useful feature map and classifying it with softmax is more advantageous than applying softmax classification directly to the entire input data.

Fig. 7 Convergence plots: a error plot of ADNN, b error plot of RMSProp-DNN, c error plot of SGD-DNN, d accuracy plot of ADNN, e accuracy plot of RMSProp-DNN, f accuracy plot of SGD-DNN, and g loss versus epochs plot of the ADAM-, RMSProp-, and SGD-based DNN classifiers

4.4.1 Performance evaluation and comparison

Besides accuracy, other performance measures, including sensitivity (i.e., recall), F-measure, and precision, are also evaluated to compare the classification models. The results of ADNN, RMSProp-DNN, SGD-DNN, SVM, k-NN, and RF are summarized in Table 5.

Table 4 Best accuracy of different classifiers on different datasets using 10-fold cross-validation
Table 5 Performance metrics of the six classifiers

4.4.2 Statistical test for significance

Following the strategy adopted by [78,79,80], two nonparametric tests, the Wilcoxon signed-ranks test and the Mann–Whitney U test, are carried out at the 95% confidence level (i.e., significance level \(\alpha = 0.05\)) to compare the performance of ADNN with each of the other two models, i.e., RMSProp-DNN and SGD-DNN. Results reveal that there exist significant differences between ADNN and the other two models, since \(p < 0.05\) (refer to Table 6). In addition, the results of the Mann–Whitney U test also support the findings of the previous test (refer to Table 7).

Table 6 Results of the Wilcoxon Signed Rank test for ADNN, RMSProp-DNN, and SGD-DNN for the incident dataset
Table 7 Results of the Mann–Whitney U test for ADAM-DNN (ADNN), RMSProp-DNN, and SGD-DNN for each dataset
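The two tests can be reproduced in a few lines with SciPy, as sketched below on hypothetical fold-wise accuracies (the values shown are placeholders, not the study's results).

```python
from scipy.stats import wilcoxon, mannwhitneyu

# Hypothetical fold-wise accuracies of ADNN vs. RMSProp-DNN (10-fold CV)
adnn    = [0.79, 0.78, 0.80, 0.77, 0.79, 0.78, 0.80, 0.79, 0.77, 0.78]
rmsprop = [0.78, 0.77, 0.78, 0.76, 0.78, 0.77, 0.79, 0.78, 0.76, 0.77]

w_stat, w_p = wilcoxon(adnn, rmsprop)                                # paired, nonparametric
u_stat, u_p = mannwhitneyu(adnn, rmsprop, alternative="two-sided")   # unpaired
print(f"Wilcoxon p={w_p:.4f}, Mann-Whitney p={u_p:.4f}")             # reject H0 if p < 0.05
```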

4.4.3 Robustness checking of the classifiers

Robustness checking of the classifiers is carried out using five independent runs with 10-fold cross-validation in every run. Adopting the process of Oztekin et al. [81], seeds are selected for splitting the dataset into training and testing sets. Five different numbers, i.e., 221, 223, 225, 227, and 229, are assigned as seeds for the five runs, which in turn produces a set of 50 cross-validation accuracies (i.e., 10 cross-validation accuracies per seed \(\times\) 5 seeds). Based on these values, a box plot is generated for each of the six classifiers (i.e., ADAM-DNN, RMSProp-DNN, SGD-DNN, SVM, k-NN, and RF) (refer to Fig. 8). From the figure, it is evident that the ADNN algorithm shows the maximum accuracy values with the least dispersion, whereas the minimum accuracies are yielded by the SVM algorithm and the maximum dispersion of accuracies is observed for the k-NN algorithm. Therefore, from the comparative study, the ADNN classifier can be deemed the most robust model.

Fig. 8 Box plot analysis for robustness checking of different classifiers
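A minimal sketch of this robustness protocol (5 seeds × 10 folds, then a box plot) is given below using scikit-learn, synthetic data, and a random forest as a stand-in classifier; everything except the seed values and the 5 × 10 design is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the incident feature map; RF as a stand-in classifier
X, y = make_classification(n_samples=600, n_features=12, n_classes=3,
                           n_informative=6, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

accs = []
for seed in (221, 223, 225, 227, 229):                  # five independent runs
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    accs.extend(cross_val_score(clf, X, y, cv=cv, scoring="accuracy"))  # 10 scores per seed

plt.boxplot(accs)                                       # 50 accuracies in total
plt.ylabel("10-fold CV accuracy")
plt.title("Robustness check (5 seeds x 10 folds)")
plt.show()
```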

From the experimental analyses, it is to be noted that one or more hyper-parameters are set to particular values in an ML approach, and these values influence the testing accuracy of the classification algorithms. With the proper selection of the hyper-parameter values, the ML algorithm performs with optimum accuracy. Therefore, parameter tuning is a very important task in an ML approach. This tuning process comprises three steps: Step 1: the parameters are randomly initialized with some weight values; Step 2: the error/loss is calculated based on these weight values; and Step 3: the error is propagated back to update the weight values so that the error is minimized. The backpropagation of the errors can be driven by several optimization algorithms, such as the gradient descent (GD) and stochastic gradient descent (SGD) algorithms. In the GD-based optimization algorithm, the error is first calculated over the entire dataset and then back-propagated for weight updating; therefore, it takes a long time to move even a single step closer to the optimum weight values. This problem is alleviated in SGD, where the parameters are updated after every single sample (i.e., mini batches of size one), thereby speeding up the training. These two optimization algorithms can be further accelerated by incorporating momentum, which maintains the desired direction of the learning process so that the parameters can reach optimality in a comparatively shorter time; as stated before, SGD is faster than GD and, hence, SGD with momentum is superior to GD with momentum. RMSProp is a GD-based optimization algorithm that additionally adapts the step size using a moving average of squared gradients; therefore, it takes less time than plain GD to achieve optimality. ADAM combines SGD-style stochastic updates and momentum (the first moment estimate) with RMSProp-style adaptive learning rates: the stochastic updates reduce the per-update computation and the momentum accelerates the learning process. Therefore, ADAM is superior to the GD, SGD, and RMSProp algorithms in terms of computation time.

5 Conclusions

The present study proposes a new methodology for the development of a prediction model that enables us to predict incident outcomes using the hidden information underlying the data. DNN, an effective and powerful classifier, has been used for this task. The findings of the study allow us to draw some useful insights regarding the handling of unstructured text data and the parameter optimization of the classifiers. For instance, topic modeling is found to be a very effective tool for the analysis of unstructured texts. Moreover, using chi-square testing, ‘Topic’ is also found to be one of the important predictors of incident outcomes. Besides this, other attributes, including ‘Employee types,’ ‘Incident events,’ and ‘Machine condition,’ are found to be important determinants of the response attribute. Key words related to each topic are added to the other categorical predictors to form the input feature space, which is fed to the DNN for prediction. In order to achieve improved classification accuracy, the parameters of the DNN are tuned by the three optimization algorithms, namely RMSProp, ADAM, and SGD, separately. From this study, it is evident that the proposed approach, ADNN, is the best classifier with the highest accuracy. In support of the experimental findings, other algorithms, namely SGD-DNN, RMSProp-DNN, SVM, k-NN, and RF, have also been applied to the incident dataset, and the results reveal that the ADNN classifier outperforms the others in all cases. Further, all the algorithms used in this study have been tested on five different benchmark datasets, and in all these cases the ADNN algorithm performs the best. In order to check whether the performance of the ADNN algorithm differs significantly from that of the other algorithms (i.e., SGD-DNN and RMSProp-DNN), two statistical tests, namely the Wilcoxon signed-rank test and the Mann–Whitney U test, have been carried out. The results reveal that our proposed algorithm performs significantly better than the others. Finally, using box plot analysis, ADNN is found to be the most robust classifier. Therefore, the present study is expected to contribute in both theoretical and practical aspects.

5.1 Theoretical contributions

From the theoretical point of view, the study offers a number of contributions. First, the proposed methodology shows a new way to handle the issue of using unstructured texts in analysis through LDA-based topic modeling. Second, the methodology explores a strategy of parameter optimization of classifiers for increasing prediction performance. Third, the higher predictive accuracy of the optimized classifiers reveals that incidents do not occur in a chaotic fashion; hidden patterns do exist. Therefore, these patterns can be explored and captured with the use of machine learning techniques. This finding suggests that occupational safety should be studied empirically in a systematic way rather than strictly following a qualitative approach through subjective, expert-opinion-based data analysis.

5.2 Practical implications

From the practical point of view, the study has some real implications. The study can help decision-makers, such as safety professionals, to predict the possible outcomes of incidents, be it injury, near-miss, or property damage. Based on the predicted outcome, safety-related decisions can be undertaken, such as keeping working places clean and free of spillage of oil or any liquid, maintaining proper illumination levels at working places, removing exposed cables from working places, and others. In addition, it can help decision-makers pre-process data by addressing the issues of handling unstructured text in analysis. The use of LDA helps in automatic text classification, which enables safety managers to identify useful information (such as probable accidents with severities) by extracting and relating relevant data present in documents. It is particularly useful when applied to proactive data (i.e., information that may lead to an incident). Proactive data are data collected prior to the occurrence of any incident, for example, an inspection report. Such a report usually narrates the date, time, location, machine condition, working nature of a worker, type of activity being performed by workers, pre-incident working conditions, etc. Using both incident and inspection data, an attribute ‘Incident’ can be generated that has two classes, ‘Yes’ or ‘No’: ‘Yes’ means an incident has occurred, whereas ‘No’ means it has not. Using automatic text classification on this information, a safety manager can at least classify the documents as either ‘Yes’ or ‘No.’ If any new document (from inspection) is classified as ‘Yes’ by the classification algorithm, it means that there is a possibility of the occurrence of an incident, and the safety manager can take proactive measures to prevent it. Moreover, the evaluation of the importance of attributes toward incident outcome prediction can help decision-makers identify the important and unimportant attributes. Therefore, they can put more focus on the important attributes or factors responsible for incidents, and accordingly, these factors can be improved or eliminated from the system to prevent the occurrence of incidents.

However, like other studies, this study also has some limitations. It demands extensive human labor to sanitize the data prior to analysis, which is a time-consuming and less effective process. Further, the dataset consists of a limited number of incident records, and it is noteworthy that using a substantial amount of data in the analysis is necessary for achieving the model's generality. Based on the study carried out, some interesting avenues could be explored in future research. For example, the study could be expanded to develop an ADNN-based automated decision support system (ADSS) [82], which would not only enable prediction but also facilitate smart decision-making based on the generation of rules. Therefore, the present study can be useful for academics and researchers through the development of a new methodology to overcome the issues of unstructured and hidden information in data. Moreover, the study could be expanded beyond the manufacturing industry to sectors such as construction, the process industry, aviation, and so forth.