1 Introduction

A control chart provides a visual representation of a process over time; hence, it is a valuable tool for monitoring process performance in statistical process control (SPC). Control charts have been widely used in industrial processes since the 1920s in various applications; see, for example, Aslam et al. (2020). Usually, control charting in SPC is implemented in two phases: Phase I (retrospective phase) and Phase II (monitoring phase). The aims of Phase I include assessing the stability of the process and estimating the in-control (IC) parameters and control limits through a preliminary retrospective study using historical datasets; in Phase II analysis, one uses the IC model obtained from Phase I as a baseline scheme to control the process in real time and detect changes as quickly as possible (Yeganeh et al. 2022a). It is therefore vital to control the process over time in Phase II, as changes in the process parameters might be caused by unnatural patterns arising from faults, non-conforming products, low-quality raw materials and so forth (Montgomery 2019; Gupta et al. 2006). In on-line monitoring (Phase II analysis), the instability of a process, which should be identified as early as possible, is declared with an out-of-control (OOC) signal. Performance is usually evaluated in terms of the number of points plotted on a control chart before an OOC signal, known as the average run length (ARL). For a fair evaluation, the ARL value for the IC process, referred to as ARL0, is fixed at a constant desired value, and the charts endeavour to provide the minimum ARL value in the OOC condition, called ARL1 (Yeganeh et al. 2023). Note that the greater the ARL1, the weaker the detection ability of a control chart (Montgomery 2019). In addition to the ARL, control chart performance metrics include the performance comparison index (PCI), standard deviation of run length (SDRL), relative average run length (RARL) and extra quadratic loss (EQL) (Riaz et al. 2014).
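Run-length metrics such as the ARL and SDRL are typically estimated by Monte Carlo simulation. The following minimal sketch illustrates the idea with a hypothetical chi-square charting statistic and a purely illustrative, uncalibrated upper limit; the function names are ours, not from any cited work:

```python
import numpy as np

def simulate_run_length(statistic_fn, ucl, rng, max_len=100_000):
    """Count the points plotted until the first one exceeds the UCL."""
    for t in range(1, max_len + 1):
        if statistic_fn(rng) > ucl:
            return t
    return max_len  # truncated run; rare when the UCL is calibrated

def arl_sdrl(statistic_fn, ucl, n_runs=2_000, seed=1):
    """Estimate ARL and SDRL of a chart from simulated run lengths."""
    rng = np.random.default_rng(seed)
    rls = np.array([simulate_run_length(statistic_fn, ucl, rng)
                    for _ in range(n_runs)])
    return rls.mean(), rls.std(ddof=1)

# Toy example: a chi-square statistic of two IC standard normals,
# with an illustrative UCL of 10 (not a calibrated limit).
stat = lambda rng: float(np.sum(rng.standard_normal(2) ** 2))
arl0, sdrl0 = arl_sdrl(stat, ucl=10.0)
```

In an actual Phase II study, `statistic_fn` would generate a profile, estimate its parameters, and return the charting statistic; the UCL is then tuned so that the estimated IC ARL hits the desired ARL0.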

To construct a control chart in SPC, either in Phase I or Phase II, two different approaches can be used: (i) investigating the distribution function of a single or multiple quality characteristics; (ii) checking the stability over time of a functional relationship between a dependent (response) variable and one or several independent (explanatory) variables. The term profile monitoring refers to the use of SPC techniques to investigate the stability of such a functional relationship over time, instead of monitoring a single or multivariate quality characteristic.

This topic was first introduced under the term ‘signature’ by Gardner et al. (1997), and the term ‘profile’ became more commonplace after exponentially weighted moving average (EWMA) control charts were extended to monitoring simple linear profiles, in which there is a linear relationship between a response and an explanatory variable (Kang and Albin 2000). Since then, many researchers have focused on monitoring linear profiles; see, for example, Gupta et al. (2006), Zou et al. (2007), Huwang et al. (2014), Motasemi et al. (2017), Haq (2020), Yeganeh et al. (2021) and the references therein. Other researchers have focused on monitoring other types of profiles, including nonlinear (Williams et al. 2007), roundness (Pacella and Semeraro 2011), exponential (Steiner et al. 2016), circular (Zhao et al. 2020), multi-channel autoregressive (Zhou et al. 2022) and non-parametric (Jones et al. 2020; Zou et al. 2008; Nassar and Abdel-Salam 2021) profiles.

The monitoring schemes mentioned above rest on two restrictive assumptions. First, the response variable is assumed to be continuous, whereas it may have a discrete form; for example, it may represent the number of defects (or the percentage of defective items) per unit. Second, it is usually assumed that the random errors, and therefore the response variables, follow the normal distribution; in many real-life applications, the normality assumption is violated. Considering these limitations, generalized linear models (GLMs), which constitute a large class of statistical models relating responses to linear combinations of predictor variables, have been extended to the profile monitoring regime. Two major categories of GLMs have had the most applications in the literature: logistic and Poisson profiles. For more details on the former, readers are referred to the works of Yeh et al. (2009), Shang et al. (2011), Huwang et al. (2016), Alevizakos et al. (2019a) and Mohammadzadeh et al. (2021). Recently, many studies have focused on monitoring Poisson profiles. For instance, Zhou et al. (2012) proposed an EWMA chart with random sample sizes, observing that a novel updating formulation not only yields more robust IC and OOC performance, but also a chart that is generally more sensitive to small and moderate shifts. Phase I monitoring of Poisson profiles was carried out by Amiri et al. (2015), who developed three different schemes based on the likelihood ratio test (LRT), Hotelling’s T2 and F statistics. To extend this work to Phase II, Qi et al. (2016) proposed the weighted LRT (WLRT) scheme by combining the EWMA and LRT statistics. They also evaluated the LRT, LRT-EWMA and multivariate EWMA (MEWMA) control charts in Phase II applications of Poisson profiles under the assumptions of fixed and random explanatory variables. The results showed the superiority of the WLRT over its competitors.
Later, Qi et al. (2017) extended the WLRT approach to autocorrelated processes. A change point statistic was developed by Shadman et al. (2017) for GLM profiles, building on an approach previously applied to the efficient monitoring of linear profile parameters (Xu et al. 2012). By considering every sample as a candidate change point, they computed the LRT statistic for the two groups of samples, i.e. IC (before the change point) and OOC (after the change point), for both logistic and Poisson profiles. More recently, the change point approach was also implemented for autocorrelated Poisson profiles by He et al. (2020), who, in addition to autocorrelation, assumed random explanatory variables. Another LRT control chart for profiles with random predictors and autocorrelation was proposed by Song et al. (2021). Similar to Shadman et al. (2017), Shang et al. (2018) extended MEWMA charts to Poisson and logistic profiles assuming no prior information about the process in OOC situations (i.e. non-parametric models). In other words, in parametric models only the profile parameters change while the type of relationship is fixed; in the non-parametric setting, the relationship can shift to another type without any limitations, for example, a linear IC profile changing to a non-linear OOC profile. Some remedial methods for parameter estimation in Poisson profiles and for computation of the process capability index can be found in Maleki et al. (2019) and Alevizakos et al. (2019b). A non-parametric approach to the generalized likelihood ratio and EWMA schemes for a real case study can be found in Wang et al. (2022).

An investigation of the aforementioned literature and the existing review papers in this field reveals that little attention has been given to machine learning approaches in comparison to statistical approaches, not only for GLM profiles, but for all types of profile monitoring (Maleki et al. 2018; Woodall 2007). To the best of the authors’ knowledge, artificial neural networks (ANNs) have only been used for profile monitoring in Hosseinifard et al. (2011), Pacella and Semeraro (2011), Yeganeh et al. (2022a) and Yeganeh and Shadman (2020). Li et al. (2019) used the support vector regression (SVR) technique for function fitting in the non-linear profile monitoring process. Autoencoders and transfer learning, which are deep learning techniques, have also been developed for autocorrelated (Chen et al. 2020) and multiple profiles (Fallahdizcheh and Wang 2022). One of the main reasons for this reluctance may be the weaker performance of machine learning techniques relative to statistical approaches in terms of the ARL. For example, the ANN-based control chart proposed by Hosseinifard et al. (2011) was not able to improve on the performance of conventional EWMA control charts for detecting most shifts in a simple linear profile model. To remedy this weakness, Yeganeh and Shadman (2020) improved the performance of Hosseinifard et al. (2011)’s control chart using supplementary run-rules, but no modification was made to the structure of the ANN of Hosseinifard et al. (2011). The same can be said about the other machine learning-based control charts mentioned above; in other words, they used a simple conventional structure of ANN or SVR, which may be one of the reasons for their weak performance. Another concern about machine learning techniques relates to their complexity: although machine learning techniques are more complex and often considered a “black box”, they can produce more accurate results (Cuentas et al. 2022). Moreover, with the rapid development of digital technologies, the complexity of machine learning techniques, in particular deep learning models, is becoming less important over time in real-world applications, where several complicated on-line models, such as image processing, computer vision and speech recognition systems, have been developed and can easily be applied in practice (Chen et al. 2020). In addition, interpreting the predictions of machine learning methods has been studied recently (Pourpanah et al. 2016); although this is not the focus of this study, it is an interesting area that requires further investigation.

Considering the limited use of machine learning techniques in profile monitoring, this paper introduces a novel SVR structure as a control chart for monitoring Poisson profiles in Phase II of SPC. The aim of this study is to develop a control chart with quicker detection ability than conventional schemes for the Poisson profile monitoring problem, which is equivalent to a better-optimized process with fewer non-conforming products, lower cost, less waste and fewer other undesirable outputs. To achieve this, we first define and extract more informative input features and then feed them into a well-known machine learning technique, i.e. SVR, for training in an offline manner. Finally, the trained models can be deployed to monitor the process online and detect any OOC situations. This approach makes considerable contributions not only in the input features of the SVR but also in the training procedure, to enhance the sensitivity in detecting OOC situations. In addition to improving the detection ability for a Poisson process with machine learning, the other contributions of this paper can be summarized as follows:

  • Developing a novel structure of SVR as a base control chart.

  • Introducing a new input layer structure for the proposed and other related schemes.

  • Taking advantage of a novel training procedure for the proposed SVR.

  • Enhancing the detection ability of the proposed charting technique in comparison with ANN.

  • Evaluating the performance of the proposed scheme under parametric and non-parametric scenarios.

  • Using the diagnosis procedure in Poisson profiles with SVR.

The rest of this paper is structured as follows: the fundamental framework of the Poisson profile model used in Phase II monitoring and the formulations of two fundamental control chart schemes are briefly presented in Sect. 2. In addition, a brief introduction to some important concepts of SVR is given in Sect. 2, including the principles of evolved SVRs and a description of the particle swarm optimization (PSO) algorithm. Section 3 provides a full description of the proposed SVR-based control chart. Section 4 investigates the performance of the proposed scheme in terms of the ARL and SDRL. Comparisons with existing counterparts are presented in Sect. 5. Section 6 presents the diagnosis procedure of the proposed approach, while Sect. 7 provides an illustrative example. Finally, the conclusion, recommendations and future research directions are presented in Sect. 8.

2 Preliminaries

2.1 Phase II Poisson profile monitoring

Assume that n observations are collected for the jth (j = 1, 2, …) random profile. Let (xijk, yij) represent the pairs of observations from the jth random profile in the GLM setting, with explanatory vectors Xij = (xij1, xij2, …, xijp), where i = 1, 2, …, n and k = 1, 2, …, p. Since in fixed-design monitoring the explanatory variables are assumed to be constant in each profile, the index j is omitted from the explanatory variables, so that Xij = Xi \(\forall j\). Thus, they can be written as an n × p matrix denoted by \(\tilde{\varvec{X}}\) and defined by

$$\tilde{\varvec{X}} = \left( {\begin{array}{*{20}c} {{\varvec{X}}_{{\varvec{1}}} } \\ \vdots \\ {{\varvec{X}}_{{\varvec{n}}} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {x_{11} } & {x_{12} } & \cdots & {x_{1p} } \\ \vdots & \vdots & \ddots & \vdots \\ {x_{n1} } & {x_{n2} } & \cdots & {x_{np} } \\ \end{array} } \right).$$
(1)

A GLM model in the jth sample consists of the following three main components:

  (i) The 1 × n response vector Yj = (y1j, y2j, …, ynj) of discrete observations, with mean \(\mu_{ij} = E\left( {y_{ij} |x_{i1} ,x_{i2} , \ldots ,x_{ip} } \right)\), whose components belong to the same distribution from the exponential family (e.g., the Poisson or binomial distribution). Given the independence of observations within and between profiles, we have µj = (µ1j, µ2j, …, µnj) for the jth profile.

  (ii) The matrix of independent variables, which is the same as in (1).

  (iii) The monotone link function g that connects the mean of the response variable to the linear combination of the predictors, so that \(g({\varvec{\mu}}_{j}) = {\varvec{\eta}}_{j} = \tilde{\varvec{X}}{\varvec{\beta}}_{j}\), where ηj is the linear combination of the jth profile parameters (βj). From this framework, a Poisson GLM is given by:

    $$\begin{gathered} {\varvec{Y}}_{{\varvec{j}}} \sim Poisson\left( {{\varvec{\mu}}_{{\varvec{j}}} } \right),\quad j = 1,2, \ldots , \hfill \\ \log \left( {{\varvec{\mu}}_{{\varvec{j}}} } \right) = \tilde{\varvec{X}}{\varvec{\beta}}_{{\varvec{j}}} . \hfill \\ \end{gathered}$$
    (2)

The parameters of model (2), whose estimates are denoted by \(\mathop {\varvec{\beta}_{j}}\limits^{\frown } = \left( \mathop {\beta_{1j}}\limits^{\frown } ,\mathop {\beta_{2j}} \limits^{\frown },...,\mathop {\beta _{pj}} \limits^{\frown } \right)\), are estimated with the iterative weighted least squares (IWLS) algorithm in this paper. To save space, the algorithm is not included here; for more details, readers are referred to Yeh et al. (2009) and Amiri et al. (2015). Hence, the aim of monitoring Poisson profiles is to detect changes in \({\varvec{\beta}}_{j}\) from its IC value, denoted by β0 = (β10, β20, …, βp0). Note that monitoring explanatory variables in GLM profiles is as important as monitoring the response variable (see, for example, Shang et al. 2011), but it is not the focus of this paper.
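For illustration, a bare-bones version of the IWLS iteration for a Poisson GLM with log link might look as follows; the design matrix, true coefficients and sample size are hypothetical, and production code would normally call a library GLM fitter instead:

```python
import numpy as np

def poisson_irls(X, y, tol=1e-8, max_iter=100):
    """Estimate beta in log(mu) = X @ beta by iteratively
    (re)weighted least squares for a Poisson response."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta)            # current fitted means
        W = mu                           # Poisson working weights
        z = X @ beta + (y - mu) / mu     # working response
        XtW = X.T * W                    # X' W (W diagonal)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Hypothetical intercept-plus-slope design with beta0 = (1, 2)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), np.linspace(0.0, 1.0, 50)])
beta_true = np.array([1.0, 2.0])
y = rng.poisson(np.exp(X @ beta_true))
beta_hat = poisson_irls(X, y)
```

Note that the weight matrix \(W\) appearing here is the same diagonal matrix of means used later in the scaling matrix S.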

2.2 Existing control charts

In this subsection, the details of the two existing fundamental approaches based on the MEWMA and LRT schemes are provided (Qi et al. 2016). To simultaneously control the p-dimensional IC coefficient vector β0, its estimate \(\left( {\mathop {{\varvec{\beta}} _{j} }\limits^{\frown } } \right)\) is scaled as follows:

$${\varvec{Z}}_{j} = {\varvec{S}}\left( {\mathop {{\beta}} \limits^{\frown }_{j} - {\varvec{\beta}}_{0} } \right),$$
(3)

with \({\varvec{S}} = \left( {\tilde{\varvec{X}^{\prime}}W\tilde{\varvec{X}}} \right)^{\frac{1}{2}}\) and \(\mathop {{\varvec{\beta}} _{j} }\limits^{\frown } = \left( \mathop {\beta _{{1j}} }\limits^{\frown } ,\mathop {\beta _{{2j}} }\limits^{\frown } ,...,\mathop {\beta _{{pj}} }\limits^{\frown } \right)\), where S is a p × p symmetric matrix and β0 is the IC \(p\)-dimensional parameter vector. Considering µ0 = (µ10, µ20, …, µn0) and \(log\left( {\mu_{i0} } \right) = \varvec{X^{\prime}}_{i} {\varvec{\beta}}_{{\varvec{0}}}\), W is an n × n diagonal matrix with main diagonal elements µ10, µ20, …, µn0. It is worth mentioning that S depends on \(\tilde{\varvec{X}}\) in Eq. (1); hence, when the explanatory variables are not constant across profiles, the matrix varies from profile to profile and Sj is used instead of S.

From (3), the EWMA statistic for the scaled p-dimensional parameters vector is defined as:

$${\varvec{E}}_{j} = \lambda {\varvec{Z}}_{j} + (1 - \lambda ){\varvec{E}}_{j - 1} ,j = 1,2,...,$$
(4)

where E0 is a p-dimensional vector of zeros and λ (0 < λ < 1) is the EWMA constant (or smoothing parameter), set to 0.2 in this paper. The MEWMA statistic is then given by:

$$M_{j} = \varvec{E^{\prime}}_{j} {\varvec{E}}_{j} .$$
(5)
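Putting Eqs. (3)–(5) together, the MEWMA statistic for a stream of estimated coefficient vectors can be sketched as below; the identity scaling matrix and the simulated estimates are purely illustrative stand-ins for S and the IWLS output:

```python
import numpy as np

def mewma_statistics(beta_hats, beta0, S, lam=0.2):
    """Compute the MEWMA statistics M_j of Eqs. (3)-(5) for a
    sequence of estimated coefficient vectors (one row per profile)."""
    E = np.zeros(len(beta0))              # E_0: vector of zeros
    M = []
    for beta_hat in beta_hats:
        Z = S @ (beta_hat - beta0)        # Eq. (3): scaled deviation
        E = lam * Z + (1 - lam) * E       # Eq. (4): EWMA recursion
        M.append(float(E @ E))            # Eq. (5): squared norm
    return np.array(M)

# Hypothetical two-parameter example with S = identity for simplicity
rng = np.random.default_rng(2)
beta0 = np.array([1.0, 2.0])
beta_hats = beta0 + 0.1 * rng.standard_normal((20, 2))
M = mewma_statistics(beta_hats, beta0, np.eye(2))
```

In the actual chart, each row of `beta_hats` would come from the IWLS fit of one Phase II profile and S from the IC design and weight matrices.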

For more details on the LRT formulation for Poisson profiles, readers are referred to Amiri et al. (2015). By taking the logarithm of the joint likelihood function of the independent observations, the LRT statistic is constructed as:

$$LRT_{j} = 2\left( {l_{j} \left( {\mathop {\varvec{\beta}} \limits^{\frown }_{j} } \right) - l_{j} \left( {{\varvec{\beta}}_{0} } \right)} \right),$$
(6)
$$\begin{aligned} l_{j} \left( {\mathop {{\varvec{\beta}} _{j} }\limits^{\frown } } \right) = & \sum\limits_{{i = 1}}^{n} {y_{{ij}} \log \left( {\mu _{{ij}} } \right)} - \sum\limits_{{i = 1}}^{n} {\mu _{{ij}} } - \sum\limits_{{i = 1}}^{n} {\log \left( {y_{{ij}} !} \right)} , \\ l_{j} \left( {{\varvec{\beta}} _{0} } \right) = & \sum\limits_{{i = 1}}^{n} {y_{{ij}} \log \left( {\mu _{{i0}} } \right)} - \sum\limits_{{i = 1}}^{n} {\mu _{{i0}} } - \sum\limits_{{i = 1}}^{n} {\log \left( {y_{{ij}} !} \right)} , \\ \mu _{{ij}} = & e^{{{\varvec{X}}_{i}^{\prime } \mathop {{\varvec{\beta}} _{j} }\limits^{\frown } }} , \\ \mu _{{i0}} = & e^{{{\varvec{X}}_{i}^{\prime } {\varvec{\beta}} _{0} }} . \\ \end{aligned}$$

In the LRT scheme, lj(·) denotes the log-likelihood function, which evaluates the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters; in (6) it is evaluated at both the IC parameters and the estimated profile parameters.
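A compact sketch of the LRT computation in (6) follows; the log(yij!) terms cancel in the difference and can be omitted. The intercept-only design is a hypothetical example, chosen because its MLE has the closed form log(ȳ):

```python
import numpy as np

def poisson_loglik(X, y, beta):
    """Poisson log-likelihood, omitting the log(y!) terms,
    which cancel in the LRT difference of Eq. (6)."""
    eta = X @ beta
    return float(np.sum(y * eta - np.exp(eta)))

def lrt_statistic(X, y, beta_hat, beta0):
    """LRT_j = 2 * ( l_j(beta_hat) - l_j(beta0) )."""
    return 2.0 * (poisson_loglik(X, y, beta_hat)
                  - poisson_loglik(X, y, beta0))

# Intercept-only example: the MLE of the single coefficient is log(y-bar)
rng = np.random.default_rng(3)
X = np.ones((30, 1))
y = rng.poisson(np.exp(1.0), size=30)
beta_hat = np.array([np.log(y.mean())])
lrt = lrt_statistic(X, y, beta_hat, np.array([1.0]))
```

Since `beta_hat` maximizes the likelihood here, the resulting statistic is non-negative, as the LRT construction requires.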

2.3 SVR formulation

In 1995, an innovative machine learning method called the support vector machine (SVM) was introduced by Vapnik (1995) to rectify drawbacks of ANN methods, especially in classification problems. The idea of the SVM is to minimize the training error through empirical or structural risk minimization. To this end, the features of a nonlinear problem are mapped to another hyperplane with the aim of maximizing the geometric margins and minimizing the classification error. Although the SVM has established itself as a powerful method for supervised classification problems, its general form only solves binary classification problems, and some adaptations are required for multi-class classification and regression problems. In this paper, the SVM for regression (hereafter, SVR) is used, and it is briefly described in this section. For more details, the interested reader is referred to Vapnik (1995), Cortes and Vapnik (1995), Vapnik (1998) and Stoean and Stoean (2014).

As with other supervised learning techniques, a training dataset is first prepared. For simplicity, but without loss of generality, we denote the inputs and targets by Bg and Tg respectively, where g indexes the samples. It is assumed that the inputs and targets are continuous values of dimension U and 1, respectively, and that there are G samples in the training dataset. Hence, we have a G × (U + 1) dataset, which can be written as (Bg, Tg); g = 1, 2, …, G. Conventional SVM and SVR usually utilise the following formulation to establish a relationship between the inputs and outputs (estimated targets):

$$f\left( {B_{g} } \right) = w\phi \left( {B_{g} } \right) + b.$$
(7)

In (7), a predefined kernel function ϕ(·) in combination with some weights (w) and bias (b) are used to carry out the mapping tasks and generally, the aim of training is to reach the best values for weights and bias. To obtain the weights and bias, a soft margin (i.e., a possible acceptable interval) is defined as

$$- \varepsilon - \xi_{g}^{ - } \le f\left( {B_{g} } \right) - T_{g} \le \varepsilon + \xi_{g}^{ + } ,$$
(8)

where ε is the acceptable absolute difference between the target values and the estimated ones, while \(\xi_{g}^{+}\) and \(\xi_{g}^{-}\) are slack variables representing the loss generated by the gth sample of the training dataset. In other words, the loss function is defined as:

$$Loss\left( {f\left( {B_{g} } \right),T_{g} } \right) = \left\{ {\begin{array}{*{20}c} {0\quad \quad \quad \quad \quad \;\;} & {\left| {f\left( {B_{g} } \right) - T_{g} } \right| \le \varepsilon } \\ {\left| {f\left( {B_{g} } \right) - T_{g} } \right| - \varepsilon } & {{\text{otherwise}}\quad \quad \quad } \\ \end{array} } \right.$$
(9)
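The ε-insensitive loss in (9) is a direct one-line transcription: zero inside the ε-tube and linear outside it.

```python
import numpy as np

def eps_insensitive_loss(pred, target, eps):
    """Eq. (9): zero inside the epsilon-tube, linear outside it."""
    return np.maximum(np.abs(np.asarray(pred) - target) - eps, 0.0)
```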

From (9), considering the principle of structural risk minimization, the following minimization problem leads to an optimum hyperplane or weights:

$$\begin{gathered} \mathop {\min }\limits_{{\left( {w,\xi_{g}^{ + } ,\xi_{g}^{ - } } \right)}} \,\,\frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{g = 1}^{G} {\left( {\xi_{g}^{ + } + \xi_{g}^{ - } } \right)} \hfill \\ {\text{subject}}\,{\text{to}}\left\{ {\begin{array}{*{20}c} { - f\left( {B_{g} } \right) + T_{g} + \varepsilon + \xi_{g}^{ + } \ge 0} & {\forall g{,}} \\ {f\left( {B_{g} } \right) - T_{g} + \varepsilon + \xi_{g}^{ - } \ge 0} & {\forall g{,}} \\ {\xi_{g}^{ - } \ge 0{,}\xi_{g}^{ + } \ge 0} & {\forall g} \\ \end{array} } \right. \hfill \\ \end{gathered}$$
(10)

Because of the complexity of the above model, the dual form of (10) is often used in SVR training instead of the primal model, which also removes the bias term (b) from the objective function. The dual optimization problem, in which the Karush–Kuhn–Tucker (KKT) conditions are incorporated into the constraints, is defined as follows:

$$\begin{gathered} \mathop {\min }\limits_{{\left( {\alpha_{g}^{ + } ,\alpha_{g}^{ - } } \right)}} \,\,\frac{1}{2}\left( {\alpha^{\prime}H\alpha } \right) + \tilde{q}\alpha \hfill \\ {\text{subject}}\,{\text{to}}\left\{ \begin{gathered} \sum\limits_{g = 1}^{G} {\left( {\alpha_{g}^{ + } - \alpha_{g}^{ - } } \right) = 0{,}} \hfill \\ 0 \le \alpha_{g}^{ + } \le C\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\forall g{,} \hfill \\ 0 \le \alpha_{g}^{ - } \le C\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\forall g{,} \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered}$$
(11)

where \(\alpha = \left[ {\alpha_{1}^{ + } ,\alpha_{2}^{ + } ,...,\alpha_{G}^{ + } ,\alpha_{1}^{ - } ,\alpha_{2}^{ - } ,...,\alpha_{G}^{ - } } \right]^{\prime}\) is a 2G × 1 vector of decision variables and \(\tilde{q}\) = [− T1 + ε, − T2 + ε, …, − TG + ε, T1 + ε, T2 + ε, …, TG + ε] is a 1 × 2G vector. Also, \(H = \left( {\begin{array}{*{20}c} h & { - h} \\ { - h} & h \\ \end{array} } \right)\) is a 2G × 2G kernel matrix with h(a,b) = ϕ(Ba, Bb). Hence, the optimization problem in (11) has 2G variables. Quadratic programming algorithms such as the kernel adatron (KA), sequential minimal optimization (SMO), iterative single data algorithm (ISDA) and so forth can be used to solve this problem. Then, the weights and bias, or equivalently, the estimate for each observation, are obtained using the following relations:

$$\begin{aligned} S = & \left\{ {g|0 < \alpha_{g}^{ + } - \alpha_{g}^{ - } < C} \right\}, \\ b = & \sum\limits_{s \in S}^{{}} {T_{s} - \left( {\sum\limits_{s \in S} {\left( {\alpha_{s}^{ + } - \alpha_{s}^{ - } } \right)\phi \left( {B_{g} ,B_{s} } \right) - \left( {\varepsilon \times sign\left( {\alpha_{s}^{ + } - \alpha_{s}^{ - } } \right)} \right)} } \right),} \\ f\left( {B_{g} } \right) = & \sum\limits_{s \in S}^{{}} {\left( {\alpha_{s}^{ + } - \alpha_{s}^{ - } } \right)\phi \left( {B_{g} ,B_{s} } \right) + b} , \\ g = & 1,2,...,G. \\ \end{aligned}$$
(12)

In (12), S is the support vector set, which is usually a small subset of the training dataset. The number of support vectors depends on the hyperparameters, including C, ε and the kernel function, as well as the structure of the problem. It governs model accuracy through a trade-off between a high-complexity model (more support vectors), which may over-fit the data, and a large margin (fewer support vectors), which will misfit some of the training data in the interest of better generalization.
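In practice, the dual problem (11) need not be coded by hand; off-the-shelf solvers, such as the SMO-based implementation behind scikit-learn's `SVR`, handle it internally. A minimal sketch follows, in which the data and hyperparameter values are illustrative only:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical G x U training inputs and noisy scalar targets
rng = np.random.default_rng(4)
B = rng.uniform(-1.0, 1.0, size=(80, 2))
T = np.sin(np.pi * B[:, 0]) + 0.1 * rng.standard_normal(80)

# C and epsilon play the same roles as in (10)-(11); the RBF
# kernel corresponds to the mapping phi.
model = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(B, T)

n_support = len(model.support_)   # size of the support vector set S
pred = model.predict(B)
```

Tightening `epsilon` or raising `C` typically enlarges the support vector set, mirroring the complexity/margin trade-off described above.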

2.4 Evolutionary SVR

The combination of meta-heuristic and evolutionary algorithms (EAs) with machine learning techniques has received a great deal of attention in the past decade. The main aim of this hybridisation is to use an EA in the training or parameter tuning of a machine learning technique (Ojha et al. 2017). In pioneering work, Kim and Cho (2008) proposed an evolutionary neural network based on the genetic algorithm (GA), in which a speciation-based model was established through fitness sharing and the ANN was then incorporated via a behaviour knowledge space method. Owing to promising results, several other studies have investigated the performance of different EAs, such as the extended marine predators algorithm (EMPA), gradient-based optimization (GBO), moth-flame optimization (MFO) and the water cycle optimization algorithm (WCA) (Adnan et al. 2021a; Ikram et al. 2022a; Kadkhodazadeh and Farzin 2022). The integration of several EAs has also recently been explored in the literature; for example, Adnan et al. (2021b) implemented a combination of the PSO and grey wolf optimization (GWO) algorithms in the training of the extreme learning machine (ELM) technique.

Similarly, evolutionary SVR (ESVR) refers to the hybridization of an EA with SVR. This paradigm has been well received in the literature and can be categorized into three groups. The first group used EAs for the hyperparameter optimization of SVR; see, for instance, Adnan et al. (2022), Wang and Du (2014), Ikram et al. (2022b) and Al-Zoubi et al. (2021), in whose studies the optimum values for C, ε and the kernel function were acquired with different EAs. In the second group, EAs performed feature selection in combination with parameter optimization in the SVR training (Al-Zoubi et al. 2018; Ziani et al. 2017). In line with this paper’s objective, in the third group, researchers have applied EAs, instead of common quadratic programming solvers, to either the primal or dual problem (see Eqs. (10) and (11)). For example, Arana-Daniel et al. (2016), Zhang et al. (2016) and Dantas Dias and Rocha Neto (2017) used EAs, including the GA, differential evolution (DE), PSO and simulated annealing (SA), for support vector identification. One may ask why EAs would be preferred over common quadratic solvers. Arana-Daniel et al. (2016) and Dantas Dias and Rocha Neto (2017) reported that EAs have lower computational complexity and are easier to implement than the alternatives. That being said, there is no definitive answer to this question, as it depends on the nature of the problem.

2.5 PSO algorithm

Note that an EA is used in the third group of ESVR approaches mentioned in the previous subsection; in other words, the dual problem defined in (11) is solved with PSO, one of the most versatile algorithms. A survey of the related literature revealed that GA and PSO are the most widespread approaches for this application, with PSO better suited than GA to continuous variables; for example, the reasonable accuracy of PSO in comparison with some other EAs was reported in Dantas Dias and Rocha Neto (2017). Also, our simulations revealed that PSO achieved better accuracy than some other common EA techniques (a brief excerpt of the results is presented in the sensitivity analysis section). Therefore, in this paper, PSO is used to solve the SVR optimization problem.

The PSO idea was inspired by the migration of a flock of birds, in which individual knowledge and performance are determined with respect to the whole population. As is the general procedure in metaheuristic algorithms, the best solution is acquired by generating superior solutions from a specific population. Each candidate solution (sometimes called a particle) in the PSO algorithm has a position and a velocity, updated based on its own best solution, the current global best solution, and some random parameters. Its ability to steer particles toward the best position has made PSO an excellent evolutionary algorithm for continuous or nonlinear optimization problems. For more details about the PSO updating relations, readers are referred to Kennedy (2010). Some hyperparameters must be assigned in the PSO algorithm, including (i) the population size, denoted here by npopPSO, (ii) the number of iterations (maxItPSO), and (iii)–(iv) two coefficients weighting, respectively, the difference between a particle’s current position and its own best position, and the difference between its current position and the best position among all solutions.
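The mechanics just described can be sketched as follows; the hyperparameter values are illustrative defaults (not the tuned values used later), and the sphere function stands in for the actual SVR dual objective:

```python
import numpy as np

def pso_minimize(f, dim, n_pop=20, max_it=150, w=0.7, c1=1.5, c2=1.5,
                 bounds=(-5.0, 5.0), seed=0):
    """Minimal PSO loop: each particle is pulled toward its own best
    position (coefficient c1) and the global best (coefficient c2)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_pop, dim))          # positions
    v = np.zeros((n_pop, dim))                     # velocities
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    g = pbest[np.argmin(pbest_val)].copy()         # global best
    for _ in range(max_it):
        r1 = rng.random((n_pop, dim))
        r2 = rng.random((n_pop, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        g = pbest[np.argmin(pbest_val)].copy()
    return g, float(np.min(pbest_val))

# Sphere function: minimum value 0 at the origin
best_x, best_val = pso_minimize(lambda z: float(np.sum(z ** 2)), dim=3)
```

For the ESVR, `f` would be the dual objective of (11) evaluated on a candidate α, with the constraints handled by clipping or penalties.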

Suppose our proposed method, hereafter denoted ESVR, is available after the training procedure. In general, some features or characteristics of a generated profile are extracted and imported into the ESVR, and the condition of the process (IC or OOC) is identified from its output. In this paper, the ESVR output, denoted by Oj, is compared with a predefined cutting value (CV), which plays the role of the upper control limit (UCL) in common control charts. With our training procedure, there is no need to define a lower control limit (LCL); this is consistent with some previous works, for example Hosseinifard et al. (2011), who set the LCL at 0. For better understanding, Fig. 1 depicts the conceptual model of the ESVR for deciding on the process condition.

Fig. 1
figure 1

Basic model of ESVR for determination of process condition at the jth generated profile in Phase II

The determination of the input features is the next key step; however, it received little attention in previous machine learning based control charts. For example, only the parameter estimates were imported as input features in Hosseinifard et al. (2011), although adding other input features can significantly improve the detection of OOC profiles in general. Proper input features should not only reflect the properties of the process parameters but also embrace the effect of the former samples in the contemporaneous statistic. As one of the contributions of this study, the input features of the jth generated profile consist of four major groups:

  • The normalized estimated parameters: After estimating the parameters by the IWLS algorithm and considering the p-dimensional normal approximation \(\mathop {{\beta}} \limits^{\frown }_{j} \sim N_{p} \left( {{\varvec{\beta}}_{0} ,\left( {\tilde{{{\varvec{X}}^{\prime } }}W\tilde{{\varvec{X}}}} \right)^{ - 1} } \right)\) of Yeh et al. (2009), the parameter estimates are scaled through (13) (more details are provided in Sect. 4.2 of Johnson and Wichern (2007)):

    $$\begin{gathered} {\varvec{\beta}} _{{\varvec{j}}}^{\prime } = \left( {\tilde{{\varvec{X}}}^{\prime } W\tilde{{\varvec{X}}}} \right)^{{ - \frac{1}{2}}} \times \left( {\mathop {{\varvec{\beta}} _{{\varvec{j}}} }\limits^{\frown } - {\varvec{\beta}} _{0} } \right)^{T} , \hfill \\ {\varvec{\beta}} _{{\varvec{j}}}^{\prime } = \left( {{{\beta}} _{{1j}}^{\prime } ,\beta _{{2j}}^{\prime } ,...,\beta _{{pj}}^{\prime } } \right). \hfill \\ \end{gathered}$$
    (13)
  • The normalized average of the responses: Considering (2), we have yij \(\sim\) Poisson(µij) for i = 1, 2,…, n. Although the exact distribution of the average of the responses in the jth profile is not known, the central limit theorem enables us to scale it as:

    $$\overline{y}_{j}^{\prime } = \frac{{\overline{y}_{j} - \frac{{\sum\limits_{i = 1}^{n} {\mu_{0i} } }}{n}}}{{\frac{{\sqrt {\sum\limits_{i = 1}^{n} {\mu_{0i} } } }}{n}}},$$
    (14)

    in such a way that

    $$\overline{y}_{j} = \frac{{\sum\limits_{i = 1}^{n} {y_{ij} } }}{n},$$
    (15)

    where log(µ0) and µ0 have been defined after (3). These parameters are imported in EWMA form; in other words, an EWMA form of the p + 1 parameters \(\left( {EWMA_{Pj} = \left[ {\beta_{1j}^{\prime } ,\beta_{2j}^{\prime } ,...,\beta_{pj}^{\prime } ,\overline{y}_{j}^{\prime } } \right]} \right)\) is computed for each generated profile as in (4), with the initial values [0, 0,…, 0]p+1.

  • The ratio of MEWMA statistics: The better detection ability of the runs-rules monitoring schemes proposed by Yeganeh and Shadman (2020) and Yeganeh et al. (2021) led us to adopt a similar approach in this paper. They applied the ratio of points as a supplementary tool to increase the chart’s performance. Because of the complexity of designing run-rules schemes, the ratio of points is instead used among the input features. To this end, UCLMEWMA is obtained by specifying a desired ARL0 considering only the MEWMA chart (this has been reported in Table 2 of Qi et al. (2016) for an ARL0 of 370). Then, the MEWMA statistics are computed using (5) up to the jth profile, and the numbers of samples falling in the three regions \(\left( {0,\frac{{UCL_{MEWMA} }}{2}} \right)\), \(\left( {\frac{{UCL_{MEWMA} }}{2},UCL_{MEWMA} } \right)\) and beyond the control limit \(\left( {UCL_{MEWMA} , + \infty } \right)\) are counted and denoted as \(d_{MEWMA}^{(1)}\), \(d_{MEWMA}^{(2)}\) and \(d_{MEWMA}^{(3)}\), respectively. The 1 × 4 vector \(\left( {\left[ {\frac{{d_{MEWMA}^{(1)} }}{j},\frac{{d_{MEWMA}^{(2)} }}{j},\frac{{d_{MEWMA}^{(3)} }}{j},M_{j} } \right]} \right)\) is imported to the ESVR to incorporate the effect of previous samples, in a similar fashion to run-rules.

  • The ratio of LRT statistics: Because the LRT chart outperforms the MEWMA chart in detecting large shifts of Poisson profiles (see, for example, Table 3 of Qi et al. (2016)), this statistic is also added to the input features by defining UCLLRT. Hence, using a similar approach to the previous point and using (6), the 1 × 4 vector \(\left( {\left[ {\frac{{d_{LRT}^{(1)} }}{j},\frac{{d_{LRT}^{(2)} }}{j},\frac{{d_{LRT}^{(3)} }}{j},LRT_{j} } \right]} \right)\) is computed and imported to the ESVR.

By these definitions, ESVR has a (p + 1 + 4 + 4)-dimensional input vector \(\left( {I_{j} = \left[ {EWMA_{Pj} ,\frac{{d_{MEWMA}^{(1)} }}{j},\frac{{d_{MEWMA}^{(2)} }}{j},\frac{{d_{MEWMA}^{(3)} }}{j},M_{j} ,\frac{{d_{LRT}^{(1)} }}{j},\frac{{d_{LRT}^{(2)} }}{j},\frac{{d_{LRT}^{(3)} }}{j},LRT_{j} } \right]} \right)\). Many investigations were conducted to arrive at the above four groups, which proved to be the best input combination for reaching the minimum ARL1; this is discussed further in the sensitivity analysis, where ESVRs with different input structures are compared.
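As an illustration, the assembly of the input vector Ij described above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the IWLS estimate, the charting statistics Mj and LRTj, the control limits and the smoothing constant are taken as given, and all helper names are hypothetical.

```python
import numpy as np

def update_input_vector(beta_hat, y, state, beta0, XtWX, mu0,
                        ucl_mewma, ucl_lrt, m_stat, lrt_stat, lam=0.2):
    """Sketch of assembling the (p+9)-dimensional ESVR input for profile j.

    `state` carries the EWMA vector and the region counters d^(1..3)
    accumulated over the first j profiles; all names are illustrative."""
    j = state["j"] = state["j"] + 1
    n = len(y)

    # (13): scale the IWLS estimates with the inverse square root of X'WX
    eigval, eigvec = np.linalg.eigh(XtWX)
    inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    beta_prime = inv_sqrt @ (beta_hat - beta0)

    # (14)-(15): normalize the mean response via the central limit theorem
    y_bar_prime = (y.mean() - mu0.sum() / n) / (np.sqrt(mu0.sum()) / n)

    # EWMA smoothing of the p+1 normalized features, initialized at zero
    features = np.append(beta_prime, y_bar_prime)
    state["ewma"] = lam * features + (1 - lam) * state["ewma"]

    # region counters: below UCL/2, between UCL/2 and UCL, beyond UCL
    for key, stat, ucl in (("mewma", m_stat, ucl_mewma),
                           ("lrt", lrt_stat, ucl_lrt)):
        d = state[key]
        d[0 if stat < ucl / 2 else 1 if stat < ucl else 2] += 1

    return np.concatenate([state["ewma"],
                           np.array(state["mewma"]) / j, [m_stat],
                           np.array(state["lrt"]) / j, [lrt_stat]])
```

The returned vector stacks the EWMA block, the three MEWMA region ratios with Mj, and the three LRT region ratios with LRTj, matching the ordering of Ij above.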

Also, some interesting observations about the proposed (p + 9)-dimensional input vector (i.e., Ij) of ESVR were obtained from simulations (the results are not reported here to conserve space). First, to estimate W defined after (3), a number of researchers have suggested using the current profile information instead of the IC values (see, e.g., Huwang et al. (2016)); however, our simulation study revealed better results with the IC values, so we computed W with the IC model instead of the current profiles. The same applies to the estimation of \(\overline{y}_{j}^{\prime }\) in (14). The second point concerns the ratio of samples beyond the control limits \(\left( {\frac{{d_{MEWMA}^{(3)} }}{j},\frac{{d_{LRT}^{(3)} }}{j}} \right)\). The main reason for including it in the proposed method is to improve the robustness of ESVR to large simultaneous shifts. In other words, the effects of the MEWMA ratios \(\left( {\frac{{d_{MEWMA}^{(1)} }}{j},\frac{{d_{MEWMA}^{(2)} }}{j}} \right)\), the LRT ratios \(\left( {\frac{{d_{LRT}^{(1)} }}{j},\frac{{d_{LRT}^{(2)} }}{j}} \right)\) and the beyond-control-limit ratios \(\left( {\frac{{d_{MEWMA}^{(3)} }}{j},\frac{{d_{LRT}^{(3)} }}{j}} \right)\) manifest themselves in small, large single and large simultaneous shifts, respectively. As a third point, one may suggest using the WLRT statistic instead of LRT or MEWMA because of its superior performance. This is also possible and might improve the results; however, WLRT needs more computational time because it uses all available samples up to the current time point j, especially for the detection of small shifts.

To conduct the simulations, the (p + 9)-dimensional input vector of ESVR (Ij) is computed for each generated profile and imported to the ESVR. Then, Oj is compared with CV and an OOC signal is triggered when CV < Oj (see Fig. 1). To compute ARL1 and SDRL1, this procedure is iterated in several Monte Carlo simulations and the signalling times are stored as the run lengths (RL). The process of obtaining ARL1 and SDRL1 for a desired shift with MaxIt iterations is illustrated in Pseudocode 1.

figure a
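In the spirit of Pseudocode 1 (a sketch, not the authors' exact pseudocode), the Monte Carlo run-length loop can be written as below; `esvr_output` and `generate_profile` are hypothetical placeholders for the trained ESVR and the profile generator.

```python
import numpy as np

def estimate_arl(esvr_output, generate_profile, cv, max_it=1000, max_rl=10_000):
    """Monte Carlo estimate of the ARL and SDRL for a given shift.

    `esvr_output` maps a profile to O_j; `generate_profile` yields one
    (possibly shifted) profile; both are illustrative placeholders."""
    run_lengths = []
    for _ in range(max_it):
        rl = 0
        while rl < max_rl:
            rl += 1
            o_j = esvr_output(generate_profile())
            if o_j > cv:      # OOC signal when O_j exceeds the cutting value
                break
        run_lengths.append(rl)
    run_lengths = np.asarray(run_lengths, dtype=float)
    return run_lengths.mean(), run_lengths.std(ddof=1)
```

Running the same loop with no shift in `generate_profile` gives the ARL0 estimate used later for calibrating the CV.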

3 Training of the proposed method

In the previous section, it has been assumed that the ESVR has already been trained. To train it, a G × (p + 10) training dataset similar to the one in Hosseinifard et al. (2011) is generated. To this end, 0.5G IC profiles and 0.5G OOC profiles (with some desired shifts) are generated and a (p + 9)-dimensional input vector is computed for each generated profile. The target values for IC and OOC profiles are 0 and 1, respectively; that is, \(T_{1} = T_{2} = ... = T_{\frac{G}{2}} = 0;T_{{1 + \frac{G}{2}}} = T_{{2 + \frac{G}{2}}} = ... = T_{G} = 1\).
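The construction of this training set can be sketched as follows, with hypothetical generator functions standing in for the feature extraction described in the previous section:

```python
import numpy as np

def build_training_data(make_ic_input, make_ooc_input, g=2400):
    """Sketch of the G x (p+10) training set: half IC (target 0), half OOC
    (target 1). `make_ic_input`/`make_ooc_input` are placeholders that each
    return one (p+9)-dimensional input vector."""
    half = g // 2
    inputs = np.vstack([make_ic_input() for _ in range(half)] +
                       [make_ooc_input() for _ in range(half)])
    targets = np.concatenate([np.zeros(half), np.ones(half)])
    return np.column_stack([inputs, targets])   # G rows, p+10 columns
```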

As mentioned previously, the dual problem in (11) is solved using the PSO algorithm, and the optimum (minimum) value of the objective function is reached by assigning values to the 2G variables \(\left( {\alpha = \left[ {\alpha_{1}^{ + } ,\alpha_{2}^{ + } ,...,\alpha_{G}^{ + } ,\alpha_{1}^{ - } ,\alpha_{2}^{ - } ,...,\alpha_{G}^{ - } } \right]} \right)\). However, as one of the contributions of this paper, some slight changes are made to the objective function of (11). This modification follows Zhang et al. (2016), in which an additional coefficient was added to the objective function of the primal problem.

Before identifying the additional terms of the objective function, it is worth noting the challenge posed by the relation between common accuracy criteria, such as the mean square error (MSE), and the ARL when designing control charts based on machine learning techniques. For a usual ANN, SVM, SVR or other machine learning technique, the training process continues until a desired threshold is reached; whereas classical Phase II control charts are evaluated in terms of the ARL, and there is no direct relationship between the two approaches. This challenge has been discussed in detail by Yeganeh and Shadman (2020), who suggested a heuristic solution similar to a design-of-experiments approach for ANN training, which is not the focus of this paper.

Since the values 0 and 1 have been assigned to the IC and OOC profiles as target values, respectively, and the process condition is identified through the CV, it was observed that higher performance (i.e., lower ARL1 for a desired ARL0) occurs when the difference between the OOC and IC estimated target values is at its maximum. In other words, in a common situation, the outputs of an ANN or SVR tend towards 0 and 1 for IC and OOC profiles and the CV is obtained closer to 1 to reach a desired ARL0 (see Hosseinifard et al. (2011)); the greater the difference between the outputs, the lower the ARL1 value. Thus, some criteria are needed to capture the significance of the difference between the IC and OOC ESVR outputs.

To this end, the output of each input in the dataset is obtained using (12). Suppose that it is denoted by \(\hat{T}_{g} ;g = 1,2,...,G\) which is equivalent to f(Bg) given in (12) where \(g = 1,2,...,\frac{G}{2}\) are the predicted IC values and others are for OOC profiles. Therefore, the dual problem can be revised as follows:

$$\begin{gathered} \mathop {\min }\limits_{{\left( {\alpha_{g}^{ + } ,\alpha_{g}^{ - } } \right)}} \,\,MSE + DAVE + DR \hfill \\ {\text{subject}}\,{\text{to}}\left\{ \begin{gathered} \sum\limits_{g = 1}^{G} {\left( {\alpha_{g}^{ + } - \alpha_{g}^{ - } } \right) = 0{,}} \hfill \\ 0 \le \alpha_{g}^{ + } \le C\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\forall g{,} \hfill \\ 0 \le \alpha_{g}^{ - } \le C\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\forall g{,} \hfill \\ \end{gathered} \right. \hfill \\ MSE = \frac{{\sum\limits_{g = 1}^{G} {\left( {T_{g} - \hat{T}_{g} } \right)^{2} } }}{G}{,} \hfill \\ DAVE = \frac{{\sum\limits_{g = 1}^{\frac{G}{2}} {\left( {\hat{T}_{g} } \right)} }}{\frac{G}{2}} - \frac{{\sum\limits_{{g = 1 + \frac{G}{2}}}^{G} {\left( {\hat{T}_{g} } \right)} }}{\frac{G}{2}}{,} \hfill \\ DR = \mathop {range}\limits_{{g = 1,2,...,\frac{G}{2}}} \left( {\hat{T}_{g} } \right)\,\,\, - \mathop {range}\limits_{{g = 1 + \frac{G}{2},2 + \frac{G}{2},...,G}} \left( {\hat{T}_{g} } \right){.} \hfill \\ \end{gathered}$$
(16)
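The three components of (16) can be computed directly from the predicted targets; a minimal sketch, where `np.ptp` gives the range (maximum minus minimum):

```python
import numpy as np

def esvr_objective(t_hat, g):
    """MSE + DAVE + DR as in (16); the first G/2 entries of t_hat are the
    predicted IC targets, the remainder the predicted OOC targets."""
    half = g // 2
    targets = np.concatenate([np.zeros(half), np.ones(half)])
    ic, ooc = t_hat[:half], t_hat[half:]
    mse = np.mean((targets - t_hat) ** 2)
    dave = ic.mean() - ooc.mean()     # pushes OOC outputs above IC outputs
    dr = np.ptp(ic) - np.ptp(ooc)     # difference of the output ranges
    return mse + dave + dr
```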

The proposed objective function consists of three components. The first one is the common MSE criterion, which is frequently utilized in machine learning applications but, as mentioned, cannot by itself lead to the minimum ARL1; hence, two other components, the difference of averages (DAVE) and the difference between the ranges (i.e., maximum–minimum) of the outputs (DR), are appended to the objective function. Simulation studies show that the proposed training approach converges to a solution with minimum ARL1 values or, equivalently, quicker OOC detection. Based on our simulations, the effect of these terms is that the IC and OOC outputs attain the maximum difference, which leads to the minimum ARL1, while the MSE scales the outputs and prevents the output values from growing without bound. Note that the obtained CV can be increased above 1 to reach a specific ARL0.

The ideal values of MSE, DAVE and DR are 0, − 1 and − 1, respectively, when \(\hat{T}_{1} = \hat{T}_{2} = ... = \hat{T}_{\frac{G}{2}} = 0\); \(\hat{T}_{{1 + \frac{G}{2}}} = \hat{T}_{{2 + \frac{G}{2}}} = ... = \hat{T}_{G} = 1\); thus, the best value of the proposed objective function is − 2. Note that the first term of the primal objective function in (11) (i.e. \(0.5(\alpha^{\prime}H\alpha ) + \tilde{q}\alpha\)) may also be included in (16); however, this is not recommended since it is redundant, yielding the same outcome in this condition while adding complexity.

Considering a training dataset with G elements, the optimization problem in (16), with its 2G variables, is solved by PSO. To this end, initial solutions of size npopPSO are randomly generated and updated with the PSO algorithm, where each member of the population is a vector of size 2G. By changing the variables’ values \(\left( {\alpha_{g}^{ + } ,\alpha_{g}^{ - } } \right)\) in (16) during the PSO implementation, the output of each input \(\left( {\hat{T}_{g} ;g = 1,2,...,G} \right)\) is computed by (12); in other words, the function evaluation (obtaining the objective function) is carried out by computing \(\hat{T}_{g}\) for g = 1, 2,…, G. The process terminates when the objective function reaches its ideal value BestSol (i.e., − 2) or the iteration number exceeds maxItPSO. The framework of ESVR training is illustrated in Pseudocode 2.

figure b
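A compact PSO sketch for minimizing (16) over the box 0 ≤ α ≤ C is given below. This is illustrative only: the equality constraint Σ(αg+ − αg−) = 0 is omitted for brevity (it could be handled, e.g., with a penalty term), and all parameter values are assumptions rather than the authors' settings.

```python
import numpy as np

def pso_minimize(objective, dim, c_box, n_pop=30, max_it=100,
                 w=0.7, c1=1.5, c2=1.5, best_sol=-2.0, seed=0):
    """Minimal PSO sketch: each particle is a 2G-vector of
    (alpha+, alpha-) values kept inside the box [0, C]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, c_box, (n_pop, dim))        # positions
    v = np.zeros((n_pop, dim))                     # velocities
    p_best = x.copy()
    p_val = np.array([objective(xi) for xi in x])
    g_best = p_best[p_val.argmin()].copy()
    for _ in range(max_it):
        r1, r2 = rng.random((2, n_pop, dim))
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        x = np.clip(x + v, 0, c_box)               # enforce 0 <= alpha <= C
        vals = np.array([objective(xi) for xi in x])
        improved = vals < p_val
        p_best[improved], p_val[improved] = x[improved], vals[improved]
        g_best = p_best[p_val.argmin()].copy()
        if p_val.min() <= best_sol:                # stop at the ideal value
            break
    return g_best, p_val.min()
```

In the actual training, `objective` would evaluate (16) by computing \(\hat{T}_{g}\) through (12) for every element of the training set.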

After training the ESVR, the CV value is adjusted such that the desired ARL0 is reached. This is done using the algorithm provided in Pseudocode 1, with no shifts applied during profile generation. Note that UCLMEWMA and UCLLRT are constant during the training.
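One simple way to carry out this adjustment is a bisection search on the CV, under the assumption that ARL0 increases monotonically with CV; `arl0_of_cv` below is a hypothetical wrapper that runs Pseudocode 1 with no shift at a given CV.

```python
def calibrate_cv(arl0_of_cv, target=370.0, lo=0.0, hi=5.0, tol=1.0, max_it=40):
    """Bisection sketch for the cutting value: raise CV while the
    in-control ARL is below the target, lower it otherwise."""
    mid = 0.5 * (lo + hi)
    for _ in range(max_it):
        mid = 0.5 * (lo + hi)
        arl0 = arl0_of_cv(mid)
        if abs(arl0 - target) <= tol:
            break
        if arl0 < target:
            lo = mid    # too many false alarms: raise the cutting value
        else:
            hi = mid
    return mid
```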

4 Performance comparisons

Motivated by Qi et al. (2016) and Shang et al. (2018), three different OOC situations, namely a parametric model with fixed explanatory variables, a parametric model with non-fixed explanatory variables and a non-parametric model, have been simulated in this section to evaluate the performance of the proposed approach (the competitors’ results are extracted from the above-mentioned references). The model parameters are provided in Table 1.

Table 1 The preassigned model parameters

Based on Qi et al. (2016), the IC model for the fixed and random design points was assumed to be:

$$\begin{gathered} {\varvec{\beta}}_{0} = [1\,\,1]{,} \hfill \\ \tilde{{\varvec{X}}}^{\prime} = \left( {\begin{array}{*{20}c} 1 & 1 & \cdots & 1 \\ {0.1} & {0.2} & \cdots & 1 \\ \end{array} } \right){,} \hfill \\ n = 10{,}\;p = 2. \hfill \\ \end{gathered}$$
(17)

The OOC profile parameters (denoted as βOOC) were generated such that

$$\begin{gathered} {\varvec{\beta}}_{{{\varvec{OOC}}}} = {\varvec{\beta}}_{{\varvec{0}}} + \Delta {,} \hfill \\ \sigma_{1} = 0.3518{,}\sigma_{2} = 0.5095{,} \hfill \\ \end{gathered}$$
(18)

where \(\Delta = \left( {\delta_{1} \sigma_{1} ,\delta_{2} \sigma_{2} } \right)\) represents the magnitude of the shifts in terms of standard deviations. To generate the training dataset, in addition to 1200 IC profiles, three sets of 400 OOC profiles each were generated with shift magnitudes δ1 = 0.2, δ2 = 0.2 and δ1 = δ2 = 0.2, giving G = 2400.
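Under the IC model (17) and the shift structure (18), a single profile can be generated as sketched below (the Poisson GLM uses the log link as in (2)):

```python
import numpy as np

def generate_profile(delta1=0.0, delta2=0.0, rng=None):
    """Generate one Poisson profile under the IC model (17), optionally
    shifted as in (18): beta_OOC = beta_0 + (delta1*sigma1, delta2*sigma2)."""
    rng = rng or np.random.default_rng()
    x = np.column_stack([np.ones(10), np.arange(0.1, 1.01, 0.1)])  # n=10, p=2
    beta = np.array([1.0, 1.0]) + np.array([delta1 * 0.3518, delta2 * 0.5095])
    mu = np.exp(x @ beta)              # log link of the Poisson GLM
    return rng.poisson(mu)
```

Calling it with `delta1 = delta2 = 0` yields an IC profile; nonzero deltas give the OOC profiles used for training and evaluation.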

4.1 ARL1 values for the fixed design points condition

Four competitors (namely the LRT, MEWMA, LRT-EWMA and WLRT schemes) were compared with the proposed ESVR method. After the training procedure, the CV was set equal to 2.12 by simulation to reach ARL0 = 370 (i.e., by implementing Pseudocode 1 with no shift). Table 2 reports the values of ARL1 and SDRL1 (in parentheses) at different shift magnitudes. Note that boldfaced values denote the best-performing scheme.

Table 2 Comparison of ARL1 (SDRL1) for the Poisson profiles with fixed design points

The proposed ESVR scheme yielded lower values of ARL1 and SDRL1 regardless of the size of the shifts, giving ESVR an advantage over the other competitors. A tangible reduction in the values of ARL1 and SDRL1 can be seen for most of the shifts; for example, the ARL1 values in the first row were 30.1, 44.8, 153, 201 and 365 for the ESVR, WLRT, LRT-EWMA, LRT and MEWMA schemes, respectively. Comparable performance over a wide range of shifts indicates that the training procedure of the ESVR works very well, making ESVR a robust control chart over different types of shifts. In other words, although only one shift magnitude was taught to the ESVR during training, it detected the other OOC shifts quickly as well.

As another finding, the MEWMA and LRT schemes performed worse than the WLRT for most of the shifts, which reveals that the combination of two or more control charts (such as the combination of EWMA and LRT in the construction of WLRT and LRT-EWMA) can increase the performance of the resulting methods (Qi et al. 2017). The WLRT performs much better especially for small and moderate shifts; for instance, when (δ1 = 0.2, δ2 = 0) the ARL1 and SDRL1 of the WLRT are about five times smaller than those of the LRT. This idea can be extended to ESVR schemes to increase their performance. Based on this fact, we can conclude that one of the main reasons for the superior performance of ESVR could be the combination of the LRT and MEWMA statistics.

4.2 ARL1 values for the random design points condition

Similar to Qi et al. (2016), random explanatory variables with n = 9 were generated using the same IC model. To this end, one of the ten design points was selected at random from a discrete uniform distribution over the integers 1 to 10 and deleted, yielding a random design point set with n = 9. To obtain a more robust scheme, a new ESVR was not trained in this case; only the data generation procedure was changed to use the random design points. The results of this case are gathered in Table 3.
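The random-design sampling described above amounts to deleting one of the ten fixed design points uniformly at random; a minimal sketch:

```python
import numpy as np

def random_design_points(rng=None):
    """Sketch of the random-design case: drop one of the ten fixed design
    points 0.1, 0.2, ..., 1.0 uniformly at random, leaving n = 9."""
    rng = rng or np.random.default_rng()
    full = np.round(np.arange(0.1, 1.01, 0.1), 1)
    drop = rng.integers(0, 10)      # index of the deleted design point
    return np.delete(full, drop)
```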

Table 3 Comparison of ARL1 (SDRL1) for the Poisson profiles with random design points

Since the same control limits from the previous subsection were used, the ARL0 is not exactly equal to 370 (the results of the IC situation are shown in the first row). The OOC results revealed the superiority of the newly proposed ESVR over the other competitors in the case of random design points, with properties similar to the fixed design point case. Regardless of the size of the shift, the ESVR performs best in terms of ARL values, followed by the WLRT chart. In terms of SDRL, the ESVR outperformed its competitors for moderate and large shifts, while the WLRT performed best among all competing charts for small shifts; for instance, for the shift of magnitude (δ1 = 0.2, δ2 = 0), the minimum SDRL1, i.e., 45.2, was achieved by the WLRT. The same conclusion can be drawn for smaller shifts. Shang et al. (2011) and Song et al. (2021) hinted at the complexity of monitoring profiles with random design points. Comparing the results of Tables 2 and 3, it can be seen that both ARL1 and SDRL1 are much larger for a random design, which confirms Shang et al. (2011) and Song et al. (2021)’s findings.

4.3 ARL1 values for the non-parametric condition

Non-parametric monitoring refers to OOC conditions in which the type of OOC profile is not known and the IC model can transform into any possible shape. Because the whole profile relationship may change, the OOC situation is usually described through scenarios in non-parametric conditions (Zou et al. 2008; Shang et al. 2018; Abbasi et al. 2022). Note that there is no prior research on non-parametric Poisson profiles, so all the values here have been obtained by our simulations. To simulate this situation, two different OOC scenarios were investigated with the fixed design points and the IC model of (17). Equations (19) and (20) represent the OOC model in each scenario. Equation (19) presents the OOC model of scenario I.

$$\begin{gathered} {\varvec{Y}}_{j} = Poisson\left( {{\varvec{\mu}}_{j} } \right){,} \hfill \\ j = 1{,}2{,}...{,} \hfill \\ \log \left( {{\varvec{\mu}}_{j} } \right) = \tilde{{\varvec{X}}}{\varvec{\beta}}_{j} + \delta_{3} \cos \left( {2\pi \tilde{{\varvec{X}}}^{\prime}_{pure} } \right){,} \hfill \\ \end{gathered}$$
(19)

where \(\tilde{{\varvec{X}}}^{\prime}_{pure}\) denotes the explanatory variables without the intercept term, i.e., \(\tilde{{\varvec{X}}}^{\prime}_{pure} = \left( {\begin{array}{*{20}c} {0.1} & {0.2} & \cdots & 1 \\ \end{array} } \right)\). The results of ARL1 (SDRL1) for this scenario are displayed in Table 4.
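Scenario I can be simulated by adding the cosine disturbance to the IC log-mean; a sketch under the fixed design of (17), with the IC coefficients β0 = (1, 1):

```python
import numpy as np

def scenario1_profile(delta3, rng=None):
    """Non-parametric scenario I of (19): a cosine disturbance of size
    delta3 added to the IC log-mean, log(mu) = X~ beta_0 + delta3*cos(2 pi x)."""
    rng = rng or np.random.default_rng()
    x_pure = np.round(np.arange(0.1, 1.01, 0.1), 1)
    log_mu = (1.0 + 1.0 * x_pure) + delta3 * np.cos(2 * np.pi * x_pure)
    return rng.poisson(np.exp(log_mu))
```

Setting `delta3 = 0` recovers the IC model; scenario II would replace the cosine term with the reciprocal disturbance of (20).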

Table 4 Comparison of ARL1 (SDRL1) for the non-parametric Poisson profiles in scenario I

Equation (20) provides the OOC model of scenario II. The results of ARL1 (SDRL1) for this scenario are displayed in Table 5.

$$\begin{gathered} {\varvec{Y}}_{j} = Poisson\left( {{\varvec{\mu}}_{j} } \right){,} \hfill \\ j = 1{,}2{,}...{,} \hfill \\ \log \left( {{\varvec{\mu}}_{j} } \right) = \tilde{{\varvec{X}}}{\varvec{\beta}}_{j} + \frac{{\delta_{3} }}{{\tilde{{\varvec{X}}}^{\prime}_{pure} }}{,} \hfill \\ \tilde{{\varvec{X}}}^{\prime}_{pure} = \left( {\begin{array}{*{20}c} {0.1} & {0.2} & \cdots & 1 \\ \end{array} } \right). \hfill \\ \end{gathered}$$
(20)
Table 5 Comparison of ARL1 (SDRL1) for the non-parametric Poisson profiles in scenario II

Comparing the two scenarios, OOC conditions were detected sooner under the first scenario for nearly all control charts; for example, the ARL1 (SDRL1) for the LRT scheme were 138.0 (127.7) and 228.9 (222.4) in scenarios I and II, respectively. This indicates that the cyclic patterns of Eq. (19) are easier to detect than the OOC term in Eq. (20). The ESVR scheme turned out to be the best among all competing schemes. In general, compared with the LRT and MEWMA schemes, the ESVR scheme shows both robustness and sensitivity to complete changes in the profile type across different shifts; the MEWMA has lower SDRL1 for small shifts in scenario I, but its performance was not comparable in the second scenario, which indicates a lack of robustness with respect to the size of the shift. One of the main reasons for this phenomenon is that most existing statistical control charts have been developed under fundamental assumptions about the properties of the process (Montgomery 2019; Gupta et al. 2006). While the process fully satisfies these assumptions, statistical control charts perform well, but their detection ability deteriorates in other situations such as non-parametric models, complex relational forms and so forth. In such conditions, several studies have reported that machine learning techniques can be superior to statistical methods and more robust (Yeganeh et al. 2022a, b; Pacella and Semeraro 2011; Mohammadzadeh et al. 2021; Chen et al. 2020). As expected, the ESVR scheme, being a machine learning technique, outperformed the statistical approaches, attaining the lowest values of ARL1 and SDRL1.

5 Sensitivity analysis

This section provides six different sensitivity analyses. First, the effect of the proposed input structure and training algorithm is evaluated against other machine learning techniques. Secondly, the detection ability under other desired ARL0 values is reported. Thirdly, a sensitivity analysis over different n values is performed. In the fourth part, the detection ability is increased with some run-rules. The effect of PSO in the training of ESVR is investigated in the fifth part and, finally, the merit of the proposed input structure is demonstrated in the last subsection. All the simulations have the same setups as in Sect. 5.1.

5.1 ARL1 comparisons under different machine learning techniques

To show the capability of the proposed input layer and training method, four different scenarios were designed with ANNs and usual SVRs. In the first setting, a common ANN with the back-propagation algorithm, called ANN-BP1, was trained as in Hosseinifard et al. (2011), i.e. the inputs were the estimated coefficients. Then, an ANN with the proposed input layer structure (i.e., 11 neurons in the input layer), called ANN-BP2, was trained in a similar way to ESVR. Both networks were trained with the ‘feedforwardnet’ function in MATLAB 2018 and have two hidden layers. Moreover, two SVRs, called SVR1 and SVR2, were trained with the same inputs as ANN-BP1 and ANN-BP2, respectively, using the ‘fitrsvm’ function. With these adjustments, ANN-BP1 and SVR1 assess the proposed input structure, while the performance of the training method is evaluated against ANN-BP2 and SVR2. The results of these setups are displayed in Table 6.

Table 6 Comparison of ARL1 (SDRL1) for different machine learning schemes in fixed design points

As can be seen, the ESVR outperformed ANN-BP1, ANN-BP2 and SVR2 for most of the shifts, whereas for small single positive shifts it performed worse than SVR1. Although the ESVR had lower ARL1 values than SVR1 for most of the shifts, the simpler training of SVR1 might seem to undermine the case for ESVR. A closer look at the ARL1 for different shifts, including negative shifts and simultaneous positive and negative shifts (results not shown), revealed that SVR1 suffers from the bias effect, meaning an inability to detect some shifts (this is also shown for ANN-BP2 in the last two rows of Table 6). As pointed out by Huwang et al. (2014), machine learning techniques suffer from the bias effect, meaning that, for such control charts, OOC signals are not triggered for some shifts; thus, some remedial actions should be considered. However, this bias and poor detection ability were not observed for the ESVR scheme.

5.2 ARL1 comparisons under an ARL0 value of 200

In Fig. 2, the simulation adjustments were made to obtain ARL0 = 200 under the fixed design points, and the ARL1 and SDRL1 values are reported in Panels (a) and (b), respectively. Figure 2a, b illustrates that the ESVR scheme remains superior in this new condition, which reveals the robustness of the ESVR scheme with respect to the type I error and/or ARL0 (Abbas et al. 2016). The comparative analysis in Fig. 2 remains valid for other values of ARL0, but these are not reported for the sake of brevity. Note that the results were obtained using the previously trained ESVR, where the CV was decreased to reach the desired ARL0 value of 200.

Fig. 2

The results of a ARL1 and b SDRL1 values for MEWMA, LRT and ESVR methods when ARL0 = 200

5.3 ARL1 comparisons under different n values

To study the effect of different sample sizes with ARL0 = 200, we also set n = 5, 15 and 20, keeping the step of 0.1 between design points; for example, \(\tilde{{\varvec{X}}}^{\prime}_{pure}\) = (0.1 0.2 … 1.5) when n = 15. Figure 3a, b illustrates comparisons for different values of n in terms of ARL1 and SDRL1, respectively. The results, obtained without retraining, were as expected for SDRL1, while the ARL1 values showed an unusual pattern for some moderate shifts, since one would expect that the greater the value of n, the lower the ARL1 for a specific shift. Another unusual pattern occurred for the third and fourth shifts when n = 5 (blue line): larger shifts had larger ARL1 and SDRL1. This may be due to the small sample size and biased parameter estimates. Montgomery (2019), Haq (2020) and Abbasi et al. (2022) mentioned that detecting OOC conditions in processes with variable sample sizes raises challenges such as variance inflation, and that statistical control charts need adaptive schemes to reduce this effect. The results indicate that ESVR can perform better as an adaptive scheme in the case of variable sample sizes. Conceptually speaking, the value of n is known in Phase II, so control charts are designed for a fixed n, and it is more practical to use ESVR in this situation with a specific training for each n. However, this is not the focus of this paper and will be reported in future studies.

Fig. 3

The results of a ARL1 and b SDRL1 values for different n when ARL0 = 200

5.4 Increasing the detection ability with run-rules

To improve the sensitivity of control charts, many techniques, including runs-rules, adaptive methods, variable sampling designs and mixed procedures, have been recommended in the area of profile monitoring (Haq 2020; Mohammadzadeh et al. 2021). For instance, adding runs-rules to a basic control chart can increase its ability to quickly detect shifts of different magnitudes. The variable sampling interval (VSI) technique, in which samples are taken at shorter intervals when there is potential for a shift in the IC model and at longer intervals in routine situations, has also been utilised in profile monitoring. Other adaptive methods, including modified successive sampling and ranked set based approaches, have been considered in profile monitoring as well (Maleki et al. 2018; Woodall 2007). As a particularly profitable technique, run-rules were extended by Yeganeh et al. (2021), Yeganeh and Shadman (2020) and Yeganeh and Shadman (2021), where they were used in the form of a rule matrix. To construct a rule matrix, the IC region is divided into several regions by a heuristic approach. The number of regions and the ratios of points falling in them are then compared with prespecified values (i.e., thresholds), and an OOC signal is obtained when they fall beyond these thresholds.
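Loosely following the rule-matrix idea (the actual regions and thresholds in the cited works differ; the values here are purely illustrative), a region-ratio check might look like:

```python
import numpy as np

def rule_matrix_signal(points, edges, thresholds):
    """Hedged sketch of a rule-matrix check: split the IC region at `edges`,
    compute the ratio of charted points falling in each region, and signal
    when any ratio exceeds its threshold."""
    counts = np.histogram(points, bins=edges)[0]
    ratios = counts / len(points)
    return bool((ratios > np.asarray(thresholds)).any())
```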

To investigate the effect of run-rules in combination with ESVR, the proposed ESVR method was supplemented with a rule matrix (denoted ESVR-RULE). For this aim, the ratio of the ESVR statistic (Oj) in the run-rule regions was computed and compared with the limits of the rule matrix. The details of the rule matrix are omitted for the sake of brevity. For comparison purposes, the combination of a rule matrix and the MEWMA scheme (denoted MEWMA-RULE) was also provided. Figure 4a, b illustrates the results of ESVR, ESVR-RULE, MEWMA and MEWMA-RULE.

Fig. 4

The results of a ARL1 and b SDRL1 values for combination of ESVR and MEWMA with run-rules

From Fig. 4, it is observed that run-rules improved the performance of both MEWMA and ESVR in terms of ARL1 and SDRL1 under small and moderate shifts. In the IC condition, however, the SDRL0 of the MEWMA-RULE and ESVR-RULE were tangibly greater than those of MEWMA and ESVR, respectively, which may increase the number of false alarms in specific conditions. The second finding of this simulation study was that run-rules were not effective for large shifts. This finding is rational, as run-rules usually improve the detection of small shifts (Montgomery 2019). Finally, the ESVR-RULE performed better than the MEWMA-RULE, which again reveals the superior detection ability of the proposed method.

5.5 Effect of EA in training of ESVR

As mentioned in the previous section, the optimization problem in (16) is solved with PSO in ESVR. To show the superiority of PSO over other common EAs, three other well-known EAs, namely GA, DE and SA, were also employed. For this aim, all design steps were kept identical across the methods, which differed only in the algorithm used to solve Eq. (16). Figure 5 depicts the ARL1 and SDRL1 values for our proposed method, ESVR (PSO), and for DE, GA and SA.

Fig. 5

The results of a ARL1 and b SDRL1 values for different EA in training of ESVR

It is clear that ESVR (PSO) had the best detection ability for all shifts in terms of ARL1, and it was also the best approach for small and moderate shifts in terms of SDRL1; GA had a very small advantage over ESVR (PSO) for large shifts in terms of SDRL1. Hence, these and other similar simulations justified the choice of PSO for our proposed method. However, as stated in the literature (Adnan et al. 2021a; Ikram et al. 2022a; Kadkhodazadeh and Farzin 2022), some EAs such as EMPA, GBO, MFO, WCA and GWO may be highly sensitive to the initial parameters and adjustments. Hence, superior performance over PSO may be achievable with these approaches under some sensitivity analysis. This idea can be investigated in future work by interested researchers.

5.6 Effect of input features

To show the best performance of our proposed input structure \(\left( {I_{j} = \left[ {EWMA_{Pj} ,\frac{{d_{MEWMA}^{(1)} }}{j},\frac{{d_{MEWMA}^{(2)} }}{j},\frac{{d_{MEWMA}^{(3)} }}{j},M_{j} ,\frac{{d_{LRT}^{(1)} }}{j},\frac{{d_{LRT}^{(2)} }}{j},\frac{{d_{LRT}^{(3)} }}{j},LRT_{j} } \right]} \right)\), some other input combinations were defined as follows:

  • ESVR1: \(I_{j} = [EWMA_{Pj} ]\).

  • ESVR2: \(I_{j} = \left[ {EWMA_{Pj} ,\frac{{d_{MEWMA}^{(1)} }}{j},\frac{{d_{MEWMA}^{(2)} }}{j},\frac{{d_{MEWMA}^{(3)} }}{j},M_{j} } \right]\).

  • ESVR3: \(I_{j} = \left[ {EWMA_{Pj} ,\frac{{d_{LRT}^{(1)} }}{j},\frac{{d_{LRT}^{(2)} }}{j},\frac{{d_{LRT}^{(3)} }}{j},LRT_{j} } \right]\).

  • ESVR4: \(I_{j} = \left[ {\frac{{d_{MEWMA}^{(1)} }}{j},\frac{{d_{MEWMA}^{(2)} }}{j},\frac{{d_{MEWMA}^{(3)} }}{j},M_{j} } \right]\).

  • ESVR5: \(I_{j} = \left[ {\frac{{d_{LRT}^{(1)} }}{j},\frac{{d_{LRT}^{(2)} }}{j},\frac{{d_{LRT}^{(3)} }}{j},LRT_{j} } \right]\).

The training procedure for each of the above input combinations was the same as for ESVR; the only difference was the input size. Consequently, the dimensions of the inputs in the training data were p + 1, p + 5, p + 5, 4 and 4 for ESVR1, ESVR2, ESVR3, ESVR4 and ESVR5, respectively. Figure 6 depicts the performance of the ESVR approach for each predefined input combination.

Fig. 6
figure 6

The results of a ARL1 and b SDRL1 values for different input combinations

The superiority of ESVR over the other input combinations is obvious from Fig. 6. Removing any part of the input features leads to a noticeable deterioration in the performance of the ESVR scheme in terms of ARL1 and SDRL1. The loss in the ability to identify OOC situations is most apparent for large shifts such as δ1 = δ2 = 0.59, for which ESVR1 was nearly five (ten) times slower than ESVR in terms of ARL1 (SDRL1). As another finding, it can be inferred from the superiority of ESVR2 and ESVR3 over ESVR1, ESVR4 and ESVR5 that combining the control chart statistics with the EWMA form of the estimated parameters has a strong effect on detection ability. That is, the LRT and MEWMA statistics alone were not able to increase the detection ability; they required some characteristics of the process to reduce ARL1 and SDRL1. The weak performance of ESVR1, which utilized only the EWMA form of the estimated parameters, also confirms this argument.

6 Diagnosis aid

In some real cases, the practitioner is interested in identifying which parameters have shifted after an OOC signal has been detected; however, this task, called profile diagnosis, has received scant attention in the profile monitoring literature. For example, statistics have been proposed by Zou et al. (2007), Zou et al. (2008) and Huwang et al. (2016) for diagnosing the causes of shifts in linear, non-parametric and logistic profiles, respectively. Yeganeh and Shadman (2020) introduced a different approach using an ANN with signalling rules as a tool for profile diagnosis. However, to the best of the authors’ knowledge, there is no research work on diagnosis for Poisson profiles. In this paper, a novel structure based on a set of SVRs is proposed for diagnosis actions in Poisson profiles.

6.1 Requirements for profile diagnosis actions

There are two key points in the profile diagnosis simulations. Firstly, profile diagnosis is usually implemented after change point estimation; this step is not part of this paper, as we have assumed that all shifts occur from the onset of the process. The IC estimated profiles then need to be removed or ignored (see, for example, Fig. 9 in Yeganeh and Shadman (2020)) so that the machine learning procedure is based only on OOC samples when identifying the parameters that have changed. Secondly, for a fair judgement, it is assumed that the control charts use the same signalling method, because similar diagnosis techniques can yield different results under different signalling methods; see, for example, Zou et al. (2007) and Huwang et al. (2014). In this paper, the diagnosis actions are implemented after a signal is triggered by the ESVR control chart.

6.2 The proposed SVR structure in profile diagnosis

Following the model of Yeganeh and Shadman (2020), SVR is used in this paper for profile diagnosis actions. Yeganeh and Shadman (2020) used the EWMA statistics of the estimated parameters as the inputs of an ANN. Since it incorporates the information of previous samples, the EWMA statistic accounts for the change point effect automatically. Thus, after an OOC signal, the EWMAPj \(\left( {EWMA_{Pj} = \left[ {\beta^{\prime}_{1j} ,\beta^{\prime}_{2j} ,...,\beta^{\prime}_{pj} ,\overline{y}^{\prime}_{j} } \right]} \right)\) of the last sample (i.e., the \({j}^{th}\) sample, which here is the signalling sample; hereafter its index is omitted) is taken as the input vector for profile diagnosis. There is, however, a fundamental difference between ANN and SVR: while a different neuron can be assigned to each parameter in the output layer of an ANN, an SVR can only generate one output. This is the major challenge that arises when conducting profile diagnosis with SVR as a classification problem. To overcome it, we use an approach denoted SVRS (SVR Set), in which one SVR, called SVRD (SVR Diagnosis), is trained for each possible change, and the existence of that shift is identified by its SVRD. In this approach, there are in total p SVRDs in the diagnosis process; that is, SVRS consists of p SVRDs, namely SVRD1, SVRD2, …, SVRDp. For example, if there are two parameters in the IC profile, two SVRDs identify the shifted parameters such that the first and second SVRDs indicate shifts in the first and second parameters, respectively. Naturally, identification of a shift by both SVRDs indicates a simultaneous shift.

Also, a limit called CVD (Cutting Value Diagnosis) is assigned to each SVRD to identify the change in its parameter. To conduct diagnosis actions after an OOC signal by the ESVR control chart, the \((p+1)\)-dimensional vector EWMAP is computed from the last (i.e., current) sample and then fed to each SVRD. In other words, SVRD1, SVRD2, … and SVRDp all receive the same input. The outputs of the SVRDs are compared with their CVDs to identify the shifted parameters.
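The comparison step can be sketched as follows (a Python illustration with hypothetical names; the fitted SVRDs are represented simply by their numeric outputs):

```python
def diagnose(svrd_outputs, cvds):
    """Flag the k-th parameter as shifted when the output of SVRD_{k+1}
    exceeds its cutting value CVD_{k+1}; every SVRD sees the same EWMA_P
    input, so only the scalar outputs are needed here."""
    return [k for k, (out, cvd) in enumerate(zip(svrd_outputs, cvds))
            if out > cvd]

# with the outputs and cutting values of the illustrative example in Sect. 7
print(diagnose([1.16, -0.27], [0.39, 0.42]))  # → [0]: shift in the first parameter
```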

Considering the IC model given in (17) with p = 2, SVRS includes SVRD1 and SVRD2 with cutting values CVD1 and CVD2, which identify shifts in the first and second parameters, respectively. To better illustrate this procedure, Fig. 7 depicts the diagnosis steps of SVRS after an OOC signal is detected in the jth profile, using the IC model defined in (17) with p = 2.

Fig. 7
figure 7

The profile diagnosis procedure of SVRS approach after an OOC signal by ESVR when p = 2

To train each SVRD in SVRS, OOC profiles are generated until an OOC signal is obtained by ESVR. Then, the EWMAP is taken as the input of the training dataset. The targets are defined such that, for SVRDp, the target value is 1 when there is a shift in the pth parameter and 0 otherwise. Other training aspects and assigned limits are the same as those in Yeganeh and Shadman (2020); hence, the details are omitted here for brevity.
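The target construction above can be sketched as follows (a Python illustration under stated assumptions; `make_targets` and `shift_sets` are hypothetical names, not from the paper):

```python
import numpy as np

def make_targets(shift_sets, p):
    """Build the 0/1 training targets for the p SVRDs: column k is the
    target of SVRD_{k+1}, equal to 1 for profiles whose generating shift
    involved parameter k+1 and 0 otherwise. `shift_sets` holds, for each
    OOC training profile, the set of shifted parameter indices."""
    targets = np.zeros((len(shift_sets), p))
    for i, shifted in enumerate(shift_sets):
        for k in shifted:
            targets[i, k] = 1.0
    return targets

# three training profiles: intercept shift, slope shift, simultaneous shift (p = 2)
print(make_targets([{0}, {1}, {0, 1}], p=2))
```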

6.3 The accuracy of the proposed SVRS structure in profile diagnosis

Due to the lack of research on profile diagnosis for Poisson profiles, four other competitors, comprising three machine learning techniques and one statistical method, are considered in this paper. First, we use multiclass SVM (the details are omitted here to save space) with the ‘fitcecoc’ function in MATLAB, denoted MSVM. For a broader comparison, ANNs trained with back-propagation and with an entropy-based training algorithm (the ‘feedforward’ and ‘patternnet’ functions in MATLAB), denoted ANN-BP and Patternnet, are also employed for profile diagnosis. In addition, the Wald statistic proposed in Huwang et al. (2016) is computed as the last competitor (denoted Wald Test). The OOC profiles are generated from (17) with \(p=2\). The simulation procedure for obtaining the diagnosis accuracy of SVRS when there is a shift in the intercept is described in Pseudocode 3.

figure c

Note that Pseudocode 3 is adapted to the situations in which the OOC shift occurs in the slope, or in both parameters simultaneously, by replacing the last if-statement with the following:

  • If (output of SVRD1 < CVD1 & output of SVRD2 > CVD2) % Shift in slope

  • If (output of SVRD1 > CVD1 & output of SVRD2 > CVD2) % Shift in intercept and slope
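The accuracy estimation loop of Pseudocode 3 can be sketched as follows (a Python illustration; `diagnosis_accuracy`, `gen_signalling_input` and the stub predictors are hypothetical stand-ins for the paper's trained SVRDs and profile generator):

```python
def diagnosis_accuracy(svrds, cvds, gen_signalling_input, true_shift,
                       max_it=10_000):
    """Monte-Carlo accuracy estimate in the spirit of Pseudocode 3:
    generate a signalling OOC sample, run every SVRD on its EWMA_P input,
    and count the run as 'Corrected' only when the flagged set of
    parameters equals the truly shifted set."""
    corrected = 0
    for _ in range(max_it):
        x = gen_signalling_input()
        flagged = {k for k, (f, cvd) in enumerate(zip(svrds, cvds))
                   if f(x) > cvd}
        corrected += (flagged == true_shift)
    return corrected / max_it

# deterministic stubs standing in for trained SVRDs (intercept shift only)
acc = diagnosis_accuracy([lambda x: 1.16, lambda x: -0.27], [0.39, 0.42],
                         lambda: None, true_shift={0}, max_it=100)
print(acc)  # 1.0 with these deterministic stubs
```

In the real simulation each iteration regenerates OOC profiles until ESVR signals, so the SVRD outputs vary and the accuracy falls below 1.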

The diagnosis accuracy results for Poisson profiles, based on 10,000 iterations (MaxIt), are reported in Table 7, where boldfaced values denote the best performing scheme at that particular shift size. For example, the accuracy of SVRS for the first shift is 0.52, which means that the “Corrected” counter in Pseudocode 3 equals 5200. From Table 7, it can be seen that SVRS and Patternnet are preferred over the others in terms of average accuracy, while ANN-BP is the best method in terms of the standard deviation of the accuracies. MSVM and the Wald Test have biased performances; that is, for some shifts they are not able to detect any shift, while they have good accuracies for other shifts.

Table 7 Profile diagnosis accuracy

7 Illustrative example

A real-life application of Poisson profiles in the airline industry is provided here, taken from Chatterjee and Hadi (2013) and Alevizakos et al. (2019b). The aim of this example is to examine the relationship between the number of injury incidents and the proportion of total flights over time. Naturally, the probability of accidents is expected to increase with the proportion of total flights.

To this end, the accidents and injuries of nine major USA airlines were studied in these references. If all the airlines perform equally safely in a specific period, the injury incidents can be explained by the IC model, taking the number of flights of each airline as a percentage of the total number of flights as the explanatory variable and the injury incidents as the response variable. Following Eq. (2), the IC Poisson model is established from the relationship between the explanatory and response variables:

$$\begin{gathered} {\varvec{\beta}}_{{\varvec{0}}} = \left[ {0.8945\;\;8.5018} \right], \hfill \\ \tilde{\varvec{X}}^{\prime} = \left( {\begin{array}{*{20}c} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.0503 & 0.054 & 0.0629 & 0.075 & 0.095 & 0.1292 & 0.1382 & 0.1920 & 0.2078 \\ \end{array} } \right), \hfill \\ n = 9, \;\; p = 2. \hfill \\ \end{gathered}$$
(21)

To attain an ARL0 of 200, the values of UCLMEWMA, UCLLRT and CV are obtained as 1.303, 10.53 and 6.61, respectively. As is common in Phase II applications, the intercept is changed to 0.965 as an artificial shift in order to produce an OOC signal. Table 8 gathers the details of the 11 generated OOC profiles: the estimated parameters (first part), normalized parameters (second part), EWMA statistics of the normalized parameters (third part), MEWMA and LRT statistics (fourth part) and the ratio of samples (fifth part). Note that the input vector of ESVR in this example has length 11 (p + 9); for example, the input for the first sample (j = 1) is [0.31 − 2.17 0.2 1 0 0 0.23 0 1 0 6.98].
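The IC model of Eq. (21) can be evaluated numerically; a short sketch, assuming Eq. (2) uses the canonical log link of the Poisson GLM (so the expected count is λ = exp(β0 + β1x)):

```python
import numpy as np

# IC parameters and explanatory values taken from Eq. (21)
beta0, beta1 = 0.8945, 8.5018
x = np.array([0.0503, 0.054, 0.0629, 0.075, 0.095,
              0.1292, 0.1382, 0.1920, 0.2078])

# under the log link, the expected number of injury incidents per airline
lam = np.exp(beta0 + beta1 * x)
print(np.round(lam, 2))  # expected counts increase with the flight share
```

As expected, a larger share of total flights yields a larger expected incident count, consistent with the positive slope.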

Table 8 The OOC profiles characteristics of the illustrative example

The outputs of ESVR for the above inputs are depicted in Fig. 8. The MEWMA and LRT statistics are also added to this figure to visualise their trends relative to ESVR. It can be observed from Fig. 8 that ESVR triggered an OOC signal at the 11th sample, while the LRT and MEWMA procedures were not able to detect this shift.

Fig. 8
figure 8

The statistics of the ESVR, MEWMA and LRT control charts for 11 OOC generated profiles in the illustrative example

To identify the shifted parameter, EWMAP = [0.1 − 0.59 1.15] is fed into SVRD1 and SVRD2, whose outputs are 1.16 and − 0.27, respectively. This indicates a shift in the first parameter, because CVD1 and CVD2 are 0.39 and 0.42 (i.e., 1.16 > 0.39 and − 0.27 < 0.42).

From this example, it is observed that Oj exceeds its control limit sooner than the MEWMA and LRT statistics do, and the results of this case study accord with the simulation results. This suggests that the proposed ESVR has excellent potential in practical Phase II SPC applications in comparison with the other competitors. These findings also provide evidence that the proposed diagnosis approach has a significant impact on the detection of shifted parameters. Therefore, the ESVR control chart is found to be more efficient in Poisson profile monitoring.

8 Conclusions

A novel use of SVR as a control chart was developed to monitor Poisson profiles in Phase II. This method, equipped with new input features and an evolutionary training procedure based on the PSO algorithm, is able to quickly detect OOC situations, thanks to the advantage of an evolutionary training framework in both parametric and non-parametric monitoring where the OOC model can be unknown. To design a more efficient method for identifying small and moderate shifts, the proposed scheme was incorporated with additional run-rules. Finally, a diagnostic procedure based on a set of SVRs was proposed and compared with ANN-based methods; both the SVR and ANN approaches showed good diagnosis ability. The contributions of this study are, firstly, the implementation of SVR as a control chart for monitoring Poisson profiles; secondly, the use of a novel input feature corresponding to the ratio of the MEWMA and LRT statistics; and lastly, the training of the SVR using the evolutionary PSO algorithm.

That said, owing to the requirements of evolutionary computation for training ESVR, the proposed approach demands more computation than statistical approaches such as MEWMA and LRT. Note, though, that this challenge commonly arises in machine learning applications, and due to the rapid advancement of artificial intelligence technology on both the software and hardware sides, its importance has decreased in recent years.

For future research, the investigation of other evolutionary algorithms with different IC profile types and sample sizes would be worthwhile. Moreover, the proposed method can conveniently and effectively tackle other non-parametric profile monitoring problems; for example, applying the proposed approach with a linear IC model could be a good direction for future work.