1 Introduction

Complex system models fall into two main types: first-principle models and data-driven models. First-principle models require strong professional background knowledge and a deep understanding of the modeled object, and many approximations and assumptions are needed in the modeling process. Therefore, when the object structure is complex, a first-principle model is difficult or impossible to obtain. In contrast, data-driven models are more robust in determining the relationship between input and output data; they require only a small amount of background knowledge and have strong data processing capability.

The establishment of data-driven models can be divided into two categories according to whether the model structure must be determined in advance. The first category determines the model structure first and then identifies the model parameters, yielding a parametric model. The second category determines the model structure and parameters simultaneously during the modeling process, yielding a non-parametric model. The latter methods are more computationally complex, but their model accuracy is higher. With the development of deep learning, many efficient algorithms of this kind have emerged, such as stacked autoencoders, deep belief networks (DBN), and long short-term memory (LSTM) networks.

According to the number of models, a data-driven model can be a single model or a multi-model. Commonly, a single model is used to describe objects with obvious linear characteristics, whereas a multi-model is applied to complex system modeling and complex industrial data monitoring [1,2,3]. The basic idea of the multi-model approach is to establish sub-models corresponding to operating conditions or modes with different characteristics, forming a model database; during online monitoring, the current data status is identified and the appropriate sub-model is selected for data monitoring.

Since the multi-model method must establish different monitoring models for modes with obvious differences, an important problem is how to divide the offline historical data into different modes when prior knowledge of the process is insufficient. This problem manifests as working-point identification in multi-model modeling and as mode identification in industrial monitoring.

The traditional methods for multi-condition identification include artificial judgment based on prior knowledge and machine assistance based on recursive methods [4]. With the development of machine learning, many practical methods for condition division or mode identification have been developed, such as Sliding Windows (SW), Top-Down, Bottom-Up, Sliding Window and Bottom-Up (SWAB) [5], and Feasible Space Window (FSW) [6]. Ge et al. [7] divided historical data into several independent groups using fuzzy C-means clustering, established the sub-models corresponding to the data subsets using feature extraction, and finally integrated the monitoring results of the sub-models with Bayesian reasoning; the proposed method can handle non-Gaussian information in each operation mode. Zhu et al. [8] proposed a two-layer clustering method based on a global moving-window strategy and a global clustering solution strategy and successfully divided the data into different modes; the method allows the different ICA–PCA models to overlap. Lu [9] and Zhao et al. [10] applied the K-means clustering algorithm to the period division of batch processes, although the performance of these methods in continuous processes remains to be discussed.

In [11], a fuzzy segmentation of time series based on kernel principal component analysis (KPCA) and Gath-Geva (G-G) clustering is proposed, in which window division of the multivariate time series is used; its advantage is that time-dimension attributes are introduced as extra variables. Zhang Yue [12] introduced principal component analysis into the design of sub-window time-span division, where the demarcation points of the sub-window time spans are determined by piecewise analysis, rolling merging, and cyclic validation; the final result is obtained by multi-step cyclic iteration, so the computation is heavy. Song et al. [13] used the recursive local outlier factor algorithm to divide a multimode chemical process into stable modes and conversion modes and established the corresponding models; in this algorithm, the number of modes does not need to be determined in advance, and details of mode switching can be acquired. Lv [14] proposed a feature extraction method based on the weighted kernel Fisher criterion to improve clustering accuracy, in which feature mapping is adopted to bring edge classes and outliers closer to the other normal subclasses. Li Wei [15] used a fuzzy C-means clustering algorithm based on a conditionally positive definite kernel to cluster the dataset, after which least-squares support vector regression is conducted for each cluster; the method realizes the clustering of irregular data. Zhang Shu Mei [16] proposed an automatic offline mode recognition method for multi-modal processes based on an improved K-means clustering algorithm, which avoids the influence of manual identification on the results. Among the existing methods, clustering based on unsupervised learning has been the most widely used. Its main advantage is that it does not rely on prior knowledge, but it also has shortcomings, especially in handling window or mode boundary data.

This paper considers the time-sequence relationship of the object under study from the perspective of semi-supervised clustering and proposes a hybrid constraint that combines pairwise constraints and time constraints to improve the identification accuracy of working conditions or modes in boundary areas. The simulation results show that semi-supervised clustering based on mixed constraints achieves higher accuracy in condition identification, especially for boundary data.

This paper is organized as follows. First, the background knowledge is introduced, including the cost of condition division and the traditional methods. Second, the semi-supervised clustering with mixed constraints is described in detail. Third, an online recognizer based on RBFNN is designed. Fourth, the proposed method is compared with the traditional methods by simulation. The final section presents the key conclusions and limitations of this work and offers directions for future research.

2 Background knowledge

2.1 Cost of condition division

Process data condition partition is equivalent to the problem of multivariate time series segmentation. In essence, for a given k-dimensional time series \(X = \left\{ x_{1} ,x_{2} , \ldots ,x_{T} \right\}\), \(x_{t} = \left( {x_{1t} ,x_{2t} , \ldots ,x_{kt} } \right)^{T}\), the time domain is divided according to the change law of the data and the correlation between successive samples.

Assuming that the time series is divided into \(N\) segments, the boundary time labels of the segmentation result are defined as \(t = \left\{ t_{1} ,t_{2} , \ldots ,t_{N} \right\}\); the segmentation result \(t\) then satisfies \(0 < t_{1} < t_{2} < \cdots < t_{N} = T\).

In the problem of time series segmentation, \(t_{1} ,t_{2} , \ldots ,t_{N}\) are called segmentation boundaries or mutation points, \(\left[ {t_{0} + 1,t_{1} } \right],\left[ {t_{1} + 1,t_{2} } \right], \ldots ,\left[ {t_{N - 1} + 1,t_{N} } \right]\) (with \(t_{0} = 0\)) are called segments, and the number of segments \(N\) is called the segmentation order [17].

Thermal data condition partition, i.e., time series segmentation, can be described as an optimization problem. The overall cost of segmentation is \(J\left( t \right)\),

$$ J\left( t \right) = \mathop \sum \limits_{i = 0}^{N - 1} d_{{t_{i} + 1,t_{i + 1} }} $$
(1)

where \(d_{s,t} \;(0 \le s < t \le T)\) is the segmentation error of segment \(\left[ {s,t} \right]\). It is a local error, determined by the data in the time series segment \(\left\{ x_{s} ,x_{s + 1} , \ldots ,x_{t} \right\}\),

$$ d_{s,t} = \mathop \sum \limits_{\tau = s}^{t} \left( {x_{\tau } - \hat{x}_{\tau } } \right)^{T} \left( {x_{\tau } - \hat{x}_{\tau } } \right) $$
(2)

where \(\hat{x}_{\tau }\) is the estimated value of \(x_{\tau }\).

The condition partition or segmentation process is not a simple single-objective optimization problem. While ensuring the overall segmentation cost, it is necessary to keep the local segmentation costs as close to each other as possible and at a low level. Thus, the multivariate time series segmentation problem is transformed into a constrained optimization problem or a multi-step optimization problem.
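As a concrete illustration, the following sketch evaluates the segmentation cost of Eqs. (1)-(2) for a given set of boundary labels. For simplicity it assumes the per-segment estimate \(\hat{x}_{\tau }\) is the segment mean; any local model (e.g., a linear fit) could be substituted.

```python
import numpy as np

def segment_error(X, s, t):
    """Local error d_{s,t} of segment [s, t] (Eq. 2), with the
    estimate x_hat taken as the segment mean (an assumption;
    any local model could be substituted)."""
    seg = X[s:t + 1]                      # rows are samples x_tau
    resid = seg - seg.mean(axis=0)        # x_tau - x_hat_tau
    return float((resid * resid).sum())   # sum of squared residuals

def total_cost(X, boundaries):
    """Overall segmentation cost J(t) (Eq. 1) for boundary labels
    t_1 < t_2 < ... < t_N = T (0-based, inclusive end indices)."""
    cost, start = 0.0, 0
    for b in boundaries:
        cost += segment_error(X, start, b)
        start = b + 1
    return cost

# Example: a 1-D series with an obvious mean shift at index 49
X = np.r_[np.zeros(50), np.ones(50)].reshape(-1, 1)
print(total_cost(X, [49, 99]))   # ~0: boundary matches the shift
print(total_cost(X, [24, 99]))   # larger: boundary misplaced
```

A boundary set that matches the true change point drives the cost toward zero, which is what the optimization over \(t\) exploits.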

2.2 The traditional methods for multi-condition identification

The traditional methods for multi-condition identification include artificial judgment based on prior knowledge and machine assistance based on the recursive method. Among them, the more famous methods include Sliding Windows (SW), Top-Down, Bottom-Up, Sliding Window and Bottom-Up (SWAB), Feasible Space Window (FSW).

Sliding Windows (SW) [5]: the algorithm determines the width of a potential segment recursively. The left endpoint is anchored at the first data point, and the algorithm attempts to approximate the data to the right with increasingly longer segments. When, at some point \(i\), the error exceeds the user-specified threshold, the subsequence from the anchor to \(i - 1\) is transformed into a segment. The anchor is then moved to location \(i\), and the process repeats until the entire time series has been transformed into a piecewise linear approximation.
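A minimal sketch of the SW loop is given below; error_fn is any segment-error function with the signature of segment_error above (the helper names are illustrative, not from [5]).

```python
def sliding_window(X, max_error, error_fn):
    """Sliding Window segmentation sketch: grow a segment from an
    anchor until its approximation error exceeds max_error, then cut.
    error_fn(X, s, t) returns the segment error, e.g. segment_error
    above (an assumption)."""
    segments, anchor, n = [], 0, len(X)
    i = anchor + 1
    while i < n:
        if error_fn(X, anchor, i) > max_error:
            segments.append((anchor, i - 1))  # cut before point i
            anchor = i
        i += 1
    segments.append((anchor, n - 1))          # final segment
    return segments

# e.g. sliding_window(X, 1.0, segment_error)
```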

Top-Down [5]: the algorithm considers every possible split of the time series and splits it at the best location. Both subsections are then tested to see whether their approximation errors are below a user-specified threshold. If not, the algorithm recursively continues to split the subsequences until all segments have approximation errors below the threshold.

Bottom-Up [5]: the algorithm first creates the finest possible approximation of the time series, so that \(n/2\) segments are used to approximate an n-length time series. Next, the cost of merging each pair of adjacent segments is calculated, and the algorithm iteratively merges the lowest-cost pair until a stopping criterion is met.
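The following sketch mirrors this merge loop; as before, error_fn is an assumed segment-error function, and the merge cost is approximated by the error of the merged segment.

```python
import numpy as np

def bottom_up(X, max_error, error_fn):
    """Bottom-Up segmentation sketch: start from the finest segments
    and repeatedly merge the cheapest adjacent pair until the
    cheapest merge would exceed max_error (the stopping criterion)."""
    # finest approximation: n/2 two-point segments
    segs = [(i, min(i + 1, len(X) - 1)) for i in range(0, len(X), 2)]
    while len(segs) > 1:
        # cost of merging each adjacent pair of segments
        costs = [error_fn(X, segs[k][0], segs[k + 1][1])
                 for k in range(len(segs) - 1)]
        k = int(np.argmin(costs))
        if costs[k] > max_error:
            break
        segs[k:k + 2] = [(segs[k][0], segs[k + 1][1])]  # merge pair
    return segs
```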

Sliding Window and Bottom-Up (SWAB) [5]: the algorithm keeps a small buffer. Bottom-Up is applied to the data in the buffer, and the leftmost segment is reported. The data corresponding to the reported segment are removed from the buffer, and more data points are read in and incorporated into the buffer. This process of applying Bottom-Up to the buffer and reporting the leftmost segment repeats until the entire series has been processed.

Feasible Space Window (FSW) [6]: the algorithm introduces the notion of a Candidate Segmenting Point (CSP), which may be chosen as the next segmenting point; the distances of all points lying between the last segmenting point and the newly chosen one must be within the maximum error tolerance. The key idea of FSW is to search for the farthest CSP so as to make the current segment as long as possible under the given maximum error tolerance.

The comparison of the above algorithms is shown in Table 1.

Table 1 Comparison results of the above algorithms

2.3 Semi-supervised clustering method

Unlike traditional unsupervised clustering algorithms, such as the K-means and expectation–maximization (EM) algorithms, semi-supervised clustering combines clustering and semi-supervised learning to improve clustering performance using the small amount of labeled data and prior knowledge available in massive datasets. Semi-supervised clustering algorithms can be divided into constraint-based, distance-based, and combined constraint- and distance-based algorithms.

2.3.1 Semi-supervised clustering based on constraints

The idea of such algorithms is to add constraint information to the traditional clustering objective to improve clustering performance. The most typical algorithms are Seeded-K-means [18] and Cop-K-means [19], which have played a crucial role in the development of semi-supervised clustering from different perspectives of supervisory information.

The Seeded-K-means algorithm uses a small number of labeled samples as a seed set and obtains the initial cluster centres from it, improving clustering accuracy. Basu et al. [18] define the seed set as follows: if \(S \subseteq L\) and, for every \(x_{i} \in S\) (i = 1, 2, …, |S|, where |S| is the size of S), there exists \(y_{i} \in Y\) such that \(\left( {x_{i} ,y_{i} } \right) \in L\), then S is called a seed set. In particular, when the number of categories to which the samples in S belong equals k, S can be expressed as \(S = \bigcup\nolimits_{i = 1}^{k} {S_{i} }\), where \(S_{i}\) is the non-empty sample set of class i.
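A minimal sketch of this seeding step, assuming the seed set is supplied as one index array per class and using scikit-learn's KMeans for the subsequent iterations:

```python
import numpy as np
from sklearn.cluster import KMeans

def seeded_kmeans(X, seed_sets, n_iter=100):
    """Seeded-K-means sketch: initial centres are the means of the
    k non-empty seed subsets S_1..S_k, then standard K-means runs.
    seed_sets is a list of index arrays, one per class (assumed)."""
    centres = np.array([X[idx].mean(axis=0) for idx in seed_sets])
    km = KMeans(n_clusters=len(seed_sets), init=centres, n_init=1,
                max_iter=n_iter)
    return km.fit_predict(X)
```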

Although semi-supervised clustering based on seed sets can effectively improve clustering performance, it largely depends on the scale and quality of the seed set. Guo Maozu and Deng Chao [20] introduced a semi-supervised clustering algorithm based on tri-training and data editing, combining it with the depuration data-editing technique to correct and purify mislabeled samples in the seed set while expanding its size to improve its quality.

The Cop-K-means algorithm introduces the idea of pairwise constraints into the traditional K-means algorithm. During data assignment, data objects must satisfy the Must-link (ML) and Cannot-link (CL) constraints: under ML, two selected points must belong to the same class, and under CL, two selected points must not belong to the same class. The constraints have symmetry and transitivity characteristics, expressed as follows.

Symmetry:

$$ \left( {x_{i} ,x_{j} } \right) \in {\text{ML}} \Rightarrow \left( {x_{j} ,x_{i} } \right) \in {\text{ML}} $$
$$ \left( {x_{i} ,x_{j} } \right) \in {\text{CL}} \Rightarrow \left( {x_{j} ,x_{i} } \right) \in {\text{CL}} $$

Transitivity:

$$ \left( {x_{i} ,x_{j} } \right) \in {\text{ML}}\;\& \;\left( {x_{j} ,x_{k} } \right) \in {\text{ML}} \Rightarrow \left( {x_{i} ,x_{k} } \right) \in {\text{ML}} $$
$$ \left( {x_{i} ,x_{j} } \right) \in {\text{ML}}\;\& \;\left( {x_{j} ,x_{k} } \right) \in {\text{CL}} \Rightarrow \left( {x_{i} ,x_{k} } \right) \in {\text{CL}} $$

Symmetry and transitivity are crucial for pairwise constraints: when samples are assigned under the constraint relationships, a constraint violation can occur only for a CL constraint; in all other cases no sample assignment failure arises.
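The following sketch shows the Cop-K-means-style feasibility check implied by these properties; the data structures (index-pair sets for ML and CL, a label list with None for unassigned samples) are assumptions for illustration.

```python
def violates_constraints(i, c, labels, ml, cl):
    """Cop-K-means style check: True if putting sample i into
    cluster c breaks a Must-link or Cannot-link constraint.
    labels[j] is the current cluster of j (None if unassigned);
    ml and cl are sets of index pairs (assumed symmetric)."""
    for (a, b) in ml:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] is not None and labels[j] != c:
            return True        # ML partner already placed elsewhere
    for (a, b) in cl:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] == c:
            return True        # CL partner already in this cluster
    return False
```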

The quality of constraint information directly affects clustering results, so clustering performance can be improved only by obtaining better constraint information. Zhu Yu et al. [21] proposed an improved linked Cop-K-means (LCOP-K-means) algorithm based on breadth-first search and the Cop-K-means algorithm, which improves data stability and clustering accuracy. Li Chao Ming et al. [22] proposed a cross-entropy semi-supervised clustering algorithm based on pairwise constraints; this method uses the cross-entropy of samples to express the pairwise constraint information, providing higher clustering accuracy and better results with a smaller amount of pairwise constraint information.

2.3.2 Semi-supervised clustering combining multiple methods

Chen [23] and Chang [24] have suggested that a semi-supervised clustering algorithm can simultaneously use two types of supervision information, such as class labels and pairwise constraints; in particular, when active learning is added to actively label samples, higher-quality supervision information and better clustering results can be obtained.

Wei et al. [25] proposed a semi-supervised clustering method based on pairwise constraints and measures. For data marked by pairwise constraints, the constraint- and measure-based semi-supervised clustering methods are used to generate different basic clustering partitions, and the target clustering is then obtained by integration.

Compared with a single clustering method, the combination of multiple clustering methods can make the best use of the given supervision information, improving algorithm performance. For the identification of multi-model working conditions, this article uses the initial seed set to provide the number of working conditions and reference centres for the datasets under the same working condition, and focuses on solving the problem of fuzzy boundaries between different working conditions. By introducing mixed constraints, the accuracy of the boundary division of working conditions is improved.

3 Multi-model condition identification of thermal process data based on mixed-constraint semi-supervised clustering

3.1 Characteristics of thermal process data

Different from batch processes and random processes, thermal process data has its own characteristics, such as strong coupling, nonlinearity, and slow time variation. Reflected in the data, this means that many factors affect the data changes, the data are difficult to predict, the correlation between different variables is strong, and the data have obvious time series characteristics.

A thermal process is usually considered a transition from one steady state to a new steady state. Therefore, thermal process data can be divided into steady-state condition data and transition condition data. Steady-state data fluctuate within a small range around the steady-state value. Transition condition data appear disordered over very short periods, while the overall trend is to increase or decrease in one direction. Neither steady-state nor transition data exhibit instantaneous amplitude jumps.

According to the above characteristics of thermal process data, the key to dividing steady-state and transition conditions is to solve three problems: first, locating the central point of each working condition; second, screening the data with an obviously strong connection to the working condition central point; and third, determining the boundary data of adjacent working conditions.

3.2 The flowchart of the condition identification

In this paper, the semi-supervised clustering with mixed constraints is used to realize condition identification, as shown in Fig. 1.

Fig. 1 The flowchart and pseudocode of the condition identification using the semi-supervised clustering with mixed constraints

3.3 Initial seed set establishment

After data preprocessing, the dataset is whitened following the idea of the PCA algorithm to reduce linear correlation while keeping the data as faithful as possible. The goal is to minimize the reconstruction error between the original data and the preprocessed data,

$$ J = \frac{1}{N}\mathop \sum \limits_{n = 1}^{N} \left\| {x_{n} - \tilde{x}_{n} } \right\|^{2} $$
(3)
$$ \tilde{x}_{n} = \mathop \sum \limits_{i = 1}^{M} a_{ni} u_{i} + \mathop \sum \limits_{i = M + 1}^{D} b_{i} u_{i} $$
(4)

where \(x_{n}\) is the original data, \(\tilde{x}_{n}\) is the preprocessed data, \(N\) is the number of data points, \(M\) is the principal component dimension, and \(\left\{ {u_{i} } \right\}\) is a D-dimensional orthonormal basis.

The correlation coefficients between different dimensions are then kept as small as possible. The correlation coefficient \(\rho_{ij}\) is defined as follows:

$$ \rho_{ij} = \frac{1}{N}\mathop \sum \limits_{n = 1}^{N} \frac{{\left( {x_{ni} - \bar{x}_{i} } \right)}}{{\sigma_{i} }}\frac{{\left( {x_{nj} - \bar{x}_{j} } \right)}}{{\sigma_{j} }} $$
(5)

where \(\bar{x}_{i}\), \(\bar{x}_{j}\) and \(\sigma_{i}\), \(\sigma_{j}\) are the means and standard deviations of dimensions i and j, respectively.
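A minimal sketch of this whitening step, using scikit-learn's PCA with whiten=True on placeholder data (the 95% retained-variance threshold is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

# Whitening sketch: project onto the leading principal components
# and rescale to unit variance, reducing linear correlation.
X = np.random.randn(3000, 4)               # placeholder process data
pca = PCA(n_components=0.95, whiten=True)  # keep 95% variance (assumed)
X_white = pca.fit_transform(X)

# Eq. (5): pairwise correlation coefficients after whitening are ~0
rho = np.corrcoef(X_white, rowvar=False)
print(np.abs(rho - np.eye(rho.shape[0])).max())
```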

Next, the data are processed by the density-based clustering method [26], and the initial seed sets are established according to the distance from the data centre.

3.4 Clustering data screening based on distance \(D_{ij}\)

According to the characteristics of the thermal process data described in Sect. 3.1, three problems must be solved in dividing the steady-state and transition conditions: (1) locating the central point of each working condition, which is solved by the preliminary seed sets above; (2) screening the data with an obviously strong connection to the working condition central point, which is solved by the distance-based clustering data filtering below; and (3) determining the boundary data between adjacent working conditions, which is resolved by the boundary clarification based on mixed constraints in Sect. 3.5.

According to the thermal data characteristics, the distance \(D_{ij}\) is defined as a composite index comprising the sampling-time distance from the working condition centre point \(t_{dis}\), the absolute amplitude deviation from the centre point \(e_{{\text{u}}}\), and the absolute deviation between the data change velocity at the sampling time and that at the centre point \(e_{{\text{v}}}\); it is expressed as follows:

$$ \begin{aligned} D_{ij} & = \alpha_{1} t_{dis} + \alpha_{2} e_{{\text{u}}} + \alpha_{3} e_{{\text{v}}} \\ D_{ij} & = \alpha_{1} \left| {t_{ij} - t_{{c_{i} }} } \right| + \alpha_{2} \left| {x\left( {t_{ij} } \right) - x(t_{{c_{i} }} )} \right| + \alpha_{3} \left| {\partial \left( {t_{ij} } \right) - \partial (t_{{c_{i} }} )} \right| \\ \end{aligned} $$
(6)

where \(\alpha_{i}\) denotes the coefficient (intensity) of each distance component, \(t_{{c_{i} }}\) denotes the time label of the centre point of the ith condition, \(t_{ij}\) denotes the time label of the jth point of the ith condition, \(x(t_{{c_{i} }} )\) and \(x\left( {t_{ij} } \right)\) denote the amplitudes of the centre point and the jth point of the ith condition, and \(\partial (t_{{c_{i} }} )\) and \(\partial \left( {t_{ij} } \right)\) denote the change rates of the centre point and the jth point of the ith condition. Consider the time series \(X = \left( {x\left( {t_{1} } \right),x\left( {t_{2} } \right), \ldots ,x\left( {t_{n} } \right)} \right)\). Then \(\Delta \left( {t_{k} } \right) = x\left( {t_{k} } \right) - x\left( {t_{k - 1} } \right)\), \(k = 2,3, \ldots ,n\), denotes the data increment of X from \(t_{k - 1}\) to \(t_{k}\); \({\text{E}} = \frac{1}{n - 1}\sum\nolimits_{k = 2}^{n} {\left| {\Delta \left( {t_{k} } \right)} \right|}\) denotes the average absolute increment; and \(\partial \left( {t_{k} } \right) = \frac{1}{E}\left| {\Delta \left( {t_{k} } \right)} \right|\) is the normalized increment of X from \(t_{k - 1}\) to \(t_{k}\), which represents the dimensionless rate of change at time \(t_{k}\).
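The composite distance can be coded directly from Eq. (6) and the definitions above; the sketch below handles a one-dimensional series, with equal weights \(\alpha_{i}\) as a placeholder (function names are illustrative).

```python
import numpy as np

def change_rate(x):
    """Dimensionless rate of change: |increment| / mean |increment|
    (the first sample is padded with the first available rate)."""
    inc = np.abs(np.diff(x))
    rate = inc / inc.mean()
    return np.r_[rate[0], rate]           # pad so length matches x

def composite_distance(x, t, c, alpha=(1.0, 1.0, 1.0)):
    """Composite distance D_ij of Eq. (6) from the sample at index t
    to the condition centre at index c, for a 1-D series x.
    alpha are the component weights (equal here, an assumption)."""
    v = change_rate(x)
    return (alpha[0] * abs(t - c)          # time distance
            + alpha[1] * abs(x[t] - x[c])  # amplitude deviation
            + alpha[2] * abs(v[t] - v[c])) # change-rate deviation
```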

Therefore, screening the data with a strong relationship to the working condition centre point can be converted into clustering based on the distance \(D_{ij}\). The objective function of the clustering problem is given by:

$$ {\text{J}} = \mathop \sum \limits_{i = 1}^{N} \mathop \sum \limits_{j = 1}^{M} D_{ij}^{2} $$
(7)

The clustering based on distance \(D_{ij}\) can be used to divide the data with strong affiliation into datasets corresponding to different working conditions.

3.5 Clarification of class boundaries based on mixed constraints

The boundary data between adjacent working conditions are the data with a weak relationship to the centre point of a working condition. From the modeling perspective, such data may belong to two adjacent working conditions at the same time; specifically, the closer the data are to the boundary point, the more obvious their multi-condition attribute. However, jitter and jumping of the condition attribute must be avoided, and the standard K-means clustering algorithm cannot overcome this problem. For illustration, a set of typical thermal data, namely the bed temperature data of a fluidized bed boiler, is used. It includes 3000 sampling points with a sampling interval of 5 s. Standard K-means clustering is used to classify the normalized data by value alone. The results are shown in Fig. 2.

Fig. 2 The K-means clustering results

In Fig. 2, different colors represent different categories, and the red line represents the category assignment at each sampling time. Standard K-means clustering considers only the value, so the category line is irregular and jitters frequently; such a classification is of little significance.

In view of this, mixed constraints are designed to achieve clearer working condition boundaries and avoid the jitter problem. Exploiting the category continuity of adjacent points, the pairwise constraints are superimposed on the distance-based objective, giving:

$$ {\text{J}} = \mathop \sum \limits_{i = 1}^{N} \mathop \sum \limits_{j = 1}^{M} D_{ij}^{2} + \mathop \sum \limits_{{\begin{array}{*{20}c} {x_{i} ,x_{j} \in M} \\ {s.t.\; l_{i} \ne l_{j} } \\ \end{array} }} {\text{w}}_{ij} + \mathop \sum \limits_{{\begin{array}{*{20}c} {x_{i} ,x_{j} \in C} \\ {s.t.\; l_{i} = l_{j} } \\ \end{array} }} \overline{w}_{ij} $$
(8)

where \(M\) and \(C\) are the given Must-link and Cannot-link sets, respectively, and \({\text{w}}_{ij}\) and \(\overline{w}_{ij}\) are the penalty weights for violating the Must-link and Cannot-link constraint rules, respectively. In semi-supervised clustering with pairwise constraints, the constraint set requiring \(l_{i} = l_{j}\) is called the Must-link set, and the constraint set requiring \(l_{i} \ne l_{j}\) is called the Cannot-link set.

In Eq. 8, \(D_{ij}\) is the composite distance introduced above, composed of the data change characteristics and the time span from the category centre point; it makes the category attribution of the data clearer. The pairwise constraints reduce frequent jumps in the categories of boundary data.

Moreover, the boundary-area data may have multi-category attributes, which can be judged by the change rate of the time series data. For example, when the change rate of the sampling points within a time span is less than a fixed value, the data within that span can be considered approximately steady. If the time span lies exactly within the boundary interval, the data can carry multiple category labels and belong to both adjacent categories.
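A minimal sketch of evaluating the mixed-constraint cost of Eq. (8) for a candidate labeling is given below; here the time constraints are assumed to enter as Must-link pairs between adjacent samples, and scalar penalty weights are used for simplicity.

```python
def mixed_objective(D, labels, ml, cl, w, w_bar):
    """Mixed-constraint clustering cost of Eq. (8): composite-distance
    term plus penalties for violated Must-link / Cannot-link pairs.
    D[i, k] is the composite distance of sample i to centre k;
    w and w_bar are scalar penalty weights (an assumption)."""
    cost = sum(D[i, labels[i]] ** 2 for i in range(len(labels)))
    cost += sum(w for (i, j) in ml if labels[i] != labels[j])
    cost += sum(w_bar for (i, j) in cl if labels[i] == labels[j])
    return cost

# Time constraints as ML pairs between adjacent samples (assumed):
# ml = {(i, i + 1) for i in range(n_samples - 1)}
```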

3.6 Hyperparameter optimization

In the working condition identification process, there are three hyperparameters: the time span (time distance) of data with a strong relationship to the clustering centre point; the time span of multi-category labels in the boundary area; and the time span \(t_{s}\) used to calculate the change in the data increment of the time series. There are three common ways to determine hyperparameters: the manual method based on experience, the machine-assisted method, and the algorithm-based method. In this work, the machine-assisted method is selected; therefore, two elements must be specified: an optimization objective function and an optimization method.

For all partitioned working condition data, sub-models are built by polynomial fitting under the same conditions, and the accumulated error of the multi-model is taken as the objective function of the hyperparameter optimization,

$$ {\text{E}} = \mathop \sum \limits_{i = 1}^{m} \left[ {\frac{1}{2}\mathop \sum \limits_{j = 1}^{n} \left( {\hat{y}_{ij} \left( {x_{ij} ,\omega } \right) - y_{ij} } \right)^{2} } \right] $$
(9)

The conventional grid-search method is selected as the optimization method.
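A minimal grid-search sketch over the three time-span hyperparameters follows; fit_error is an assumed callback that segments the data with the given spans and returns the accumulated error E of Eq. (9), and the grid values are illustrative.

```python
import itertools

def grid_search(fit_error, grids):
    """Plain grid search: evaluate fit_error on every combination of
    hyperparameter values and keep the best one."""
    best, best_err = None, float("inf")
    for params in itertools.product(*grids):
        err = fit_error(params)
        if err < best_err:
            best, best_err = params, err
    return best, best_err

# Hypothetical spans (in samples) for the three hyperparameters
grids = [range(10, 60, 10), range(5, 30, 5), range(2, 12, 2)]
```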

4 Online condition recognizer based on RBFNN

RBFNN is considered one of the most promising neural network algorithms. Compared with other artificial neural networks, RBFNN is popular because of its simple structure, fast learning process, and good approximation ability [27, 28]. Many factors affect the performance of an RBFNN, such as the network structure, the hidden layer activation function, the connections between nodes, and the training method, and many researchers have made useful suggestions in these areas. Mosavi et al. analyzed the hidden layer structure of the RBF neural network and proposed an efficient training method that uses the Stochastic Fractal Search Algorithm (SFSA) to train the RBFNN [29]. By choosing a more reasonable radial basis function, the problem of slow RBFNN classification has been addressed [28, 30, 31].

The connections between nodes determine the behavior of a neural network. There are many connection schemes; the most common is full connection, in which all nodes in a layer are connected to all nodes in the next layer. Other schemes include sparse connection networks [32] and direct connections between input and output nodes [33]. If the radial basis function is chosen reasonably, the hidden layer nodes in full connection mode can filter the input data well.

Improving efficiency while ensuring quality is the main direction of training-method improvement. The growing RBFNN improves training efficiency by increasing the number of RBF units at each learning step [32]. K-fold cross validation over training and validation samples is an effective way to ensure training quality [33].

In summary, the design of an RBFNN needs to determine the number of nodes, the type of hidden layer activation function, the connection mode between nodes, and the weight adjustment method.

4.1 Structure design of RBF neural network

An RBFNN has three layers: the input layer, the hidden layer, and the output layer. The numbers of input and output nodes are easy to determine from the characteristic variables of the model. In this paper, the number of input nodes \(N_{input}\) is determined by the composite distance \(D_{ij}\), including the data value and change rate at the current sampling time and at the adjacent sampling times,

$$ N_{input} = N_{range} \times N_{sample} $$
(10)

where \(N_{range}\) is the range of adjacent data and \(N_{sample}\) is the number of reference information items at each sampling time.

The number of hidden layer nodes \(N_{hidden}\) is determined by the dimension of the input data \(N_{input}\) and the number of conditions \(N_{cond}\); the number of working conditions can be determined by a machine learning method [12],

$$ N_{hidden} = N_{input} \times N_{cond} $$
(11)

Considering the nonlinearity of the thermal process data, the radial basis function can describe the condition information (the parameter c of a radial basis function represents the condition centre, and the closer the input is to the centre, the greater the output value). Therefore, the standard RBF (SRBF), Cauchy RBF (CRBF), inverse multiquadric RBF, and generalized inverse multiquadric functions can all meet the basic requirements of the algorithm.
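The four candidate activations can be written compactly as functions of the distance \(r = \left\| {x - c} \right\|\) to the condition centre; the width parameter \(\sigma\) and exponent \(\beta\) below are illustrative.

```python
import numpy as np

# Candidate hidden-layer activations; r = ||x - c|| is the distance
# to the condition centre c. All peak at the centre and decay with
# distance, as the text requires.
def srbf(r, sigma):   # standard (Gaussian) RBF
    return np.exp(-r**2 / (2 * sigma**2))

def crbf(r, sigma):   # Cauchy RBF
    return 1.0 / (1.0 + r**2 / sigma**2)

def imq(r, sigma):    # inverse multiquadric RBF
    return 1.0 / np.sqrt(r**2 + sigma**2)

def gimq(r, sigma, beta=1.0):  # generalized inverse multiquadric
    return (r**2 + sigma**2) ** (-beta)
```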

4.2 Weight calculation of RBFNN

As explained above, the RBFNN structure comprises three layers, as shown in Fig. 3. The input of the ith hidden neuron is \({\text{s}}_{i}\),

$$ s_{i} = \left[ {x_{1} \omega_{1,i}^{h} ,x_{2} \omega_{2,i}^{h} ,x_{3} \omega_{3,i}^{h} , \ldots x_{n} \omega_{n,i}^{h} , \ldots x_{N} \omega_{N,i}^{h} } \right] $$
(12)

where \(n\) is the index of input; \(i\) is the index of hidden unit; \(x_{n}\) is the nth input and \(\omega_{n,i}^{h}\) is the input weight between nth input and ith hidden unit.

Fig. 3 Structure of RBFNN

The output \({\text{o}}_{j}\) of the jth output neuron is calculated as follows:

$$ {\text{o}}_{j} = \sum\limits_{p = 1}^{P} {{\varphi }_{p} \left( {s_{p} } \right)\omega_{p,j}^{o} + \omega_{0,j}^{o} } $$
(13)

Here \(j\) is the index of output, \(\omega_{p,j}^{o}\) is the output weight between the pth hidden neuron and output neuron \(j\), and \(\omega_{0,j}^{o}\) is the bias weight of the jth output neuron.

The weight adjustment values are obtained by the gradient descent method [34]:

$$ \Delta {\upomega }_{ij} = \eta_{1} \left( {y_{i}^{\left( k \right)} - f_{i} \left( {x^{\left( k \right)} } \right)} \right)\varphi_{j} \left( {x^{\left( k \right)} } \right) $$
(14)
$$ \Delta \mu_{j} = \eta_{2} \varphi_{j} \left( {x^{\left( k \right)} } \right)\frac{{x - \mu_{j} }}{{\sigma_{j}^{2} }}\mathop \sum \limits_{i = 1}^{m} \omega_{ij} \left( {y_{i}^{\left( k \right)} - f_{i} \left( {x^{\left( k \right)} } \right)} \right) $$
(15)
$$ \Delta \sigma_{j} = \eta_{3} \varphi_{j} \left( {x^{\left( k \right)} } \right)\frac{{\left\| {x - \mu_{j} } \right\|^{2} }}{{\sigma_{j}^{3} }}\mathop \sum \limits_{i = 1}^{m} \omega_{ij} \left( {y_{i}^{\left( k \right)} - f_{i} \left( {x^{\left( k \right)} } \right)} \right) $$
(16)

The online condition recognizer realizes the mapping from continuous data to discrete data. Growing the number of hidden layer nodes improves the nonlinear approximation ability of the network but also increases the amount of computation. Increasing the number of adjacent sampling times and the amount of data information per sampling time can improve the recognition rate of the network.
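A minimal sketch tying Eqs. (12)-(16) together, assuming Gaussian hidden units and plain stochastic gradient steps (centres and widths must be supplied as float arrays; all names are illustrative):

```python
import numpy as np

class RBFNet:
    """Minimal Gaussian-RBF network sketch for Eqs. (12)-(16)."""
    def __init__(self, centres, sigmas, n_out, lr=0.01):
        self.mu, self.sigma = centres, sigmas      # (P, N), (P,)
        self.W = np.zeros((n_out, len(centres)))   # output weights
        self.b = np.zeros(n_out)                   # output bias
        self.lr = lr

    def hidden(self, x):
        d2 = ((x - self.mu) ** 2).sum(axis=1)      # squared distances
        return np.exp(-d2 / (2 * self.sigma ** 2)) # phi_j(x)

    def forward(self, x):
        return self.W @ self.hidden(x) + self.b    # Eq. (13)

    def update(self, x, y):
        phi = self.hidden(x)
        e = y - self.forward(x)                    # output error
        self.W += self.lr * np.outer(e, phi)       # Eq. (14)
        self.b += self.lr * e
        # Eqs. (15)-(16): centre and width updates for each unit
        back = self.W.T @ e                        # sum_i w_ij * e_i
        d = x - self.mu
        self.mu += self.lr * (phi * back / self.sigma**2)[:, None] * d
        self.sigma += self.lr * phi * back * (d**2).sum(axis=1) / self.sigma**3
```

In use, a continuous sample is mapped to a discrete condition label by taking the argmax over the output neurons after each forward pass.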

5 Simulation analysis

5.1 Condition division

In this section, the typical thermal dataset used above is employed for simulation. The normalized data are shown as the scatter points in Fig. 2.

5.1.1 Comparison of the single time constraint with mixed constraints

The changing trend of the data indicates several working conditions. Clustering analysis is used to divide the working conditions, and the results of standard K-means clustering are shown in Fig. 2. The category labels of the data clearly show that, in the boundary areas between adjacent working conditions, the data category jumps repeatedly, which contradicts prior knowledge of the condition division. After adding a single time-tag constraint, the classification results shown in Fig. 4 are obtained.

Fig. 4 Semi-supervised clustering results with a single time constraint

In Fig. 4, semi-supervised clustering with a single time constraint shows superior performance compared to the standard K-means clustering in Fig. 2. However, when the time tag is around 400 and 2350, the category-jump phenomenon still appears, as seen in the change of category labels in the enlarged area.

The proposed method is then used to conduct semi-supervised clustering with mixed constraints, and the clustering results are shown in Fig. 5. Compared with the enlarged area in Fig. 4, the obtained category labels show that the repeated jumping of the data category has been weakened. In Fig. 5, near the time tags 400 and 2350, the data classifications of the two categories overlap, which verifies the previous analysis of the thermal process data characteristics.

Fig. 5 Semi-supervised clustering with mixed constraints

5.1.2 Comparison of different number of initial seed sets

In addition to the hyperparameters mentioned in Sect. 3.6, the information of the initial seed sets, namely their location and number, also affects the segmentation results. If the number of initial seed sets is fixed, their locations can be determined by the density-based clustering method. The influence of the number of initial seed sets on the segmentation results is shown in Figs. 6 and 7.

Fig. 6 Semi-supervised clustering with mixed constraints and four initial seed sets

Fig. 7 Semi-supervised clustering with mixed constraints and six initial seed sets

Comparing Figs. 5, 6 and 7, setting a smaller number of initial seed sets makes the segmentation of the original data clearer, but the segmentation effect is evaluated by polynomial model fitting on the segmented data. On this basis, the data of each working condition are taken as the output, four groups of data (bed pressure, primary air flow, secondary air flow, and fuel flow) are selected as the inputs, and polynomial model fitting is performed. The fitting results on the segmented data are shown in Table 2; the segmentation with five initial seed sets is better than that with four or six.

Table 2 Comparison results of different number of initial seeds

5.1.3 Comparison of semi-supervised clustering with mixed constraints against Sliding Windows, Bottom-Up, and Top-Down

The data of each working condition are taken as the output, four groups of data (bed pressure, primary air flow, secondary air flow, and fuel flow) are selected as the inputs, and polynomial model fitting is performed. The comparison between the clustering results obtained using mixed constraints, Sliding Windows, Bottom-Up, and Top-Down is shown in Fig. 8, and the specific comparative data are given in Table 3. The fitting score is computed as

$$ R^{2} = \left( {1 - \frac{u}{v}} \right), $$
$$ u = \mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \hat{y}_{i} } \right)^{2} , v = \mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} y_{i} } \right)^{2} $$
(17)
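For reference, Eq. (17) is the standard coefficient of determination; the snippet below computes it directly and, as a cross-check, via scikit-learn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # placeholder targets
y_pred = np.array([1.1, 1.9, 3.2, 3.9])   # placeholder fit
# Eq. (17) computed directly and via scikit-learn; both agree
u = ((y_true - y_pred) ** 2).sum()
v = ((y_true - y_true.mean()) ** 2).sum()
print(1 - u / v, r2_score(y_true, y_pred))
```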
Fig. 8 Polynomial fitting on different segmentation methods

Table 3 Comparison results of different segmentation methods

In Table 3, the fitting score is calculated by formula 17. The different methods give different segmentation results: Sliding Windows, Bottom-Up, and Top-Down achieve good results in the linear fitting of local sub-models, but their results deviate widely across segments, and the overall effect is not as good as that of the method in this paper. The polynomial fitting results in Table 3 show that semi-supervised clustering with mixed constraints can achieve condition identification and that, compared with Sliding Windows, Bottom-Up, and Top-Down, the mixed constraints divide the working conditions better.

5.2 Online condition recognizer

The online condition recognizer realizes the mapping from continuous data to discrete data. By increasing the number of hidden layer nodes, the network can perform perfectly on the training data, but it then has poor generalization ability, as shown in Fig. 9.

Fig. 9 RBFNN performance before optimizing the weight coefficients \(\alpha_{i}\)

Here, the weight coefficients \(\alpha_{i}\) of the composite distance \(D_{ij}\) are equal. Increasing the amount of input data information, such as \(N_{range}\) and \(N_{sample}\), can improve the generalization ability, but the amount of computation increases significantly. Alternatively, using PSO (particle swarm optimization) to optimize the weight coefficients \(\alpha_{i}\) of the composite distance \(D_{ij}\) can also improve the network generalization ability without increasing the amount of calculation, as shown in Fig. 10. The optimization process of the weight coefficients \(\alpha_{i}\) is shown in Fig. 11. Other parameters, such as \(N_{range}\), \(N_{sample}\), and the characteristic parameters of the hidden layer radial basis functions, can be further optimized by PSO.

Fig. 10 RBFNN performance after optimizing the weight coefficients \(\alpha_{i}\)

Fig. 11 Error rate curve

On the test dataset, the error rate is lowest at \(\alpha_{1} = 0.96,\alpha_{2} = 3.60,\alpha_{3} = 1\). This ratio of the weight coefficients shows that increasing the proportion of the numerical component of the input data while weakening the proportion of the sequence information improves the generalization ability of the network; conversely, enhancing the proportion of sequence information improves the ability of condition identification.

6 Conclusion

The multi-model approach has proven very effective in describing complex processes. The overall model precision is affected by the sub-model form and the division of the multiple model sub-windows. Once the time span of a sub-window is established, the modeling process within the sub-window is the same as that of a single model. Therefore, the division of sub-windows greatly affects the overall precision of the multi-model method.

In this paper, using machine learning and the characteristics of thermal process data, semi-supervised clustering with mixed constraints is used to realize condition segmentation and to sharpen the division of the sub-window time spans. An online condition recognizer is designed, and its generalization ability is improved by optimizing the weight coefficients of the input information. The simulation results show that the proposed method is feasible for dividing the sub-windows and that the overall error of the established sub-models is reduced.

One limitation of this study is that the presented method is based on historical data and cannot perform segmentation online; although an online recognizer is designed, the method is still offline in essence, which limits its application. Improving it into an online method is the subject of future research.