Introduction

Hot rolling is an important production line in the iron and steel industry; its task is to roll thick slabs into thin coils. In practical production, the sequence of slabs to be rolled is first determined by the production scheduler; each slab in the production schedule is then reheated to a certain temperature in a walking-beam heating furnace and subsequently rolled into thin coils on the hot rolling line. Before each slab enters the heating furnace, the operator in the production control room needs to check the slab's number via remote video to ensure that production follows the slab sequence in the hot rolling schedule. The slab numbers recognition process in hot rolling is illustrated in Fig. 1. Since production runs continuously, 24 h a day, the traditional recognition process is labor-intensive, inefficient, and prone to recognition errors, which in turn seriously affect the normal operation of production [1].

Fig. 1 Illustration of the slab numbers recognition process in hot rolling production

Recently, computer vision techniques based on deep learning have achieved remarkable results in many industrial fields [2,3,4]. However, the harsh environmental conditions of industrial production lines mean that the collected raw data are usually of very poor quality. Such data typically exhibit blurring, mutilation, and low differentiation between foreground and background, which makes slab numbers recognition more challenging. Slab numbers recognition is essentially a text recognition problem. The mainstream approaches in the existing literature for such problems are based on deep neural networks and can be divided into two categories: character recognition and sequence recognition.

The former first recognizes individual characters with a recognition model and then assembles the results into a sequence according to their positional relationships. Its advantage is flexibility: text of any length can be recognized. Its disadvantage is that it requires a character segmentation step and is prone to cumulative errors. Character recognition models are those widely used for image classification, including various CNN-based models such as GoogLeNet [5], VGG [6], and ResNet [7], and various Transformer-based models such as Vit [8], DeiT [9], and Swin [10]. These models achieve good performance on specific benchmark datasets but still have limitations. Specifically, CNN-based models can only achieve a local receptive field, because the receptive field is bounded by the size of the convolutional kernels and pooling layers, while expanding it by stacking ever more convolutional layers increases both model complexity and training difficulty. Transformer-based models can achieve a global receptive field; however, they impose strict requirements on the input size, which severely limits their application scenarios. In particular, Vit performs only a single scaling operation on the input image, resulting in poor feature richness, and DeiT performs inconsistently on targets of different scales.

Character segmentation is not necessary in sequence recognition, which greatly simplifies the recognition process. Combining a CNN and an RNN to accomplish this task has long been the leading approach in this field; representative algorithms include RARE [11], CRNN [12], and FOTS [13]. Specifically, RARE is better suited to curved text but offers no significant advantage on the industrial slab dataset. CRNN is a classical CNN-RNN sequence recognition algorithm that achieves good results when fully trained on IIIT5k and IC13. However, these are all CNN-based methods, so they inevitably inherit the limitations of CNNs. In addition, several representative algorithms that have emerged in recent years, such as DAN [14], SRN [15], and ABINet [16], should be mentioned. Among them, DAN builds a decoupled attention module to recognize text by jointly using feature maps and attention maps, while SRN and ABINet assist recognition by building semantic analysis modules that mine the semantic information in images.

Different from the above models, the recognition model proposed in this paper is neither a pure CNN nor a pure Transformer model: it combines the powerful low-level feature extraction capability of CNNs with the global receptive field of Transformers. Specifically, the recognition model is designed from a multi-scale perspective to improve feature richness, which is advantageous for datasets of ordinary scale. Moreover, the proposed model places no strict requirement on the input image size, i.e., the input can be a slab number image with any aspect ratio, so the model is more flexible and suits more application scenarios.

In addition to the algorithms mentioned above, some works have investigated the recognition of slab numbers specifically. Lee et al. [17] proposed several CNN models of different depths for the slab numbers recognition problem and later developed a fully convolutional network (FCN) with deconvolution layers [18]. Lee et al. [19] focused on the construction of deep learning datasets and built a CNN-based recognition model named GDT-FCN to recognize slab numbers. All of these algorithms rely on conventional CNN operations; in addition to the typical shortcomings mentioned above, they are less robust and their performance is unsatisfactory. Moreover, since deep learning is data-driven, its effectiveness on a given computer vision task depends heavily on the support of high-quality datasets.

To address the limitations of existing algorithms and achieve online, precise recognition of industrial slab numbers, a two-stage recognition algorithm with data quality improvement is proposed in this paper. In the first stage, HybridCy is proposed to alleviate the unstable performance of traditional CycleGAN. By effectively improving the quality of the slab number data without paired data, the accuracy of the downstream slab number recognition task can be significantly improved. In the second stage, the multi-scale hybrid Vit (MSHy-Vit) model is proposed to alleviate Vit's over-dependence on data volume during training, enabling accurate recognition of slab numbers without significantly increasing model complexity.

The rest of this paper is organized as follows. The next section presents the background and related works. Then the details of the proposed algorithm are presented. To evaluate the performance of the proposed algorithms, the subsequent section presents the experimental design, and the penultimate section gives the results and analysis of the experiments. Finally, conclusions and future work are presented.

Background

CycleGAN in computer vision

Generative adversarial networks (GANs) [20] are trained by adversarial learning to minimize a distance (Jensen-Shannon [21], Wasserstein [22], etc.) between the distributions of real and generated samples, so that generated samples become indistinguishable from real ones. Since their inception in 2014, plenty of research has been devoted to improving the associated theories and architectures, leading to wide adoption of GANs across a variety of vision tasks, including image translation [23,24,25], image deraining [26], super-resolution reconstruction [27, 28], and various other low-level computer vision fields.

As a variant of GAN, CycleGAN [29] introduces an additional pair of generator and discriminator on the basis of GAN and, by introducing a cyclic consistency (cyc) loss for the first time, creatively implements image migration between different domains on unpaired datasets. The framework is illustrated in Fig. 2.

Fig. 2 Framework of CycleGAN

Specifically, CycleGAN contains two generators G, F and two discriminators \(D_X\), \(D_Y\), where the generator of the forward mapping is defined as \(G: X \rightarrow Y\). The forward mapping is used to align the image distributions of the source domain X and the target domain Y. That is, the generator G transfers a source image x to \(\hat{y}\) so that \(\hat{y}\) looks as if it belongs to domain Y, while the discriminator \(D_Y\) tries its best to identify whether \(\hat{y}\) really comes from Y. A similar training process is performed for the inverse mapping \(F: Y \rightarrow X\). In conjunction with the cyc loss, i.e., minimizing the distance between F(G(x)) and x (here, the \(L_1\) distance), migration between the source and target domains is finally achieved. The role of the cyc loss is to enable the image \(\hat{y}\) generated by G to be reconstructed back to x and to prevent over-migration, which is the fundamental reason why CycleGAN can dispense with paired data.
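As a minimal illustration, the forward cycle and its cyc loss can be sketched in a few lines of PyTorch; here `G` and `F` are assumed to be the two generator networks (`nn.Module` instances), and the sketch is illustrative rather than exact training code:

```python
import torch.nn.functional as nnf

def forward_cycle(G, F, x):
    """One forward cycle X -> Y -> X and its L1 cyclic consistency loss."""
    y_hat = G(x)                       # G: X -> Y, translated image
    x_rec = F(y_hat)                   # F: Y -> X, reconstruction of x
    cyc_loss = nnf.l1_loss(x_rec, x)   # ||F(G(x)) - x||_1
    return y_hat, cyc_loss
```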

Transformers in computer vision

Transformer is a network model originally developed for natural language processing (NLP) that eschews traditional CNNs and RNNs and captures correlations between distant words through a multi-head attention mechanism and stacked feedforward MLP layers. In 2020, the Transformer was introduced into computer vision with the vision transformer (Vit) [8], which converts images into tokens through a patch operation, enabling image processing analogous to that in NLP [30]. Precisely, the images are first sliced into patches, flattened, and then linearly projected. After position and classification information is encoded, the patches are fed into the Transformer encoder blocks, each of which includes a LayerNorm layer, multi-head attention, an MLP layer, and dropout. After this series of operations, the classification result is output through the MLP head.
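The patch-and-project step can be summarized with the following PyTorch sketch; the image and patch sizes here are illustrative assumptions, not the settings used later in this paper:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Slice an image into non-overlapping patches, project them linearly,
    and prepend a classification token plus position encodings."""
    def __init__(self, img_size=32, patch_size=4, in_ch=3, dim=192):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        n_patches = (img_size // patch_size) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls.expand(x.shape[0], -1, -1)          # (B, 1, dim)
        return torch.cat([cls, tokens], dim=1) + self.pos  # ready for the encoder
```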

The fundamental reason why Vit has gradually joined CNNs as a dominant framework in computer vision is that the self-attention mechanism in Transformers provides several key advantages. First, self-attention can capture long-range dependencies. Second, it dynamically computes the weights within the model through adaptive modeling, thus capturing the relationships between tokens. Finally, self-attention provides explicit insight into the key areas the model focuses on. While Vit performs promisingly in the areas mentioned above, it also has limitations that cannot be ignored. Specifically, Vit requires more data than CNNs, although it gains significant advantages when trained on large amounts of data. To address this drawback, Vit can be improved by, for example, (1) introducing distillation mechanisms (e.g., DeiT) or (2) training the class token by self-supervision (e.g., LV-ViT [31]).

In addition, Xiao et al. [32] showed that ‘early convolutions help Transformers see better’. In particular, a convolutional stem can greatly improve the optimization stability of the Transformer at similar FLOPs and runtime. A hybrid model enjoys both the powerful low-level feature extraction capability of CNNs and the global receptive field of Transformers, so its performance is often better than that of either model alone. With these advantages, several representative hybrid models have been proposed, including ViTAE [33], CoAtNet [34], and DVT [35]. Among them, ViTAE builds multiple pyramid modules to downsample images and embed them into multi-scale tokens, which to some extent alleviates the induction bias introduced when converting images into 1D sequences, while CoAtNet focuses on how best to combine the two frameworks. DVT is a dynamic Vit model that adaptively characterizes each sample with the most appropriate number of tokens.

In this paper, the industrial slab dataset is not a truly large-scale dataset, which may limit the performance of the model to some extent. Therefore, measures must be taken to expand the features available to the model. To this end, a multi-scale patch operation is used and a multi-scale fusion module, A-MSFF, is designed, which effectively fuses multi-scale features without significantly increasing model complexity. In addition, an I2T module is designed to improve feature quality in the initial stage.

Proposed algorithm

Overall framework of the two-stage algorithm

The schematic flowchart of the proposed two-stage slab numbers recognition algorithm is presented in Fig. 3.

Fig. 3 Schematic flowchart of the proposed two-stage slab numbers recognition algorithm

As shown in Fig. 3, the proposed method consists of a two-stage deep network that takes an industrial slab image as input and outputs the recognition result of the slab number. Unlike existing algorithms for such industrial problems, a deep learning-based image reconstruction algorithm is designed to reconstruct the low-quality data before the image classification task, which in turn improves the performance of the data-driven deep learning algorithm. The two stages can be described as follows.

1. In the first stage, the HybridCy model is established by redesigning CycleGAN with a Transformer, aiming to improve the quality of industrial data collected from hot rolling production. Specifically, the traditional CNN-based CycleGAN is improved and optimized, in the absence of paired data, by exploiting the Transformer's powerful global receptive field. By building a pyramid generator combined with the PixelShuffle module, the resolution of the images is increased gradually, which improves the stability of the model while reducing memory overflow and alleviating the training instability of the traditional CycleGAN.

2. In the second stage, MSHy-Vit, a hybrid model with multi-scale fusion, is designed to address the limitations of the single-scale patch operation and to enrich the semantic information in traditional Vit. Unlike traditional Vit, which performs the patch operation directly on images, MSHy-Vit constructs a CNN-based image2tokens (I2T) module to guarantee effective feature extraction in the initial stage of the model. On this basis, the slab numbers recognition algorithm is built, and its reliability in practical applications is further enhanced by the manufacturing execution system (MES).

Stage I: HybridCy for unpaired data improvement

In traditional CycleGAN, the task of the generator is to receive the feature information of the source-domain image and generate a new image that can fool the discriminator, while the task of the discriminator is to judge the authenticity of generated images. Given the inherent instability and training difficulty of CycleGAN, and considering the complementary advantages of CNNs and Transformers, we propose a hybrid model, HybridCy, to improve data quality. Its framework is shown in Fig. 4. It is worth noting that the two pairs of generators and discriminators in HybridCy are identical in structure, so only one pair is presented here.

(1) Generator

Regarding the input to the Transformer, inputting pixel by pixel as in traditional NLP, or feeding patch images one by one as in Vit, would incur an explosive computational cost. Instead, our algorithm uses high-dimensional image features as input, which provides more guidance for reconstruction while substantially reducing computational cost. Our experiments show that input features extracted by a shallow CNN tend to produce better results than alternatives such as multilayer perceptrons.

Specifically, we construct a feature extraction module based on a shallow CNN consisting of four convolutional layers, one pooling layer, and one linear layer. To avoid vanishing gradients, each convolutional layer is followed by batch normalization and LeakyReLU. The specific structure of the CLS-cnn module is given in Table 1, where k, N, s, and p denote the kernel size, the number of convolution kernels, the stride, and the padding, respectively.

Table 1 Detailed configurations of feature extraction module CLS-cnn

As shown in Fig. 4, the feature map obtained from the last convolutional layer is flattened into a one-dimensional vector of a specified length (1024 by default in our algorithm). The linear layer then maps it to a vector of length \(H_0\times W_0\times C\) (by default, \(H_0= W_0=8\), \(C=384\)), which is reshaped into an (\(H_0\times W_0)\times C\) feature map in which each point is a C-dimensional embedding. After the position parameters are encoded, it is sent into the Transformer blocks as a token sequence of length 64 and dimension C.
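The feature-to-token step can be sketched as follows, using the default sizes stated above (a 1024-dimensional flattened CNN feature mapped to 64 tokens of dimension C = 384); the shallow CLS-cnn extractor itself is stubbed out with a random feature:

```python
import torch
import torch.nn as nn

H0, W0, C = 8, 8, 384
linear = nn.Linear(1024, H0 * W0 * C)
pos = nn.Parameter(torch.zeros(1, H0 * W0, C))   # learnable position encoding

feat = torch.randn(1, 1024)                # stand-in for the CLS-cnn output
tokens = linear(feat).view(1, H0 * W0, C)  # (1, 64, 384): 64 tokens of dim C
tokens = tokens + pos                      # input to the Transformer blocks
```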

Given that our strategy is to keep the computational footprint friendly and reduce memory overflow by constructing a pyramid generator, we increase the spatial size and reduce the embedding dimension stage by stage. The generator consists of three main stages of different depths, i.e., stacked with different numbers of Transformer blocks (here, 5, 4, and 2 in order). The specific design of the Transformer block is shown in Fig. 4 and includes layer normalization, multi-head attention, and an MLP. To reach the same size \(32\times 32\times 3\) as the images in the target domain, an upsampling operation is performed after the Transformer blocks of the first two stages.

Fig. 4 Framework of HybridCy

Unlike operations that directly increase the resolution of feature maps through resampling or interpolation, the pixel shuffle operation [36] used here adopts a subpixel convolution operator instead of an interpolation filter. Its implementation flow is shown in Fig. 5, from which it can be seen that subpixel convolution is essentially a convolution operation. The difference is that the stride of the subpixel convolution is 1/r, which is less than 1, so the width and height of the convolved feature map do not shrink but instead grow by a factor of r. Mathematically, it can be described as follows.

$$\begin{aligned} \left( C\times r^{2},H,W\right) \rightarrow \left( C,H\times r,W\times r\right) , \end{aligned}$$
(1)

where C, H, and W are the number of channels, the height, and the width of the feature maps, respectively, and r is the scale factor. Since the image size doubles at each stage in our implementation, r is set to 2.
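In PyTorch, this rearrangement is available directly as `nn.PixelShuffle`; a small shape check with r = 2 follows (the channel counts are chosen only for illustration):

```python
import torch
import torch.nn as nn

up = nn.PixelShuffle(upscale_factor=2)   # r = 2, as in our implementation
x = torch.randn(1, 384, 8, 8)            # (B, C*r^2, H, W) with C = 96
y = up(x)
print(y.shape)                           # torch.Size([1, 96, 16, 16])
```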

Fig. 5 Implementation flow of pixel shuffle

At the end of the generator, a \(1\times 1\) convolution with stride 1 and no padding converts the features obtained above into the high-quality reconstructed image.

(2) Discriminator

Since the discriminator only needs to judge image authenticity and does not need to operate at the pixel level, the CNN-based discriminator of CycleGAN is retained. As shown in Fig. 4, the discriminator contains five convolutional layers. To avoid vanishing gradients, LeakyReLU is applied after each convolution, and batch normalization is performed after the 2nd, 3rd, and 4th convolutions; further parameter settings are given in Table 2.
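A sketch of such a discriminator is given below; the kernel sizes and channel widths here are illustrative assumptions (the actual values are those listed in Table 2):

```python
import torch.nn as nn

def conv_block(cin, cout, bn):
    layers = [nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1)]
    if bn:
        layers.append(nn.BatchNorm2d(cout))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

discriminator = nn.Sequential(
    *conv_block(3, 64, bn=False),    # 1st convolution: no batch norm
    *conv_block(64, 128, bn=True),   # 2nd
    *conv_block(128, 256, bn=True),  # 3rd
    *conv_block(256, 512, bn=True),  # 4th
    nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # 5th: realness map
)
```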

Table 2 Detailed configurations of discriminator
(3) Loss functions

In the traditional CycleGAN, the two generators G, F and the two discriminators \(D_{X}\), \(D_{Y}\) are optimized by alternating adversarial training. Mathematically, the mechanism of CycleGAN can be described by Eq. (2): the generators' loss is minimized while the discriminators' loss is maximized, so that a good dynamic balance between them can be reached.

$$\begin{aligned} \min _{G, F} \max _{D_{Y}, D_{X}} \mathcal {L}\left( G, D_{Y}\right) +\mathcal {L}\left( F, D_{X}\right) . \end{aligned}$$
(2)

More specifically, three loss functions are designed in CycleGAN, i.e., the adversarial loss (GAN loss), the cyclic consistency loss (Cyc loss), and the identity loss (Idt loss). Their practical meanings and calculations are as follows.

GAN loss: the GAN loss drives the distribution of the generated images to be as similar as possible to that of the target domain and can be defined by:

$$\begin{aligned} \mathcal {L}_{G A N}\left( G, D_{Y}, X, Y\right)= & {} E_{y \sim P_{\text{ data } }(y)}\left[ \log D_{Y}(y)\right] \nonumber \\{} & {} +E_{x \sim P_{\text{ data } }(x)}\left[ \log \left( 1-D_{Y} \left( G(x)\right) \right) \right] ,\nonumber \\ \end{aligned}$$
(3)

where x, y are the low quality images in the source domain X and the high-quality images in the target domain Y, respectively, \(P_{\text{ data } }(x)\) and \(P_{\text{ data } }(y)\) represent the corresponding distributions of x and y, and \(E(*)\) is the expected value of the distribution function.

Cyc loss: the Cyc loss ensures that the reconstructed images retain characteristics of the original images and can be mapped back to them, which is the key to CycleGAN's independence from paired data. It is defined by:

$$\begin{aligned} \mathcal {L}_{c y c} \left( G, F\right)= & {} E_{x \sim P_{\text{ data } }(x)}\left[ \left\| F\left( G(x)\right) -x\right\| _{1}\right] \nonumber \\{} & {} +E_{y \sim P_{\text{ data } }(y)}\left[ \left\| G\left( F(y)\right) -y\right\| _{1}\right] , \end{aligned}$$
(4)

where \(\Vert * \Vert _1\) denotes the calculation of the \(L_1\) norm.

Idt loss: the Idt loss encourages the generators to act as identity mappings on images that already belong to their target domain, i.e., G(y) should stay close to y and F(x) close to x:

$$\begin{aligned} \mathcal {L}_{\text{ idt }} (G, F)= & {} E_{y \sim P_{\text{ data } }(y)}\left[ \Vert G(y)-y\Vert _{1}\right] \nonumber \\{} & {} +E_{x \sim P_{\text{ data } }(x)}\left[ \Vert F(x)-x\Vert _{1}\right] . \end{aligned}$$
(5)

Total loss: The final loss function is expressed as the sum of all the above losses:

$$\begin{aligned} \mathcal {L}\left( G, F, D_{X}, D_{Y}\right)= & {} \mathcal {L}_{G A N}\left( G, D_{Y}, X, Y\right) \nonumber \\{} & {} +\mathcal {L}_{G A N}\left( F, D_{X}, Y, X\right) \nonumber \\{} & {} +\lambda \mathcal {L}_{c y c}\left( G, F\right) \nonumber \\{} & {} +\mathcal {L}_{\text{ idt }}, \end{aligned}$$
(6)

where the hyperparameter \(\lambda \) weights the cyclic consistency term against the adversarial terms.
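For reference, Eqs. (3)–(6) can be assembled as follows. This is a simplified sketch: in practice the generator and discriminator terms are optimized alternately with opposite signs, the discriminators are assumed here to end in a sigmoid, and the weights follow the settings reported later in the experimental configuration:

```python
import torch
import torch.nn.functional as nnf

def cyclegan_objective(G, F, D_X, D_Y, x, y, lam=10.0, lam_idt=0.7):
    # Adversarial terms, Eq. (3), for both mapping directions
    gan_G = torch.log(D_Y(y)).mean() + torch.log(1 - D_Y(G(x))).mean()
    gan_F = torch.log(D_X(x)).mean() + torch.log(1 - D_X(F(y))).mean()
    # Cyclic consistency, Eq. (4)
    cyc = nnf.l1_loss(F(G(x)), x) + nnf.l1_loss(G(F(y)), y)
    # Identity term, Eq. (5)
    idt = nnf.l1_loss(G(y), y) + nnf.l1_loss(F(x), x)
    # Total objective, Eq. (6)
    return gan_G + gan_F + lam * cyc + lam_idt * idt
```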

Stage II: MSHy-Vit for slab number recognition

As mentioned earlier, Vit converts images into tokens by performing a fixed-size patch operation on the image, similar to the tokenization employed in natural language processing. However, traditional Vit performs only a single fixed-size patch operation, while in practice image dimensions vary considerably across datasets and aspect ratios are rarely uniform. A fixed-size patch operation is therefore not beneficial for all computer vision tasks.

To address the limitations of the single-scale patch operation and enrich the semantic information, we designed character-based and sequence-based multi-scale fusion recognition algorithms according to the characteristics of the slab dataset. Given that the two designs are basically the same, i.e., both are built on a hybrid Vit model with multi-scale fusion and differ only in some parameter settings, only the former is described in detail here.

Fig. 6 Framework of MSHy-Vit

(1) Framework of MSHy-Vit

The framework of the MSHy-Vit algorithm for character recognition of slab numbers is illustrated in Fig. 6; it consists of three branches at three different scales. Each branch includes the Image2Tokens (I2T) module, NAM, linear projection, a Transformer encoder, cross-attention, and an MLP head.

As in traditional Vit, the image features obtained from the patch operation are encoded with position information and a classification token and then sent to the Transformer blocks. However, our algorithm improves on traditional Vit in two main respects.

On the one hand, our extensive experiments show that patching the input image directly is not satisfactory, which may be caused by Vit's inherently weak low-level feature extraction. Therefore, the I2T module is designed to exploit the powerful feature extraction capability of CNNs; its schematic diagram is shown in Fig. 7. More specifically, the first part of the second residual block of ResNet18 is selected as the backbone of the I2T module. With the help of NAM [37], a simple yet effective attention module, the features extracted by I2T are further refined (given the computational cost, NAM is not employed in the sequence recognition algorithm). Subsequently, the patch operation is performed at the level of the extracted features. We note that although ResNet18 is chosen as the backbone in this paper, other classical CNN architectures, such as VGG and DenseNet, could in principle replace it.
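A sketch of such an I2T front end built from torchvision's ResNet18 is shown below; the exact cut point ('the first part of the second residual block') is our reading of the description, and the NAM module is omitted for brevity:

```python
import torch.nn as nn
from torchvision.models import resnet18

class I2T(nn.Module):
    """CNN front end producing the feature map on which patching is performed."""
    def __init__(self):
        super().__init__()
        r = resnet18(weights=None)
        # stem + layer1 + the first block of layer2 (assumed cut point)
        self.features = nn.Sequential(
            r.conv1, r.bn1, r.relu, r.maxpool, r.layer1, r.layer2[0]
        )

    def forward(self, x):
        return self.features(x)
```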

Fig. 7 Framework of I2T module

On the other hand, after the operations within each branch are completed, a feasible feature fusion mechanism is needed to fuse the multi-scale information. Inspired by the cross-attention [38] proposed by IBM, we develop a purely attention-based multi-scale feature fusion strategy (A-MSFF), which is simple yet effective. The feature fusion for branch L through A-MSFF is illustrated in Fig. 8.

Fig. 8 The feature fusion process for branch L among other branches through A-MSFF

For models with three different scales (denoted by Branches L, M, and S, respectively), the implementation of feature fusion can be specifically described as follows:

The CLS token of branch L, \(\textbf{x}_c^{L}\), is first dimensionally aligned through the projection function \(f^{L}(\cdot )\), then fused, via concatenation, with the patch tokens of branch M, \(\textbf{x}_p^{M}\), yielding \(\textbf{x}_{c}^{\prime \prime L}\). A similar operation performs the second fusion, i.e., the fusion of \(\textbf{x}_{c}^{\prime \prime L}\) with \(\textbf{x}_p^{S}\). Since the CLS token of each branch has already learned the global abstract information of its own branch, interacting with the patch tokens of other branches helps it absorb information at different scales. Finally, \(\textbf{x}_{c}^{\prime \prime \prime L}\) interacts with its own patch tokens \(\textbf{x}_p^{L}\) again in the next Transformer block, where the information learned from the other two branches is passed to the patch tokens, thus enriching their feature information. The other two branches complete their feature fusion in the same manner.

Taking the fusion of branch L with the other two branches as an example, before fusion the feature information of branch L can be expressed as:

$$\begin{aligned} \begin{aligned} \textbf{x}^{L}=\left[ \textbf{x}_{c}^{L} \Vert \textbf{x}_{\text{ p }}^{L}\right] , \end{aligned} \end{aligned}$$
(7)

where \(\textbf{x}_{c}^{L}\) and \(\textbf{x}_{\text{ p }}^{L}\) are the classification token and the image patch tokens of branch L, respectively. After dimension alignment,

$$\begin{aligned} \begin{aligned} \textbf{x}^{\prime L}=\left[ f^{L}\left( \textbf{x}_{c}^{L}\right) \Vert \textbf{x}_{\text{ p }}^{L}\right] . \end{aligned} \end{aligned}$$
(8)

The aligned CLS token is then fused with the patch tokens by the cross-attention mechanism (CA), which can be expressed mathematically as follows:

$$\begin{aligned}{} & {} \begin{aligned} \textbf{q}&=\textbf{x}_{c}^{\prime L} \textbf{W}_{q}, \quad \textbf{k}=\textbf{x}^{\prime L} \textbf{W}_{k}, \quad \textbf{v}=\textbf{x}^{\prime L} \textbf{W}_{v} \end{aligned} \end{aligned}$$
(9)
$$\begin{aligned}{} & {} \begin{aligned} \textbf{A}&={\text {Softmax}}\left( \textbf{q} \textbf{k}^{T} / \sqrt{C / h}\right) \end{aligned} \end{aligned}$$
(10)
$$\begin{aligned}{} & {} \begin{aligned} \textrm{CA}\left( \textbf{x}^{\prime L}\right)&=\textbf{A} \textbf{v}, \end{aligned} \end{aligned}$$
(11)

where \(\textbf{W}_{q}, \textbf{W}_{k}, \textbf{W}_{v} \in \mathbb {R}^{C \times (C / h)}\) are learnable projection matrices, and C and h denote the embedding dimension and the number of heads, respectively. Branch L then continues the fusion with branch S in the same way. Finally, the classification token \(\textbf{x}_{c}^{\prime \prime \prime L}\) and the total token sequence \(\textbf{x}^{L}\) of branch L can be written as:

$$\begin{aligned}{} & {} \begin{aligned} \textbf{x}_{c}^{\prime \prime \prime L}&=y^{L}(CA(g^{L}(CA(f^{L}({x}_{c}^{L}))))) \end{aligned} \end{aligned}$$
(12)
$$\begin{aligned}{} & {} \begin{aligned} \textbf{x}^{L}&=\left[ \textbf{x}_{c}^{\prime \prime \prime L} \Vert \textbf{x}_{\text{ p }}^{L}\right] . \end{aligned} \end{aligned}$$
(13)

The above describes the feature fusion process for branch L; the same procedure is performed for the other two branches. After several information interactions between branches, each branch obtains its own CLS token and patch tokens. Since the task is classification, the CLS tokens of all branches are extracted and concatenated for the final classification. The remaining parts of the Transformer are the same as in traditional Vit and, having already been described for the data quality improvement model HybridCy, are not repeated here.
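For clarity, Eqs. (9)–(11) correspond to the following single-head sketch, in which the (dimension-aligned) CLS token of one branch queries the token sequence of another branch; multi-head splitting and the projection functions \(f\), \(g\), and \(y\) are omitted:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.Wq = nn.Linear(dim, dim, bias=False)
        self.Wk = nn.Linear(dim, dim, bias=False)
        self.Wv = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, cls_tok, tokens):
        # cls_tok: (B, 1, dim); tokens: (B, 1+N, dim) = [CLS || patch tokens]
        q = self.Wq(cls_tok)
        k, v = self.Wk(tokens), self.Wv(tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v   # fused CLS token, (B, 1, dim)
```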

(2) Evaluation metrics and model ensemble

To evaluate the proposed algorithm comprehensively, three evaluation metrics widely used in image classification tasks are introduced, namely sensitivity, precision, and F1 score. The definitions of the three metrics are as follows:

$$\begin{aligned}{} & {} \hbox {Sensitivity} = \hbox {TP}/(\hbox {TP}+\hbox {FN})\\{} & {} \hbox {Precision} = \hbox {TP}/(\hbox {TP}+\hbox {FP})\\{} & {} F\hbox {1-score} = \hbox {2TP}/(\hbox {2TP}+\hbox {FP}+\hbox {FN}), \end{aligned}$$

where TP, FN, and FP are the numbers of true positives, false negatives, and false positives, respectively. Sensitivity is the proportion of actual positive samples that are predicted as positive. Precision is the proportion of predicted positive samples that are actually positive. F1 is the harmonic mean of sensitivity and precision.
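From raw per-class counts, the three metrics reduce to a few lines:

```python
def class_metrics(tp, fp, fn):
    """Sensitivity, precision, and F1 score from per-class counts."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return sensitivity, precision, f1
```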

In addition, each slab number sample in the dataset consists of ten characters, so the above metrics are no longer suitable for evaluating the sequence recognition model. Therefore, two further metrics are used for this purpose: the slab numbers character recognition accuracy \(P_1\) and the slab numbers recognition accuracy \(P_2\), calculated as follows:

$$\begin{aligned} \left\{ \begin{array}{ll} P_1 = N_c / N_s \\ P_2 = N_t / N_a \end{array} \right. , \end{aligned}$$
(14)

where \(N_c\), \(N_s\), \(N_t\), and \(N_a\) are the number of correctly recognized characters, the total number of characters, the number of correctly recognized sequences, and the total number of sequences, respectively.

In particular, since our aim is a slab number recognition algorithm deployable on industrial sites, its performance and reliability must be sufficiently high. We therefore adopt a rather demanding criterion: a sequence recognition result is considered correct only if all characters in the sequence are recognized correctly.
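Under this criterion, \(P_1\) and \(P_2\) from Eq. (14) can be computed as follows (predictions and labels taken as equal-length strings):

```python
def sequence_accuracy(preds, labels):
    """P1 (character level) and P2 (sequence level) from Eq. (14)."""
    n_chars = n_correct_chars = n_correct_seqs = 0
    for pred, label in zip(preds, labels):
        n_chars += len(label)
        n_correct_chars += sum(p == t for p, t in zip(pred, label))
        n_correct_seqs += int(pred == label)  # correct only if every char matches
    return n_correct_chars / n_chars, n_correct_seqs / len(labels)
```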

In addition, to reduce the bias of a single model and maximize recognition accuracy, the results of the two recognition models are ensembled by the following strategy. When the character recognition result is consistent with the sequence recognition result, it is output as the final result; otherwise, the two results are compared character by character, and the longest common recognition result is compared with the production information stored in the manufacturing execution system (MES). If it matches a unique record, the matched result is output as the final result; if no match can be obtained, an early warning is issued to prompt manual intervention.
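The ensemble rule can be sketched as below. Treating the 'longest common recognition result' as the position-wise agreement of the two models is our reading of the procedure, and `mes_records` is a hypothetical stand-in for the production records queried from the MES:

```python
def ensemble(char_result, seq_result, mes_records):
    """Combine the two model outputs, falling back on the MES records."""
    if char_result == seq_result:
        return char_result
    # positions where the two models agree form a partial (common) result
    common = [c if c == s else "?" for c, s in zip(char_result, seq_result)]
    matches = [r for r in mes_records
               if all(p in ("?", q) for p, q in zip(common, r))]
    if len(matches) == 1:         # a unique MES record fits the partial result
        return matches[0]
    raise RuntimeError("ambiguous recognition: manual intervention required")
```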

Dataset and experiment setting

Dataset

To verify the effectiveness of the proposed algorithm, industrial slab data and CIFAR10 are adopted in the experiments. The images of slabs are taken from monitoring video of actual hot rolling production at a major iron and steel company in China. After preliminary processing, 5000 sequence images of size \(32\times 240\times 3\) are obtained. Each sequence image is first binarized with an adaptive threshold; the black and white pixels of the binarized image are then counted, and from these statistics the upper and lower boundaries of the sequence, together with the left and right boundaries of each character, are determined. Finally, each character in the slab number is segmented according to these boundaries, and the resulting character images are divided into training, validation, and test sets in the ratio 8:1:1. For data quality improvement, the low-quality character dataset to be improved serves as the source domain, while the target domain consists of synthetic images of better quality, less noise, and higher definition. Figure 9 shows examples, with the slab sequence dataset on the left, the character dataset to be improved on the top right, and the constructed high-quality synthetic dataset on the bottom right. Note that, per the process specification, each slab number consists of 10 characters drawn, according to specific rules, from the ten digits 0-9 and the three letters L, M, and N.

Fig. 9 Some examples of datasets involved in the experiments

Experimental configurations

The proposed two-stage algorithm and its related strategies are implemented in PyTorch. All experiments are conducted on a personal computer with one GeForce GTX 1080Ti GPU, an i7-7700K CPU, and 24 GB RAM. The implementation details are as follows.

(1) HybridCy for unpaired data quality improvement

In the training process of HybridCy, the model weights are optimized by alternating training, i.e., the weights of the generator or the discriminator are fixed while the other is updated. Specifically, the batch size is set to 8, the maximum number of epochs is set to 100, the network is initialized with Kaiming normal initialization [39], and the optimizer is Adam with \(\beta _1=0.9\), \(\beta _2=0.999\). The learning rate is fixed at 1e−5 for the first 50 epochs to ensure the learning ability of the model while providing a larger search space; in the latter 50 epochs, it is decayed linearly to 0 to reduce oscillation and stabilize training. For the parts consistent with the original CycleGAN, the relevant parameters follow the original literature to obtain the best experimental results; for example, the loss weights of the two generators and the Idt loss are set to 10, 10, and 0.7, respectively.
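The learning-rate schedule described above (constant for the first 50 epochs, then linear decay to zero) can be expressed with a `LambdaLR` scheduler; `model` below is a placeholder for whichever HybridCy network is being updated:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)  # placeholder for a HybridCy generator/discriminator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda epoch: 1.0 if epoch < 50 else max(0.0, (100 - epoch) / 50),
)
for epoch in range(100):
    # ... alternating generator/discriminator updates for this epoch ...
    scheduler.step()
```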

(2) MSHy-Vit for slab numbers recognition

The configurations of the slab numbers recognition algorithms, i.e., MSHy-Vit for character recognition and MSHy-Vit for sequence recognition, are set as follows. For both algorithms, the batch size is 32, the initial learning rate is 1e−3, and a cosine learning rate schedule is adopted, with decay periods of 50 and 60, respectively. The optimizer is Adam with \(\beta _1=0.9\), \(\beta _2=0.999\), and the maximum numbers of iterations are 100 and 200, respectively. Both models have three branches, whose patch sizes are (2, 2), (1, 1), (4, 4) and (4, 12), (8, 24), (16, 24), respectively; the internal stacking depths of the branches are (1, 2, 1) and (1, 5, 1), and the numbers of heads of the internal attention mechanism are 3 and 6, respectively.

Experimental results and analysis

Effectiveness analysis of HybridCy

(1) Qualitative results

The qualitative results can be seen in Fig. 10, where several representative visual samples are shown. Each picture includes eight parts, i.e., x, G(x), F(G(x)), F(x), y, F(y), G(F(y)), and G(y) (the meaning of each part is shown in Fig. 2 in “CycleGAN in computer vision”). In the experiment, the forward cycle, i.e., the reconstruction of the industrial slab data according to the distribution of the synthetic data, is what we need; the reverse cycle plays only an auxiliary role. Therefore, G(x) in each graph is of the most interest.

Fig. 10 Qualitative evaluation of HybridCy

As shown in Fig. 10, the proposed HybridCy exhibits remarkable performance: it reconstructs the source-domain images well according to the distribution features of the target domain. In terms of visual results, the images reconstructed by HybridCy show less noise, sharper edges, and significantly higher differentiation between foreground and background than the raw industrial slab data, all of which are desirable properties for deep learning tasks. We can therefore conclude that the proposed HybridCy is effective in the qualitative evaluation.

(2) Quantitative results

Considering that the primary purpose of HybridCy is to improve the recognition performance of deep learning-based algorithms, we further checked, beyond the qualitative evaluation, whether the reconstructed data actually help the final recognition task. To compare performance fairly, we conducted three groups of experiments of different difficulty, graded by the severity of the visual and geometric distortion of the images; the test sets contain 3000, 3000, and 2000 samples from easy to hard, respectively.

Fig. 11 Group recognition results

The performance of HybridCy on the ‘Easy’, ‘Medium’, and ‘Hard’ groups is illustrated in Fig. 11. Here, the ‘Hard’ group is composed of images with severe geometric distortion and blurring that are difficult to distinguish even by the human eye. As shown in Fig. 11, all three groups of reconstructed images are improved to different degrees. The ‘Hard’ group shows the most obvious visual enhancement: the originally severely distorted and blurred images become distinct and far less blurred after reconstruction.

The recognition accuracy comparison of the original and reconstructed data is shown in Table 3. For the original data, the recognition accuracies of the three groups from easy to hard are 99.13%, 98.01%, and 91.23%, respectively. After reconstruction by CycleGAN and HybridCy, the recognition accuracy of all three groups improves, with the ‘Hard’ group showing the most significant improvement, followed by the ‘Medium’ group, and the ‘Easy’ group the least. Comparing CycleGAN and HybridCy, CycleGAN is slightly superior to HybridCy on the ‘Easy’ group, and on the ‘Medium’ group their performance is comparable. On the ‘Hard’ group, however, the proposed HybridCy exceeds CycleGAN significantly, improving accuracy by 0.84%.

Table 3 Recognition accuracy comparison of original data and reconstructed data

The reasons behind this difference can be analyzed as follows. On the one hand, deep learning itself has powerful feature extraction and learning capability, so for the ‘Easy’ group, where the data quality is high, the deep learning algorithm alone is able to complete the given classification task. The situation is similar for the ‘Medium’ group, where the data quality is slightly reduced but the features remain apparent. On the other hand, the ‘Hard’ group data have severe geometric distortion and significantly lower foreground-background contrast, making the features particularly inconspicuous; identifying such data places higher demands on the deep learning model. Without data quality enhancement, recognition on such poor data is prone to large bias. In contrast, the data quality enhancement model largely eliminates the interfering noise of the original image, and the character features of the reconstructed images stand out more clearly against the background, which enables the deep learning model to better extract and learn the features and thus achieve higher classification accuracy. Finally, the proposed HybridCy outperforms CycleGAN on the ‘Hard’ group mainly owing to the powerful Transformer mechanism and the early preprocessing of image features, which improve the stability of the model and the quality of the input features, respectively, and thus contribute to better results.

(3) Comparison with other image processing methods in groups

In this section, comparative experiments are conducted to further evaluate the performance of the proposed HybridCy. Two representative traditional image processing methods, bilateral filtering (BF) and adaptive thresholding (AT), are chosen as comparison algorithms. In addition, the conventional CycleGAN, the method most relevant to the proposed algorithm, is also included. The experimental results are shown in Fig. 12.

Fig. 12 Comparison of the image processing performance of several methods

From the visual comparison in Fig. 12, the following conclusions can be drawn. Bilateral filtering, a traditional image processing method, is almost ineffective: it can neither denoise effectively nor reconstruct characters well. Notably, other traditional image filtering methods were also tested besides bilateral filtering, with similarly unsatisfactory results. The main reason for their failure is that the poor industrial slab data have low resolution and little difference between foreground and background, so simple filtering cannot effectively improve data quality. The adaptive thresholding method performs satisfactorily on the ‘Easy’ group, but its biggest challenge is the setting of the threshold: it is very difficult to adaptively choose thresholds that binarize all data well, and the method easily fails on the ‘Hard’ group.

Compared with the above two methods, CycleGAN and HybridCy reconstruct images by learning the pixel distributions of the source and target domains to improve image quality. As shown in Fig. 12, these two methods are more effective and robust than the traditional image processing methods and achieve better reconstruction for data of varying quality. In addition, compared with those obtained by CycleGAN, the images reconstructed by the proposed method are the smoothest, have the least residual noise, and perform better in terms of connectivity and edges. This is most evident in the ‘Hard’ group, which validates the effectiveness of the proposed strategy.

(4) Limitation analysis and post-processing

Although good robustness and compelling results are achieved in most cases, and HybridCy has been shown to be truly helpful for the recognition tasks, there are still some unsatisfactory cases, a few typical examples of which are shown in Fig. 13.

Fig. 13 Samples for unsatisfactory cases

As can be seen from Fig. 13, the characters in the images are usually reconstructed well, and the text features in the reconstructed images are more prominent. However, salt-and-pepper noise appears in some of the reconstructed images. In addition, it must be admitted that a small number of reconstructed images show confusion of characters before and after reconstruction, mainly between the letters M and N. The former phenomenon may stem from the feature distribution of the synthetic training data, while the latter is most likely due to training data imbalance, as the number of letter N samples is much smaller than that of M. For these two cases, we further apply some basic image processing techniques to remove the salt-and-pepper noise; the post-processed images, shown at the bottom of Fig. 13, are satisfactory. In addition, by increasing the number of letter N samples in the dataset, the confusion between the two letters can be significantly reduced.

Fig. 14 Confusion matrix of character recognition

Table 4 Performance analysis of characters recognition about each category

Performance analysis of MSHy-Vit on the industrial slab dataset

In this part, the performance of the proposed MSHy-Vit models on both the character recognition and sequence recognition tasks for slab numbers is validated on an industrial dataset.

Character recognition

Figure 14 shows the confusion matrix obtained by the character recognition algorithm for each category of the test set, and Table 4 summarizes the per-category sensitivity, precision, and F1 score of the model. As shown in Fig. 14 and Table 4, the results are satisfactory for all three metrics, with mean values of 0.998, 0.986, and 0.986, respectively.

In addition, considering that each slab number consists of ten characters, we also evaluated the segment-then-recognize results on a dataset of 5000 characters segmented from 500 slab sequences, where the character recognition accuracy is 98.87% and the slab numbers recognition accuracy is 90.5%.

Sequence recognition

The results of slab numbers sequence recognition are shown in Table 5: the character recognition accuracy \(P_1\) and the sequence recognition accuracy \(P_2\) are 99% and 96.7%, respectively. That is, the 5000 slab sequences contain a total of 50,000 characters, of which the proposed MSHy-Vit model correctly recognizes 49,500 characters and 4835 complete sequences.

Additionally, Fig. 15 presents the loss values and recognition accuracies of the above training processes on the training and validation sets as the epochs increase. As can be seen from Fig. 15, the losses on both sets converge quickly and effectively, while the recognition accuracies on the validation set reach about 99% and 97%, respectively, demonstrating the effectiveness of the proposed models.

Table 5 Performance analysis of sequences recognition
Fig. 15 Training trajectories of the two recognition models

To further improve stability and recognition accuracy, the character and sequence recognition models are ensembled using the strategy described in “Stage II: MSHy-Vit for slab number recognition”. The recognition accuracies before and after the ensemble are shown in Table 6: the ensemble model achieves 98.5%, which is 8% better than character recognition and 1.8% better than sequence recognition.

Comparison with previous methods

In this section, the classification performance of the proposed MSHy-Vit is validated against a series of competitors. For more objective results, the reconstructed images from the first stage, rather than raw industrial data, are used as input for all models. The comparative experiments cover three categories: character recognition, sequence recognition, and existing slab numbers recognition algorithms (denoted by ‘M1’, ‘M2’, and ‘M3’, respectively). The results are shown in Table 7, which also reports, for each model, the number of parameters, the architecture, the training time, and the test throughput.

Based on the results in Table 7, compared with the most representative CNN algorithms, the largest improvement obtained by our algorithm is 0.61% over DenseNet121, and the proposed MSHy-Vit also holds advantages over several other CNN algorithms. Against the Transformer models, MSHy-Vit improves the recognition accuracy by 0.38% and 0.33% over Vit and Cross-Vit, respectively.

Compared with the existing slab numbers recognition algorithms, the accuracy of the proposed character-segmentation-based algorithm before the ensemble is 90.5%, slightly lower than that of DCNN and GDT-FCN. However, the accuracy of our sequence recognition algorithm surpasses the two comparison algorithms by 5.5% and 4.5%, respectively, and the ensembled model surpasses both significantly.

Regarding \(T_{train}\) and throughput, although our model takes longer to train than lighter models such as Vit-tiny and CRNN, it has a clear accuracy advantage. Our goal is explicit: to design a model with high accuracy and stable performance for online application in real industrial scenarios. The test throughputs of MSHy-Vit-C, MSHy-Vit-S, and the ensemble model are 121, 149, and 111, respectively, which fully satisfy the needs of real applications. Moreover, industrial application-level algorithms have extremely high requirements on reliability and stability, and any errors can cause huge economic losses. Therefore, although some models outperform the proposed algorithms in speed, their accuracy is too poor to meet industrial application requirements.

Table 6 Performance analysis of ensemble model
Fig. 16 Examples of recognition results for several end-to-end slab number recognition algorithms

Table 7 Comparison of the proposed algorithm with related methods on industrial slab dataset

The main reasons for these results lie in two aspects. The first is the advantage of multi-scale fusion: patching images at different scales yields features of different scales, and the resulting feature enrichment benefits both the convergence speed and the accuracy of the model. The second is the advantage of the I2T module: although the Transformer has solid global modeling capability, its ability to extract low-level features is weaker than that of a CNN, and the CNN-based I2T module makes up for this deficiency.

Finally, Fig. 16 shows examples of end-to-end sequence recognition results, with the data divided into the ‘Easy’, ‘Medium’, and ‘Hard’ groups as in “Effectiveness analysis of HybridCy”. As shown in the figure, the proposed MSHy-Vit-S outperforms the comparison algorithms in all three groups and exhibits the lowest false recognition rate. These visualizations are consistent with the statistical results in Table 7, further demonstrating the effectiveness of the proposed algorithm.

Parameter sensitivity analysis of MSHy-Vit

In this section, we conduct further experiments on MSHy-Vit for parameter sensitivity analysis. The stacking depth of the Transformer encoder blocks is an important hyperparameter that may significantly affect model performance, so we explore the performance of MSHy-Vit-S at depths of 3, 5, and 7. The experimental results are shown in Fig. 17.

Fig. 17 Sensitivity analysis of the stacking depth of the Transformer encoder blocks

Table 8 Comparison of the proposed algorithm against peer competitors on CIFAR10

In Fig. 17, the highest accuracy (98.64%) is obtained on the validation set at a depth of 5. Although the model converges faster in the early stage at a depth of 7, it does not reach the highest accuracy because the overparameterized model tends to overfit. Meanwhile, at a depth of 3, the model has the lowest validation accuracy, 96.99%, probably because the model is too simple and thus lacks learning capacity.

Performance analysis on benchmark datasets

The experimental results on the industrial datasets illustrate the strong performance of the proposed MSHy-Vit. To validate its generalization capability, further comparative experiments were conducted on CIFAR10 (only with MSHy-Vit-C, since the CIFAR10 images have equal length and width). The results are shown in Table 8, where the symbol ‘−’ means the corresponding competitor has no publicly reported result. The competitors' results are the best reported in their original papers; for both the Transformer and Hybrid models they are transfer learning results, i.e., the model is trained on ImageNet1K and then fine-tuned on CIFAR10. It is worth noting that, owing to the limitations of our hardware, only the validation set of ImageNet1K was used to pre-train MSHy-Vit-C (the training set is 20 times larger than the validation set).

As shown in Table 8, the proposed MSHy-Vit-C has clear advantages over the CNN-based algorithms; in particular, it improves accuracy by 2.86% over DenseNet24 while using less than half the parameters. Although MSHy-Vit-C is slightly inferior in accuracy to the listed Transformer and Hybrid algorithms, the difference is not significant. Given that its pre-training was performed only on the ImageNet1K validation set due to hardware limitations, this small gap does not mean that MSHy-Vit-C is not promising; on the contrary, despite using significantly less data than the competitors, its performance does not degrade significantly, which indicates its potential to some extent.

Potential application and discussion

As described in “Introduction”, the recognition of slab numbers is critical for the hot rolling production process. However, restricted by the environment, slab numbers recognition often faces many challenges. In addition to the pre-processing of the original data, two slab numbers recognition algorithms have been proposed, and in practical application, combined with the MES, the algorithm's reliability is further improved. A final online recognition accuracy of 98.5% is achieved, which can assist operators in identifying slab numbers in a timely manner and provide alerts to field personnel when recognition results are abnormal.

Conclusion

To design an industrial application-level slab numbers recognition algorithm, we proposed a two-stage hybrid recognition algorithm based on CNN and Transformer that improves both data quality and the deep learning model. First, HybridCy, a data quality enhancement algorithm applicable to real-world unpaired data, was proposed to remove the dependence of existing methods on paired data. Second, two slab numbers recognition algorithms were designed based on the proposed multi-scale hybrid Vit model, MSHy-Vit. The effectiveness of the proposed algorithms was verified on actual industrial data obtained from a major iron and steel enterprise. The experimental results show that HybridCy can effectively improve the visual quality of the data and the accuracy of the subsequent recognition task, especially for low-quality data with severe geometric distortion. In addition, the proposed MSHy-Vit combines the advantages of CNN and Transformer and shows comparable or even better performance than many strong algorithms in the literature. In practical application, the reliability of the proposed algorithm is further improved by ensembling the two recognition algorithms with the help of the MES. Finally, a recognition accuracy of 98.5% is achieved, which meets the on-site industrial requirements and thus effectively raises the automation level of the hot rolling production process. Our future research will focus on the industrial deployment of slab numbers recognition in hot rolling mills and on further optimizing the model to reduce its complexity.