Automatic monitoring of steel strip positioning error based on semantic segmentation

The misalignment of steel strips in relation to the roller table centerline remains a problem for rolling mill production lines. Strip position correction is still largely left to human judgment, and strip steering is traditionally a semi-manual operation. Automating the alignment process could reduce maintenance costs, damage to the plant, and material losses. The first step toward automation is to determine the strip position and its corresponding error. This study presents a method that employs semantic segmentation based on convolutional neural networks to estimate the positioning error of steel strips from images of the process. Additionally, the system mitigates the influence of mechanical vibration on the images. The system performance was assessed with standard semantic segmentation evaluation metrics and by comparison with the dataset ground truth. The results showed that 97% of the estimated positioning errors are within a 2-pixel margin. The method proved to be a robust real-time solution, as the networks were trained on a set of low-resolution images acquired in a complex environment.


Introduction
Steel strips are manufactured from cast slabs, which pass several times between a pair of work rolls with decreasing gaps until the intended thickness reduction is achieved [20,21,39,40]. In Steckel mill lines, during the hot rolling process, the strips are driven by the roller table in the rolling direction. However, the rolling procedure tends to push the strips perpendicularly to this direction, which can induce misalignment. For instance, the unaligned strips are prone to collide positioning error, which could be assessed from the images of the process.
The available literature presents a considerable number of studies with solutions for metalworking. These include the detection of defects in metal casting [5,10], recognition of slab identification [22], prediction of mechanical properties [38], bearing fault diagnosis [14,19,29,35,41], and steel defect identification [25,28,31]. On the other hand, only a few works apply image processing to measure the strip position in a rolling mill process. To the best of the authors' knowledge, only two such studies exist. The first uses traditional image processing with Bezier curves to measure the strip centerline in a rolling process [4]. That study was performed on a dataset of limited size. The second applies morphological operations to processed images to calculate the steel strip positioning error [9]. However, that work lacks validation against a ground truth set.
In this paper, a novel method to assess the strip positioning error is presented. The system employs semantic segmentation based on a convolutional neural network (CNN) to extract the strip portion of the images of the process, infers the strip position, and estimates the positioning error in relation to setpoint images. Moreover, an analogous approach is used to attenuate the influence of the camera's mechanical vibration on the images. It consists of applying a CNN-based semantic segmentation method to identify the position of a static mill component that appears in the images. This position is later used as the reference point for the relative position of the strip.
The system performance was assessed with standard semantic segmentation evaluation metrics and by comparison with the positioning error signal derived from the dataset ground truth. The method proved to be a robust real-time solution, as the networks were trained on a set of low-resolution images acquired in a complex environment, containing steam and variable luminance.
The remainder of this paper is divided into the following sections: "Theoretical basis," "Methods," "Results and discussion," and "Conclusions".

Process description
The rolling mill process consists of reducing the thickness of steel slabs by successively passing them through a pair of work rolls with a decreasing gap. The Steckel mill process differs from traditional rolling by the presence of coiler furnaces on either side of the rolling stand, as illustrated in Fig. 1, which reheat the strips while winding them upon their drums. This process allows the product to reach lengths of approximately 600 m [11,20]. Heating the carbon steel produces an oxide layer over the strip, which is removed before each pass of the strip through the work rolls by high-pressure water, in a process called descaling, which generates heavy steam.
During a few steps of the process, the side guides align the strip in relation to the roller table centerline before the rolling operation. From this position, the strips are moved by the roller table in the rolling direction. However, the rolling also induces the strips to move perpendicularly to this direction, as illustrated in Fig. 2a. This tends to cause strip misalignment, which could lead to collisions and, consequently, process losses [11]. Strip position correction is still largely left to human judgment [4]. The realignment is executed by an operator, who judges the strip misalignment from real-time images of the process. The operator attempts to compensate for the undesired movement by tilting the work rolls with a manual command. This procedure creates an asymmetrical gap between the work rolls, inducing higher reduction forces on the side of the smaller gap and lower forces on the opposite side. Therefore, the wedging effect steers the strip in the direction of the larger gap, which corresponds to the direction of the desired position [11], as shown in Fig. 2b.

Semantic segmentation
Semantic segmentation is a pixel-wise categorization, which gathers pixels belonging to the same class [24,28,30]. Regarding digital image processing, the method is best applied as an emulator for human pattern identification [28]. Compared to traditional image segmentation, semantic segmentation based on convolutional neural networks has demonstrated considerable advantages [28] and has been applied to many tasks, such as medical applications [2,26,27,42], autonomous driving [6,8,12,32], object detection [13,37], and pose estimation systems [33], to name a few. The semantic segmentation architecture usually consists of an encoder-decoder structure [3,15,23]. The first part, composed of convolution and max pooling operations, extracts high-level features by mapping the input to a lower-dimensional representation [23,36]. On the other hand, the decoder, commonly composed of transposed convolutions and up-pooling layers, expands the high-level features, recovering a feature map size compatible with the input layer size [28].

Fig. 1 Steckel reversible rolling mill with coiler furnaces. 1: rolling stand, 2: coiler furnace, 3: winding drum of the coiler furnace, 4: work roll, 5: backup roll, 6: descaler jets, 7: steel strip. Source [1,20]

Fig. 2 Strip misalignment during rolling and correction through roll tilting [11]. a Strip movement directions. b Effects of roll tilting. The procedure creates an asymmetrical gap between work rolls, in which the high reduction forces steer the strip in the direction of the larger gap
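The change in feature map size along an encoder-decoder structure, as described in this section, can be illustrated with a minimal pure-Python sketch. It is not the network used in this work: the 2 × 2 max pooling stands in for an encoder stage, and a nearest-neighbour upsampling (an assumed simplification of transposed convolution or up-pooling) stands in for a decoder stage. The function names and the toy input are hypothetical.

```python
def max_pool_2x2(feature_map):
    """Downsample a 2D list by taking the maximum of each 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def upsample_2x(feature_map):
    """Upsample a 2D list by repeating every value over a 2x2 block.

    Assumption: nearest-neighbour repetition replaces the learned
    transposed convolution of a real decoder.
    """
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


# Toy 4x4 input (hypothetical values).
image = [
    [1, 2, 0, 0],
    [3, 4, 0, 1],
    [0, 0, 5, 6],
    [0, 2, 7, 8],
]
encoded = max_pool_2x2(image)   # 4x4 -> 2x2 (lower-dimensional representation)
decoded = upsample_2x(encoded)  # 2x2 -> 4x4 (size compatible with the input)
```

The sketch only shows how the spatial size shrinks and is recovered; a trained network would also learn the convolution filters applied at each stage.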

Methods
This work adopted supervised learning to estimate the positioning error of steel strips in a Steckel mill line. The method employs hybrid semantic segmentation to estimate the strip position from images of the process and calculates the positioning error relative to values derived from setpoint (or reference) images. Additionally, the system mitigates the influence of mechanical vibration on the process videos. A concise overview of the adopted methodology is presented in the flowchart of Fig. 3, and further details are given in the remainder of this section.
The dataset is composed of RGB images, which were acquired from an analog camera installed over the mill stand entrance, on the operator side, at a rate of 30 fps. From the acquired images, the algorithm gathered the images of interest according to the activation command of the descaler and the strip tracking signals. During the descaling process, the heavy steam present in the region between the camera and the strip makes image processing unfeasible. Also, during part of the acquisition time, the strip is not in the field of view of the camera. Therefore, the images captured in either circumstance are not processed by the positioning error estimation algorithm.
After the dataset selection, the centralization command signal of the side guides is used to categorize the images of interest into position setpoint images or images under analysis. When the signal is active, the images are classified as setpoint images, since the side guides ensure the strip centralization, as mentioned in Section 2.1. The system stores the reference image and estimates the strip position for both categories. The positioning error is then calculated by comparing the strip position in the actual image with the position in the last detected setpoint image.
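The frame selection and categorization logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the signal names, the tuple layout, and the sign convention of the error (actual position minus setpoint position) are hypothetical.

```python
def classify_frame(descaler_active, strip_tracked, centering_active):
    """Categorize a frame from the three process signals.

    Assumed signal semantics:
      descaler_active  - descaler command on (heavy steam, frame unusable)
      strip_tracked    - strip is in the camera field of view
      centering_active - side guides are centralizing the strip
    """
    if descaler_active or not strip_tracked:
        return "discard"
    return "setpoint" if centering_active else "analysis"


def positioning_errors(frames):
    """Return one error per frame (None when no error can be computed).

    `frames` is an iterable of tuples
    (descaler_active, strip_tracked, centering_active, strip_position).
    """
    last_setpoint = None
    errors = []
    for descaler, tracked, centering, position in frames:
        kind = classify_frame(descaler, tracked, centering)
        if kind == "discard":
            errors.append(None)
            continue
        if kind == "setpoint":
            last_setpoint = position  # store the most recent reference
        if last_setpoint is None:
            errors.append(None)       # no reference detected yet
        else:
            errors.append(position - last_setpoint)
    return errors


# Hypothetical sequence of frames (positions in pixels).
frames = [
    (True, True, False, 100),    # descaling: discarded
    (False, True, True, 120),    # side guides active: setpoint
    (False, True, False, 123),   # under analysis
    (False, False, False, 0),    # strip out of view: discarded
    (False, True, False, 118),   # under analysis
]
errors = positioning_errors(frames)
```

In this sketch a setpoint frame always yields an error of zero, since it is compared with itself.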

Regions of interest
An example of the process images used as the system input is presented in Fig. 4a. As highlighted in the figure, the images contain a portion of the strip aligned with the image bottom edge, side guide elements, and parts of the mill structure. Therefore, to reduce the image complexity, a region of interest for strip position estimation (ROI1) was selected, lowering the number of mill components present in the image.
The impacts of the rolling process cause unwanted camera vibration, which is unavoidable as the camera is placed 6 m above the strip. This can be observed in Fig. 5a, which shows parts of the mill structure in two consecutive frames. From this figure, it is perceptible that the distance between the mill structures and the horizontal dashed line varies considerably. Empirical observations revealed that the mill structures adjacent to the strip present negligible movement relative to the strip. Hence, to avoid interference from the camera vibration in the estimated values, the mill structure parts visible in the images of the process were used as a strip position reference. These structures are present in region of interest 2 (ROI2), indicated in Fig. 4b. In ROI2, the mill parts delimit a polygon, whose centroid is used as the mentioned position reference. This polygon and its centroid are highlighted in Fig. 5b.
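The role of the polygon centroid as a vibration-insensitive reference can be illustrated with a short sketch. It assumes that the polygon is available as a binary mask and that the centroid is the mean location of its pixels; the masks and the strip rows below are hypothetical toy values.

```python
def mask_centroid(mask):
    """Return the (row, col) centroid of the non-zero pixels of a 2D mask."""
    count = row_sum = col_sum = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                count += 1
                row_sum += r
                col_sum += c
    if count == 0:
        raise ValueError("mask contains no foreground pixels")
    return row_sum / count, col_sum / count


# Reference polygon in two consecutive frames; in frame_b the whole
# scene is shifted down by one row, as camera vibration would do.
frame_a = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
frame_b = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
centroid_a = mask_centroid(frame_a)
centroid_b = mask_centroid(frame_b)

# Hypothetical strip edge rows in the same two frames (same shift).
strip_row_a, strip_row_b = 10.0, 11.0
relative_a = strip_row_a - centroid_a[0]
relative_b = strip_row_b - centroid_b[0]
```

Because the strip and the mill structure shift together, the position measured relative to the centroid is the same in both frames, while the absolute row differs.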

Labeling
The ground truth labels of each region of interest were created by manual annotation using the Image Labeler Matlab app. Pixels belonging to the reference polygon or strip portion were assigned the intensity value 1, and pixels belonging to the background were assigned the value 2. In total, 1390 labels were created for each region of interest, as the selected dataset comprises 1390 RGB images of 704 × 480 px. The images were acquired in a complex environment. The remaining elements from the descaling process, such as water over the strip and steam on the strip location and its surroundings, compromise the image quality. The water creates unpredictable patterns over the strip, as can be seen in Fig. 6-2b, 2c, 1b, and 1c. On the other hand, the steam blurs the acquired images. Figure 6-2a, 2c, and 1a show some of these blurring effects. Another complication is the strip incandescence, resulting from the high temperature of the strip, which reflects on the side guide structures present in ROI1. These structures mirror the strip color and could easily be mistaken for strip portions, as in the effects indicated by white arrows in Fig. 6-2b, 1a, 1b, and 1c. The labeling process took these cases into account to avoid misclassifications. In cases where portions of the strip were covered by steam, the labeling assumes that the strip edge is parallel to the image bottom.

Semantic segmentation
In the present work, two semantic segmentation approaches based on CNN operations were applied to detect, independently, the strip portion present in ROI1 and the polygon delimited by the mill components present in ROI2. Three architecture configurations were investigated for each region of interest (ROI). The architectures differ from each other in the number of encoder/decoder pairs, which is 1, 2, or 3. Figure 7 illustrates the largest network in terms of the number of layers (three encoders and three decoders) applied to each ROI. The encoder part consists of downsampling layers, which include convolution and max pooling operations. In contrast, the decoder part comprises upsampling layers, which consist of transposed convolutions. After each convolution and transposed convolution layer, the Rectified Linear Unit (ReLU) activation function was applied. The Adam optimizer and a learning rate of 0.001 were selected as optimization parameters. Moreover, the influence of the number of filters in each operation was also investigated. The number of filters took values from the set {2, 4, 8, 16, 32, 64}. Hence, eighteen architectures were explored altogether for each ROI.
The encoder-decoder architecture is followed by a pixel classification layer, which enables a pixel-level classification and is composed of a convolution and a Softmax operation. The layers are listed in Table 1, in which the layer nomenclature refers to that introduced in Fig. 7.
The dataset was initially composed of 1390 images, which were split into training and test sets of 1112 and 278 images, respectively.
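The search space described in this section (three depths times six filter counts) and the dataset split can be enumerated with a short sketch. The ordering of the models, by number of encoder/decoder pairs first and by number of filters second, is an assumption; it is, however, consistent with the two models named in the results (Model 3 with one pair and 8 filters, Model 12 with two pairs and 64 filters). The variable names are hypothetical.

```python
from itertools import product

ENCODER_DECODER_PAIRS = (1, 2, 3)
FILTER_COUNTS = (2, 4, 8, 16, 32, 64)

# One entry per explored architecture, numbered from 1 (assumed ordering).
configs = [
    {"model": index, "pairs": pairs, "filters": filters}
    for index, (pairs, filters) in enumerate(
        product(ENCODER_DECODER_PAIRS, FILTER_COUNTS), start=1
    )
]

# Dataset split reported in the text.
TOTAL_IMAGES = 1390
TRAIN_IMAGES = 1112
TEST_IMAGES = 278
train_fraction = TRAIN_IMAGES / TOTAL_IMAGES  # 0.8, i.e. an 80/20 split

for config in configs:
    print(config)
```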

Positioning error estimation
The CNN predictions for ROI1 were refined via morphological operations and outlier exclusion to correct misclassifications induced by the presence of complex elements in the image. Details about these elements are given in Section 3.2 and exemplified in Fig. 6. The probabilistic mask of the strip, obtained from the segmentation, was binarized. In the resulting image, the connected components were identified and deleted, except for the largest one. This operation keeps the strip area and eliminates smaller, disconnected elements erroneously classified as part of the strip, such as steam and mill components that reflect the strip incandescence. Afterward, the possible holes in the strip area, mostly caused by water over the strip and fluctuation in the incandescence intensity, were filled by a flood-fill operation. From the resulting binary image, the pixel locations of the top edge of the strip portion were used to estimate the strip position. Then, an outlier removal with a threshold between 40 and 60% was applied to prevent interference from possible irregular edges. The strip position in relation to the image bottom edge (Y_ImStrip, for the actual image, and Y_StpStrip, for the setpoint image) was determined as the average of the remaining values.
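The refinement steps listed above can be sketched in pure Python as follows. This is an illustration under several assumptions rather than the implementation used in this work: the binarization threshold of 0.5, the 4-connectivity, the measurement of the edge height as the number of rows counted from the image bottom, and the reading of the "40 to 60%" threshold as keeping only the edge values between the 40th and 60th percentiles are all assumed. All function names and the toy mask are hypothetical.

```python
from collections import deque


def binarize(prob_mask, threshold=0.5):
    """Turn a probabilistic mask into a 0/1 mask (threshold is assumed)."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_mask]


def _flood(mask, start, value, seen):
    """Collect the 4-connected pixels equal to `value` reachable from `start`."""
    rows, cols = len(mask), len(mask[0])
    queue, region = deque([start]), [start]
    seen.add(start)
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in seen and mask[nr][nc] == value:
                seen.add((nr, nc))
                queue.append((nr, nc))
                region.append((nr, nc))
    return region


def keep_largest_component(mask):
    """Delete every foreground component except the largest one."""
    rows, cols = len(mask), len(mask[0])
    seen, best = set(), []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 1 and (r, c) not in seen:
                region = _flood(mask, (r, c), 1, seen)
                if len(region) > len(best):
                    best = region
    out = [[0] * cols for _ in range(rows)]
    for r, c in best:
        out[r][c] = 1
    return out


def fill_holes(mask):
    """Fill background regions that are not connected to the image border."""
    rows, cols = len(mask), len(mask[0])
    outside = set()
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and mask[r][c] == 0 and (r, c) not in outside:
                _flood(mask, (r, c), 0, outside)
    return [[0 if (r, c) in outside else 1 for c in range(cols)]
            for r in range(rows)]


def strip_position(mask, low_pct=40, high_pct=60):
    """Average top-edge height above the image bottom, after outlier removal."""
    rows = len(mask)
    heights = []
    for c in range(len(mask[0])):
        for r in range(rows):
            if mask[r][c] == 1:
                heights.append(rows - r)  # rows counted from the bottom edge
                break
    if not heights:
        raise ValueError("no strip pixels found")
    heights.sort()
    first = len(heights) * low_pct // 100
    last = max(first + 1, len(heights) * high_pct // 100)
    kept = heights[first:last]            # assumed percentile window
    return sum(kept) / len(kept)


# Toy 6x10 probabilistic mask: strip in the bottom three rows.
prob = [[0.1] * 10 for _ in range(6)]
for r in range(3, 6):
    for c in range(10):
        prob[r][c] = 0.9
prob[4][4] = 0.2                 # hole (e.g., water over the strip)
prob[1][7] = prob[2][7] = 0.8    # irregular edge
prob[0][0] = 0.95                # isolated false positive (e.g., reflection)

cleaned = keep_largest_component(binarize(prob))
filled = fill_holes(cleaned)
position = strip_position(filled)
```

In this toy example the isolated pixel is removed, the hole is filled, and the irregular column is excluded by the percentile window, so the estimated position equals the height of the regular strip edge.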
Similarly, the ROI2 polygon predicted by semantic segmentation was refined via morphological operations, which included flood-fill, erode, and dilate. The flood-fill operation fills holes in the polygon prediction caused by misclassification. The erode and dilate functions separate mill components, other than the desired polygon, into smaller and disconnected components. Thus, the largest connected component can be kept as the reference polygon.
Figure 8 shows an example of an image under analysis (Fig. 8a) and a setpoint image (Fig. 8b). The figures also present the reference systems used to derive the strip positions and, from them, the strip positioning error. The global coordinate system (XOY) is used to derive the strip position in relation to the image bottom edge (Y_ImStrip, for the actual image, and Y_StpStrip, for the setpoint image). On the other hand, the local coordinate system (xcy) aims to mitigate the influence of camera vibration on the strip position estimation by providing a static reference position in relation to the process environment. This coordinate system has its origin at the centroid of the polygon composed by the mill components (c), which is located at a distance Y_Imc from the X-axis for the actual image, and Y_Stpc for the setpoint image. The strip position relative to this coordinate system can be calculated by Eq. 1 (actual image) and Eq. 2 (setpoint image). The strip positioning error is given by the difference between both values (Eq. 3).
• Strip position of the actual image relative to the local coordinate system (mill structure reference point) (Eq. 1)
• Strip position of the setpoint image relative to the local coordinate system (mill structure reference point) (Eq. 2)
• Strip positioning error (Eq. 3)
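The three relations listed above can be written out from the definitions given in the text. The following is a reconstruction under assumptions, not a verbatim copy of the original equations: the symbols y_Im, y_Stp, and e for the local positions and the error are introduced here for illustration, and the sign convention (strip coordinate minus centroid coordinate, actual minus setpoint) is assumed.

```latex
\begin{align}
y_{Im}  &= Y_{ImStrip}  - Y_{Imc}  \tag{1} \\
y_{Stp} &= Y_{StpStrip} - Y_{Stpc} \tag{2} \\
e       &= y_{Im} - y_{Stp}        \tag{3}
\end{align}
```

With this convention, a vertical shift of the whole image caused by camera vibration adds the same offset to Y_ImStrip and Y_Imc, so it cancels in Eq. 1 and does not affect the error in Eq. 3.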

Representation of the positioning error in physical units
During the periods when the strip is not positioned under the camera, one of the roller table rolls is visible. The diameter of the roll can be measured in the image and is equal to 140 px, while its physical dimension is 400 mm. Therefore, the resolution of the images is approximately 2.9 mm/px.
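The conversion factor follows directly from the two roll diameters given above. A minimal sketch of the arithmetic, with hypothetical helper names:

```python
ROLL_DIAMETER_MM = 400.0   # physical roll diameter, from the text
ROLL_DIAMETER_PX = 140.0   # roll diameter measured in the image, from the text

MM_PER_PX = ROLL_DIAMETER_MM / ROLL_DIAMETER_PX  # about 2.857 mm/px


def px_to_mm(value_px):
    """Convert a length measured in image pixels to millimetres."""
    return value_px * MM_PER_PX


def mm_to_px(value_mm):
    """Convert a length in millimetres to image pixels."""
    return value_mm / MM_PER_PX


print(f"resolution: {MM_PER_PX:.2f} mm/px")
print(f"2 px  = {px_to_mm(2):.2f} mm")
print(f"5 mm  = {mm_to_px(5):.2f} px")
```

Under this factor, a 2-pixel margin corresponds to roughly 5.7 mm, and 5 mm corresponds to 1.75 px.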

Performance evaluation
The proposed method is evaluated by comparing the estimated values of the strip position, the y-coordinate of the mill reference position, and the positioning error to the expected values, calculated from the ground truth images, by the mean absolute error (MAE) and the standard deviation (STD). The computational burden was also analyzed through frame rate analysis to assess the real-time viability of the application, and the execution was carried out on an NVIDIA GeForce GTX1080 GPU. Also, each architecture was evaluated on the test sets with metrics commonly applied to semantic segmentation based on convolutional neural networks. These metrics are recall (Eq. 4), Jaccard index (Eq. 5), F1 score (Eq. 6), and specificity (Eq. 7), and they are determined from the pixel predictions of the segmented mask, which are the values of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) [7,24,28,33,34].

Fig. 8 Schematic representation of the global and local coordinate systems used to estimate the strip position relative to the mill components of the a actual image and b setpoint image. In both figures, XOY is the global coordinate system and xcy is the local coordinate system

• Recall

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$

• Intersection-over-Union (IoU) or Jaccard index

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{5}$$

• F1 score

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{6}$$

• True negative rate (TNR) or specificity

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{7}$$
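The evaluation metrics named in this section can be computed from the confusion counts with a few lines of Python. The sketch below uses the standard definitions of recall, IoU, F1 score, and specificity. The choice of the population standard deviation of the absolute errors for STD is an assumption, since the text does not state which deviation is reported, and the example counts are hypothetical.

```python
from statistics import mean, pstdev


def segmentation_metrics(tp, tn, fp, fn):
    """Pixel-level metrics from true/false positive/negative counts."""
    return {
        "recall": tp / (tp + fn),
        "iou": tp / (tp + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "specificity": tn / (tn + fp),
    }


def mae_and_std(estimated, expected):
    """Mean absolute error and (assumed) population STD of the absolute errors."""
    abs_errors = [abs(e - x) for e, x in zip(estimated, expected)]
    return mean(abs_errors), pstdev(abs_errors)


# Hypothetical counts for a small, hard-to-segment object on a large
# background: few false positives, many false negatives.
metrics = segmentation_metrics(tp=8, tn=80, fp=2, fn=10)

# Hypothetical position estimates against ground truth values (pixels).
mae, std = mae_and_std([10, 12, 9], [10, 10, 10])
```

With these toy counts the specificity stays high while recall and IoU are low, which is the pattern expected when a small object is frequently missed but the background is rarely mislabeled.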

Results and discussion
This paper presents an accurate and automatic strip positioning error estimator for a rolling mill process. The system as a whole is composed of distinct steps, as discussed in Section 3, and the performance of each step can be evaluated separately. In this section, the obtained results are discussed in the following order: (i) steel strip edge and mill structure location estimation and (ii) steel strip misalignment evaluation. The results presented in Section 4.1 refer to the test set, while the analyses of Section 4.2 are based on both the test and training sets.

Steel strip edge and mill structure location estimation
At the base of the developed system lies a semantic segmentation neural network. The purpose of this step is to classify each pixel in the current frame as an object of interest (steel strip or mill structure component) or as background. By performing this classification, it is possible to reject most of the irrelevant information present in the image. The results attained by the CNN model do not hold much meaning for the final application on their own. For this reason, they are presented together with the results of the position estimation. Results for the first region of interest are shown in Table 2, while Table 3 shows results for the second region. Both regions are judged by the same metrics. The first four result columns (recall, IoU, F1 score, and specificity) refer to the segmentation step. In contrast, the last two (MAE and STD) relate to the position estimation errors of the components. The identification of the strip by the convolutional network is performed with very high accuracy. For all types of architectures and numbers of convolutional filters, all four considered metrics exhibit values above 99%. Since the effectiveness of every model is virtually the same, the model choice should be based on the remaining criteria. Before turning to the position estimation accuracy, the computational burden of the system must be assessed, since real-time, or at least near-real-time, execution is sought. The frame rate of the camera feed is 30 fps. Therefore, a model with an execution rate above this value is desirable, as this rate refers only to this intermediate step. The final execution frequency will be presented later in this section. Considering this and the achieved strip estimation accuracy, the selected model configuration, which best balances both objectives, is Model 3. With architecture type 1 (one encoder/decoder pair) and 8 convolutional filters in each layer, the model achieved a mean absolute error of 3.4 (± 4.0) mm and a frame rate of 40.8 fps.
For the identification of the mill structure, the segmentation process yields very poor results. There are two main reasons behind this inefficiency. First, there is a severe imbalance between the number of pixels in the reference polygon and in the background. Second, and most importantly, the tonality of the reference pixels is not distinct from that of other portions belonging to the background. These assumptions are confirmed by analyzing the results. The system exhibits a negligible false positive rate and a high false negative rate, which translates to mediocre recall and IoU while keeping a high specificity. In other words, the network tends to wrongly classify pixels of the region of interest as background, but it rarely misclassifies background pixels. Under these circumstances, the reference polygon could still be successfully identified after the morphological operations and the grouping into connected components. Therefore, the reference position, which is derived from the polygon centroid, could be estimated. Regarding the frame rate, since the input size is relatively small, all models are lightweight, with execution rates above the requirements of the system. The model selection can therefore be based solely on the average error and its dispersion. The most suitable network was Model 12, with 2 encoder/decoder pairs and 64 filters in each layer, attaining an error of only 4.1 (± 3.8) mm and a frame rate of 71.0 fps.
The results of the position estimation of the strip, by Model 3, and centroid, by Model 12, are also exemplified by the samples presented in Fig. 9, for ROI1, and Fig. 10, for ROI2. Both figures contain four pairs, from which we can observe the input of the region of interest on the left and the output on the right. The output image consists of the input image overlaid by the segmentation classification and the position estimated by the whole algorithm, including the morphological operations mentioned in Section 3.4.
The estimated position is shown as a black line, for the strip position, and a black dot, for the centroid position. The system demonstrated robustness by correctly placing the black line and black dot even when the segmentation step provides insufficient results. The proposed method can estimate the strip and centroid positions even in a complex environment, such as the steam presence (Fig. 9 pair 2b, and Fig. 10a, b, and d), mill structures reflecting the strip incandescence (Fig. 9 pairs 1a and 1b) and water presence over the strip (Fig. 9 pairs 1a, 1b, and 2a).

Fig. 10 Pairs of ROI2 containing the acquired image, on the left, and the semantic segmentation prediction, on the right. The black dot over the prediction image indicates the position of the centroid estimated by the system. Steam presence can be observed in pairs 1a, 1b, and 1c
The expected and estimated values in each frame, as well as the absolute difference between them, are shown in Fig. 11. The top graph shows the results for the steel strip, while the bottom one refers to the centroid of the polygon. For clarity, the graphs also present a magnified view of a stretch of the signal. As can be observed, the estimations follow the expected values closely.

Steel strip misalignment evaluation
The goal of the present work is to determine the misalignment of the hot strip during the rolling process. This misalignment is assessed by comparing the steel strip location, after the influence of mechanical vibration is mitigated, with the setpoint images. Graphical results of the system can be observed in Fig. 12. Figure 12a shows the actual deviation of the strip from its desired position and the deviation estimated by the developed system. The absolute difference between these quantities is depicted in Fig. 12b, whereas a histogram portrays its distribution in Fig. 12c.
As can be observed in the histogram, the vast majority of the data deviates from the desired value by less than 5 mm. In fact, this deviation corresponds to less than 2 pixels, and 97% of the samples lie within this range. The mean absolute error and standard deviation achieved by the method were 2.06 (± 1.7) mm. Moreover, the system is capable of performing the strip misalignment estimation in real time, at a rate of 26.4 fps.
Although the performance of the manual operator cannot be assessed due to lack of data, the available literature presents studies of human capability regarding reaction time. Studies of visuomotor reaction time (VMRT), in which participants executed a visuomotor reaction task in response to visual motion stimuli, showed that badminton players, table tennis athletes, and non-athletes presented VMRT of 244.2 ms, 258.4 ms, and 273.6 ms, respectively [16][17][18]. As a rather distant basis for comparison, if a human operator performed similarly to a badminton player, which corresponds to approximately 4.1 reactions per second (1/0.2442 s), the system (26.4 fps) would still outperform the human operator. Additionally, the majority of the data diverges from the expected value by less than 5 mm, which is equivalent to less than 2 pixels in the images. Considering that the strips are rolled at 10 m/s, it is reasonable to assume that the system also outperforms the current manual operation in terms of precision.
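The comparison above can be made more tangible by computing how far the strip travels between two consecutive decisions. The sketch below uses only the figures quoted in the text (the VMRT values, the system rate of 26.4 fps, and the rolling speed of 10 m/s). Treating one reaction time or one frame period as the interval between decisions is a simplifying assumption, and the names are hypothetical.

```python
VMRT_MS = {
    "badminton players": 244.2,
    "table tennis athletes": 258.4,
    "non-athletes": 273.6,
}
SYSTEM_FPS = 26.4
STRIP_SPEED_M_PER_S = 10.0


def strip_travel_m(interval_ms):
    """Distance travelled by the strip during one decision interval."""
    return STRIP_SPEED_M_PER_S * interval_ms / 1000.0


system_period_ms = 1000.0 / SYSTEM_FPS
system_travel_m = strip_travel_m(system_period_ms)
human_travel_m = {
    group: strip_travel_m(vmrt) for group, vmrt in VMRT_MS.items()
}

print(f"system: {system_period_ms:.1f} ms per estimate, "
      f"{system_travel_m:.3f} m of strip")
for group, distance in human_travel_m.items():
    print(f"{group}: {VMRT_MS[group]:.1f} ms per reaction, "
          f"{distance:.3f} m of strip")
```

Under these assumptions the strip advances about 0.38 m per system estimate, against roughly 2.4 m or more per human reaction.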
Therefore, the presented approach can successfully estimate the steel strip misalignment with precision and response time well beyond the current human operator capabilities and, consequently, process specification.

Conclusions
The system presented in this paper aims to estimate the positioning error of steel strips. The method employs semantic segmentation to estimate the strip location from images of the process and to identify fixed parts of the mill structure, which are used to attenuate the influence of camera vibration on the results. The identification of the strip by the convolutional network achieved high accuracy for all tested models, in contrast to the mill structure, for which the CNN achieved satisfactory results only for a few models. However, the application of morphological operations refined the semantic segmentation predictions. As a result, the selected models located the strip and the mill elements with mean absolute errors of 3.4 (± 4.0) mm and 4.1 (± 3.8) mm, respectively.
Additionally, the method can successfully estimate the steel strip misalignment, with 97% of the estimated values within a 2-pixel margin. Concerning the positioning error, the mean absolute error attained by the system is 2.06 (± 1.7) mm. Regarding the execution time, the method presented a reduced computational cost per frame, with an approximate frame rate of 26 fps. All things considered, the approach also proved to be a robust real-time solution, as the dataset is composed of low-resolution images acquired in a complex environment. Future work will address the integration of the developed solution into a feedback control system designed to reject strip positioning errors.