1 Introduction

Optical character recognition (OCR) is an important research area within pattern recognition [1, 2]. It combines digital image processing, computer graphics and artificial intelligence. Digital recognition [3, 4] is an important research direction and component of OCR, and it has therefore attracted a large number of scholars.

In an intelligent express sorting system, once the express end sorting label code is obtained, the delivery address of the parcel is known. If express sorting can be processed automatically by computer, with the express end sorting label code automatically extracted from the express bill and accurately identified, the time and effort of manual data processing can be saved. Automatic recognition of the express end sorting label code is therefore especially important in an intelligent express sorting system.

However, traditional digital recognition methods, such as template matching, structural feature analysis and logical reasoning, find feature extraction very difficult, and their recognition results are not good. The rise of deep learning, represented by the convolutional neural network (CNN) [5, 6], solved the problem of difficult feature extraction. A large number of scholars therefore proposed using a projection method to segment individual characters and then feeding them to a CNN for classification. But in real life, objects such as ID numbers, license plate numbers and house numbers are often sequences. Unlike previous recognition tasks, recognizing these objects requires predicting a series of labels rather than a single label, so their recognition must be treated as a sequence recognition problem. Since a CNN cannot handle strings of arbitrary length, it does not by itself solve the sequence recognition problem.

Another important branch of neural networks, the recurrent neural network (RNN), is designed for sequence data. It only requires the input image to be converted into a feature sequence during preprocessing. However, this preprocessing step is independent of the network, so such an RNN cannot be trained end to end. To build an end-to-end system for sequence recognition, some scholars have proposed combining a CNN with an RNN; the resulting model is named the convolutional recurrent neural network (CRNN) [7, 8]. A CRNN can not only be trained end to end, but also recognize sequences of arbitrary length.

Considering the complex background of the express bill, this paper proposes a method that first locates the express end sorting label code region and then recognizes it. To improve recognition efficiency, the traditional digital recognition pipeline is improved: the whole string of digits is treated as a sequence recognition problem. The CRNN model is used. After the input image passes through the feature extraction network and the LSTM recurrent units [9], the entire text image is recognized through translation by the connectionist temporal classification (CTC) algorithm [10]. The dataset is synthesized using SUN database images, of which 640 k images are used for training and 7041 for testing, and the network is trained on this basis. The results show that the proposed method improves the recognition accuracy and processing speed for the express end sorting label code.

The rest of the paper is organized as follows. Section 2 briefly describes the work related to digital positioning and recognition. Section 3 proposes the network structure of this paper and explains each part in detail. The experimental results are presented in Sect. 4, and finally, Sect. 5 gives the summary of this paper.

2 Related works

Digital recognition is an important research direction of OCR. In this paper, express end sorting label code recognition is treated as a continuous sequence recognition problem, without single-character segmentation. A brief introduction to previous works is given as follows:

Traditional digital recognition methods include template matching [11, 12], structural features [13, 14] and the support vector machine (SVM) [15, 16]. Template matching first builds a template for each character and then compares the character to be recognized against the templates. Structural-feature-based recognition matches the character to be recognized against the structural features of the reference digits; although this approach can effectively distinguish similar characters, it is susceptible to noise. SVM recognition also performs reasonably, but it is limited to cases with few samples, requires carefully chosen input parameters, and the selected parameters strongly influence the recognition result. The development of neural networks addresses these problems.

In 1989, LeCun et al. [17] established the modern structure of the CNN and designed the five-layer network LeNet-5, which solved handwritten digit recognition. However, due to the lack of training data and the limited computing power of the time, LeNet-5 could not solve digital recognition against complex backgrounds. In 2012, Krizhevsky et al. [18] proposed the eight-layer CNN AlexNet, which overcame the difficulty of training deep networks. As deep learning technology matured, combining it with digital recognition achieved good results. For example, to recognize handwritten digit strings of unknown length, Gattal et al. [19] proposed three combinations of vertical projection, contour analysis and sliding-window Radon transformation to segment the digit strings, and used an SVM to recognize and verify each segmented digit image. In [20], handwritten Chinese text recognition was performed using a neural network language model (NNLM) and a CNN shape model: the CNN was used for over-segmentation and geometric context modeling, and the NNLM for character recognition. In [21], an aspect-ratio detection method was used to locate the license plate area, vertical projection was used to segment the plate characters, and finally a back-propagation neural network (BPNN) was trained to recognize the plate image.

However, the character segmentation operations in the above papers are susceptible to image shadows, scratches, noise and touching characters, which can lead to poor recognition. Some scholars have therefore begun to treat the recognition of a whole string of digits as a sequence learning problem, without segmenting the characters; this approach is called end-to-end recognition [22]. Even with many interfering factors in the image, it can achieve good recognition in the presence of noise. For example, Li et al. [23] proposed an end-to-end deep neural network that feeds license plate images into convolutional layers to extract regional features, which the RNN for plate recognition can share; this allows the plate to be located and recognized simultaneously in one forward pass. Hu et al. [24] proposed a four-layer network model for end-to-end recognition of Chinese text: a CNN extracts features, a long short-term memory (LSTM) network processes the sequence, a fully connected layer predicts the probability of each character, and CTC computes the sample loss. Li et al. [25] proposed an end-to-end trainable text detector that jointly detects and recognizes words in natural scene images; the network consists of several convolutional layers, a region proposal network, a multi-layer perceptron, RNN encoders and RNN decoders. Wen et al. [26] proposed a CRNN model to recognize CAPTCHAs; the verification code needs no preprocessing and can be recognized end to end, avoiding the traditional localization and segmentation steps.

3 Express end sorting label code recognition system

3.1 Image collection and correction

During actual operation of the express sorting system, a parcel placed on the sorting car may yield a tilted image because of positional differences, and the tilted image needs to be corrected to improve the accuracy of subsequent digital localization and recognition. This paper first collects several express images from the industrial scene; to protect the privacy of recipients and senders, their real information has been removed from all collected images. The Hough transform algorithm is then used to correct the rotated images [27]. The result of image correction using the Hough transform is shown in Fig. 1.

Fig. 1 Results of image correction

Figure 1a shows the original image; Fig. 1b shows the corrected image, obtained by rotating the original by the line inclination angle estimated with the HoughLines transform.
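To make this step concrete, the following is a minimal deskewing sketch built on OpenCV's HoughLines; the function name deskew, the Canny thresholds and the vote threshold are placeholders of ours, and taking the strongest detected line as the skew estimate is an assumption, not the exact procedure of [27].

```python
import cv2
import numpy as np

def deskew(image):
    """Estimate the dominant line angle with the Hough transform and
    rotate the image so that this line becomes horizontal."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)  # (rho, theta) pairs
    if lines is None:
        return image                      # no line found, leave image as-is
    theta = lines[0][0][1]                # angle of the strongest line's normal
    angle_deg = np.degrees(theta) - 90.0  # deviation from the horizontal
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, M, (w, h), borderValue=(255, 255, 255))
```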

3.2 Express end sorting label code position

The accuracy of coding region localization directly affects subsequent recognition. However, because there is so much interfering information on the express bill, it is difficult to locate the express end sorting label code region accurately in one step, so the coding region is localized in a coarse-to-fine manner. Firstly, the method in [28] is used to reduce interference by denoising the corrected picture. Secondly, the layout of the express bill is exploited: each express image contains a corresponding QR code. This paper therefore first uses the method in [29] to quickly detect the QR code in the image and obtain its location. The upper-left and lower-right corners of the QR code are then used as reference points, from which a rectangular frame can be constructed to enclose the coding region and complete the localization. Figure 2 shows the results of localizing the coding region; the red and green bounding boxes represent the detection results of the QR code and the express end sorting label code, respectively.
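A hedged sketch of this coarse-to-fine localization with OpenCV's QRCodeDetector is given below; the assumption that the label code lies to the right of the QR code and the two scale factors are illustrative placeholders, since the paper does not state the exact offsets.

```python
import cv2

def locate_label_code(image, x_scale=3.0, y_scale=1.2):
    """Detect the QR code, then frame the label code region relative to the
    QR code's upper-left and lower-right corners (placeholder geometry)."""
    found, points = cv2.QRCodeDetector().detect(image)
    if not found or points is None:
        return None
    corners = points.reshape(-1, 2)        # the four QR corners as (x, y)
    x0, y0 = corners.min(axis=0)           # upper-left reference point
    x1, y1 = corners.max(axis=0)           # lower-right reference point
    w, h = x1 - x0, y1 - y0                # QR code size in pixels
    # Hypothetical layout: the code region sits to the right of the QR code,
    # roughly x_scale QR-widths wide and y_scale QR-heights high.
    return image[int(y0):int(y0 + y_scale * h),
                 int(x1):int(x1 + x_scale * w)]
```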

Fig. 2 Express end sorting label code region position

3.3 Express end sorting label code recognition

After the coding region has been localized, the coding sequence must be recognized. The framework of the model is shown in Fig. 3. Firstly, three convolutional layers extract a feature sequence from each input image. Then, each frame of the feature sequence output by the CNN is predicted by the RNN; an LSTM network is used in this layer. Finally, the per-frame predictions of the RNN layer are converted into a real label sequence by CTC. Compared with the traditional approach of segmentation followed by recognition, the recognition performance of the whole network is improved. Each layer of the model is introduced in detail below.

Fig. 3 Framework of CRNN

3.3.1 Convolutional layers

Sequence features are extracted from the input image by a standard convolutional neural network in the convolutional layers of the model. The structure of the convolutional layers is shown in Fig. 4. The model uses the ReLU nonlinear activation function, which replaces all negative values with 0. The ReLU function is given in Eq. (1):

$$ f\left( s \right) = \max \left( {0,s} \right) $$
(1)

where s represents the input variable.

Fig. 4 Structure of convolutional layer

After the convolution and pooling operations, the network outputs 256 feature maps of size 32*8, which form the image feature sequence \( S = \left( {s_{1}, s_{2}, s_{3}, \ldots, s_{T} } \right) \); this sequence is the input of the recurrent layer. The label corresponding to each frame \( s_{t} \) of the image feature sequence is \( L = \left( {l_{1}, l_{2}, l_{3}, \ldots, l_{T} } \right) \).
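The following PyTorch sketch shows how such a feature extractor can turn a 256*64 input into a per-column feature sequence; the three conv+pool stages follow the description above, but the kernel sizes and intermediate channel counts are assumptions of ours, not the paper's exact configuration.

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Three conv+pool stages producing 256 feature maps of size 8x32, then
    reshaped column-wise into a feature sequence s_1 ... s_T with T = 32."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

    def forward(self, x):            # x: (batch, 1, 64, 256), height x width
        f = self.cnn(x)              # -> (batch, 256, 8, 32)
        f = f.permute(0, 3, 1, 2)    # -> (batch, 32, 256, 8): width is time
        return f.flatten(2)          # -> (batch, T=32, 2048) frames s_t
```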

3.3.2 Recurrent layers

The input of the recurrent layer is the image feature sequence output by the convolutional layers; its task is to predict the label distribution \( l_{t} \) corresponding to each frame \( s_{t} \). In this model, the recurrent unit of the RNN is the LSTM shown in Fig. 5. The LSTM contains three multiplicative gates: the forget gate \( f_{t} \), the input gate \( i_{t} \) and the output gate \( o_{t} \). Each gate consists of an element-wise multiplication and a sigmoid network layer, which can selectively add information to or remove information from the LSTM cell state \( C_{t} \).

Fig. 5 Structure of LSTM

The first step of the LSTM is to filter out some of the information in the previous cell state \( C_{t-1} \) through the forget gate \( f_{t} \). The calculation formula of the forget gate is shown in Eq. (2):

$$ f_{t} = \sigma \left( {W_{f} *\left[ {h_{t - 1} ,s_{t} } \right] + b_{f} } \right) $$
(2)

where \( W_{f} \) and \( b_{f} \) represent the network weights and offsets of the forget gate, respectively, \( h_{t-1} \) represents the hidden state from the previous time step and \( s_{t} \) represents the frame of the image feature sequence at the current time t.

The second step of the LSTM is to update the current cell state \( C_{t} \) with new information via the input gate \( i_{t} \). The calculation formulas for the input gate are shown in Eqs. (3, 4, 5):

$$ i_{t} = \sigma \left( {W_{i} *\left[ {h_{t - 1} ,s_{t} } \right] + b_{i} } \right) $$
(3)
$$ \bar{C}_{t} = \tanh \left( {W_{C} *\left[ {h_{t - 1} ,s_{t} } \right] + b_{C} } \right) $$
(4)
$$ C_{t} = f_{t} *C_{t - 1} + i_{t} * \bar{C}_{t} $$
(5)

where \( \bar{C}_{t} \) is a candidate value created by tanh that is selectively added to the current cell state \( C_{t} \). Unwanted information in the previous cell state \( C_{t-1} \) is filtered out by \( f_{t} *C_{t - 1} \), and \( i_{t} *\bar{C}_{t} \) determines the extent to which the new information is written.

The third step of the LSTM is to output the hidden state \( h_{t} \) of the current time t through the output gate \( o_{t} \). The calculation formulas of the output gate are shown in Eqs. (6, 7):

$$ o_{t} = \sigma \left( {W_{o} *\left[ {h_{t - 1} ,s_{t} } \right] + b_{o} } \right) $$
(6)
$$ h_{t} = o_{t} *\tanh \left( {C_{t} } \right). $$
(7)

The final step of the LSTM is to output the label distribution \( l_{t} \) predicted at the current time t. Its calculation formula is shown in Eq. (8):

$$ l_{t} = \sigma \left( {W_{l} *h_{t} + b_{l} } \right). $$
(8)
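As a worked illustration of Eqs. (2)-(8), the NumPy sketch below performs a single LSTM step; the dictionary-based weight layout is a convention chosen here for readability, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(s_t, h_prev, C_prev, W, b):
    """One LSTM step; W and b hold the weights W_f, W_i, W_C, W_o, W_l and
    the matching offsets from Eqs. (2)-(8)."""
    x = np.concatenate([h_prev, s_t])     # [h_{t-1}, s_t]
    f_t = sigmoid(W['f'] @ x + b['f'])    # Eq. (2): forget gate
    i_t = sigmoid(W['i'] @ x + b['i'])    # Eq. (3): input gate
    C_bar = np.tanh(W['C'] @ x + b['C'])  # Eq. (4): candidate value
    C_t = f_t * C_prev + i_t * C_bar      # Eq. (5): new cell state
    o_t = sigmoid(W['o'] @ x + b['o'])    # Eq. (6): output gate
    h_t = o_t * np.tanh(C_t)              # Eq. (7): hidden state
    l_t = sigmoid(W['l'] @ h_t + b['l'])  # Eq. (8): label distribution
    return h_t, C_t, l_t
```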

3.3.3 Transcription layers

After the series of operations in the convolutional and recurrent layers, the length of the label distribution \( l_{t} \) may not match the length of the real label of the image feature sequence \( s_{t} \), which can cause training to fail. The CTC algorithm is therefore used to decode the label distributions predicted by the recurrent layer into the final recognition result; CTC is specifically designed to handle this misalignment between input and output labels. For example, if the label of an input image is "950113003," the label distribution output after the CNN and RNN operations may have length 11. This means that one label b corresponds to many different character combinations π, and these combinations may be translated into b correctly or incorrectly. To solve this problem, CTC introduces a blank mechanism: a separator, denoted #, is inserted between consecutively recognized characters to distinguish the correct label sequence b from repeated recognitions of the same character. After the blank is introduced, the character combinations π for the input label "950113003" include "99#5011#1#30#03," "950#111#13##00#03," and so on.
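A minimal sketch of this blank mechanism (our illustration, not the paper's code): merging consecutive duplicates and then dropping the separators maps both example combinations above back to the same label.

```python
def ctc_collapse(pi, blank='#'):
    """Collapse a frame-level combination pi to its label sequence b:
    merge consecutive duplicates, then remove the blank separators."""
    out, prev = [], None
    for ch in pi:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return ''.join(out)

assert ctc_collapse('99#5011#1#30#03') == '950113003'
assert ctc_collapse('950#111#13##00#03') == '950113003'
```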

After the various combinations π are obtained, they are translated into the final recognition result. This paper therefore adopts the mapping b = B(π), which removes the repeated characters and the separators from the output label sequence. Given an output label sequence, the sum of the probabilities of all combinations π that map to the correct label sequence b is given in Eq. (9):

$$ p\left( {b|L} \right) = \mathop \sum \limits_{{\pi \in B^{ - 1} \left( b \right)}} \mathop \prod \limits_{t = 1}^{T} y_{{\pi_{t} }}^{t} $$
(9)

where T represents the length of the label sequence L, y denotes the per-frame output probabilities of the image feature sequence, \( \pi_{t} \) represents the label output at time t and \( y_{{\pi_{t} }}^{t} \) represents the probability of outputting the label \( \pi_{t} \) at time t. \( p\left( {b|L} \right) \) also defines the loss of a single label sequence b; what CTC does is minimize this loss via the negative log-likelihood. The CTC loss function \( {\text{ctcloss}} \) is defined in Eq. (10):

$$ {\text{ctcloss}} = - \ln \left( {p\left( {b|L} \right)} \right). $$
(10)
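For illustration, Eq. (10) corresponds to the standard CTC loss found in deep learning frameworks; the PyTorch sketch below assumes ten digit classes plus a blank at index 10 and uses random stand-in network outputs.

```python
import torch
import torch.nn as nn

T, B, C = 32, 1, 11                                # frames, batch, classes
logits = torch.randn(T, B, C, requires_grad=True)  # stand-in for the l_t
log_probs = logits.log_softmax(2)
target = torch.tensor([[9, 5, 0, 1, 1, 3, 0, 0, 3]])   # label "950113003"
ctc = nn.CTCLoss(blank=10)                         # '#' mapped to index 10
loss = ctc(log_probs, target,
           torch.full((B,), T, dtype=torch.long),  # input lengths
           torch.tensor([9]))                      # target lengths
loss.backward()                                    # gradient of -ln p(b|L)
```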

3.4 Algorithm implementation

The algorithm flowchart of the CRNN model proposed in this paper is shown in Fig. 6.

Fig. 6 Algorithm flowchart of CRNN model

Specifically, the model is divided into three parts: the feature extraction network, the recurrent network and the CTC network. The feature extraction network takes 256*64 images as input, and each frame \( s_{t} \) of the extracted feature sequence is both its output and the input of the recurrent network. The output of the recurrent network is the label distribution \( l_{t} \) predicted from \( s_{t} \) at the current time, and the CTC network translates \( l_{t} \) using the blank mechanism. During training, the weights and offsets of the network are first randomly initialized so that the model can be trained iteratively; each subsequent iteration then overwrites the weights and offsets produced by the previous one. This paper sets the number of iterations to 10000 and saves the final layer weights and offsets to a data file, so that in subsequent real-time digital recognition the training process can be skipped and the data file read directly, increasing both the speed and the accuracy of recognition.
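A hedged sketch of this training procedure in PyTorch is shown below; the toy model sizes, the random stand-in batches, the optimizer choice (Adam) and the learning rate are assumptions of ours, and only the 10000 iterations and the saving of the final weights follow the text.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Toy stand-in for the full model: conv features -> LSTM -> class scores."""
    def __init__(self, n_classes=11):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.rnn = nn.LSTM(128 * 16, 128, batch_first=True)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):                     # x: (batch, 1, 64, 256)
        f = self.cnn(x)                       # (batch, 128, 16, 64)
        f = f.permute(0, 3, 1, 2).flatten(2)  # (batch, T=64, 2048)
        h, _ = self.rnn(f)
        return self.fc(h).permute(1, 0, 2)    # (T, batch, classes) for CTC

model = TinyCRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
ctc = nn.CTCLoss(blank=10)

for step in range(10000):                     # 10000 iterations, as in the text
    images = torch.randn(8, 1, 64, 256)       # stand-in for the data loader
    targets = torch.randint(0, 10, (8, 9))    # 9-digit label codes
    log_probs = model(images).log_softmax(2)
    loss = ctc(log_probs, targets,
               torch.full((8,), log_probs.size(0), dtype=torch.long),
               torch.full((8,), 9, dtype=torch.long))
    optimizer.zero_grad()
    loss.backward()                           # overwrite weights and offsets
    optimizer.step()

torch.save(model.state_dict(), 'crnn_weights.pt')  # read back at inference
```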

4 Experiments and results

The CRNN model proposed in this paper is implemented in Python and runs on an Intel Xeon machine with a 2.40 GHz CPU and 32 GB of RAM.

4.1 Datasets and performance

4.1.1 Datasets

The dataset used in this paper is synthesized from SUN database background images and digits; the size of each image is 256*64. The express end sorting label code style strictly follows the coding specifications of Xinjiang STO Express and YT Express. There are two coding forms and two different fonts, and whichever coding form is input, the recognition result is 9 digits. The training dataset contains 640 k images in total, and the model was tested on two different datasets. Some of the training images are shown in Fig. 7.

Fig. 7 The training samples of synthesized dataset

4.1.2 Performance on Free-Type dataset

In this part, 7200 images generated with the Free-Type library were used to test the recognition performance of the model. When generating each digit image, a label matching the content of the generated digit string is attached to the image, and interference factors such as noise are added. The Matplotlib visualization tool is used to display the experimental result plots. The results obtained are shown in Fig. 8 and Table 1.
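A minimal sketch of how such test samples could be generated with a Free-Type font through Pillow; the font file, text placement and noise level below are placeholders, not the paper's actual generation settings.

```python
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def make_sample(code, font_path='DejaVuSans.ttf'):
    """Render a 9-digit code onto a 256x64 canvas and add Gaussian noise;
    the returned label matches the rendered string."""
    img = Image.new('L', (256, 64), color=255)
    ImageDraw.Draw(img).text((10, 10), code, fill=0,
                             font=ImageFont.truetype(font_path, 40))
    arr = np.array(img, dtype=np.float32)
    arr += np.random.normal(0, 15, arr.shape)       # interference noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8)), code

sample, label = make_sample(''.join(random.choices('0123456789', k=9)))
```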

Fig. 8 Comparison plots of the two methods on Free-Type dataset

Table 1 The results based on speed and accuracy

Figure 8a shows that with the method in [20], the recognition accuracy stabilizes at around 95.37% once the model has been trained for 500 epochs. Figure 8b shows that with the model in this paper, the recognition accuracy reaches 96.69% and stabilizes after 8100 epochs. As Fig. 8 and Table 1 show, the CRNN model recognizes the coding sequence better.

4.1.3 Performance on SUN-synthesized dataset

In this part, 7041 images composited from SUN database backgrounds and digits were used to test the recognition performance of the model. The results obtained with the segmentation-then-recognition method and with the end-to-end method are shown in Fig. 9 and Table 2.

Fig. 9 Comparison plots of the two methods on SUN-synthesized dataset

Table 2 The results with different methods

Figure 9a shows that with the method in [20], the recognition accuracy stabilizes at around 95.26% once the model has been trained for 650 epochs. Figure 9b shows that with the model in this paper, tested on the SUN-synthesized test dataset, the recognition accuracy reaches 96.18% and stabilizes after 8260 epochs. As Fig. 9 and Table 2 show, the method in this paper performs better.

4.2 Experimental results and analysis

In this part, several express images were obtained from the on-site working environment of the express end sorting system. The code recognition result is displayed above the located code area. The recognition results for the express end sorting label code are shown in Fig. 10 and discussed below.

Fig. 10 The result of express end sorting label code recognition

Figure 10 shows the recognition results for the express end sorting label code; the red and green bounding boxes represent the detection results of the QR code and the express end sorting label code, respectively. When the coding area is not occluded, clearly visible and not too heavily scratched, the proposed CRNN always recognizes the express end sorting label code correctly. However, in the express bill images numbered 334 and 368, the characters are partially occluded and the coding area is heavily smeared, so the model gives wrong results: in 334, the original "962460039" is recognized as "762460039," and in 368, the original "962460016" is recognized as "962400016." In general, though, the CRNN model still recognizes well and can handle code recognition under rotation, light scratches and similar conditions.

5 Conclusions and future works

This paper proposes a CRNN network model to recognize the express end sorting label code, following the electronic label code standards of two express companies in Xinjiang, YT Express and STO Express, and the model has practical value. Because of the complexity of the information on the express bill, the code is first located and then recognized. The core contribution of the proposed algorithm is to remove the character segmentation step of traditional digital recognition and to simplify the convolutional and recurrent layers of the recognition model. The coding recognition process is thereby simplified to a certain extent, and the influence of improper character segmentation on code recognition is reduced. Experiments on two datasets show that the proposed algorithm recognizes the code better. There are also certain deficiencies: when factors such as scratches and occlusion are pronounced, recognition deteriorates. In the future, image inpainting techniques [30] could be used to achieve better recognition.