Scene text recognition via context modeling for low-quality image in logistics industry

Text recognition has been applied in many fields recently, such as robot vision, video retrieval, and scene understanding. However, minimal research has been conducted in the field of logistics wherein images of express sheets captured by cameras are mostly curved, distorted, and have low resolution. In this study, a new method is proposed to address the aforementioned research gap while simultaneously considering irregular and low-resolution English letters. The entire approach comprises a rectification module, a convolutional neural network (CNN) extractor, a semantic context module (SCM), a global context module (GCM), and a lightweight transformer decoder that can exhibit improved training speed. In particular, we propose the idea of context modeling in our proposed method. (1) The proposed SCM is introduced to capture full-image dependencies and generates rich semantic context information. (2) We propose the GCM, which not only enhances long-range dependencies from the output of SCM but also outputs abundant pixel information to the self-attention decoder. (3) To solve the low-resolution text recognition problem in a large number of express sheet scenes, we propose Chinese datasets for improving intelligent logistics. Experiments conducted on six public benchmarks demonstrate that the developed method achieves better robustness to low-resolution and irregular text images.


Introduction
This paper introduces a new text recognition technique for the logistics industry. Logistics information guides package transportation and records the details needed to contact customers. Over the past 10 years, improvements in internet technology have driven the rapid development of China's e-commerce industry, which has in turn driven the development of the logistics industry. The huge domestic purchase demand generates more than 100 million express packages every day. The logistics industry serves all walks of life and promotes economic development, but handling such a huge number of express packages every day is a challenge. In addition to various automation technologies, scene text recognition (STR) plays an important role: it enables intelligent sorting in distribution centers and allows customer information to be extracted quickly at the final delivery stage, replacing a large number of manual operations and saving companies considerable cost as labor costs continue to rise. In the distribution center, the destination code of each express package is obtained through STR, and the package is then automatically routed to its concentration area to await transportation to the destination. Our survey shows that, although text recognition has been applied in many fields, such as bank note recognition, license plate recognition, and train ticket recognition, minimal research has been conducted in the field of logistics, wherein images of express sheets captured by cameras are mostly curved, distorted, and of low resolution. We aim to improve low-quality text recognition for the logistics industry, especially for curved and low-quality images. Generally, an STR model consists of four parts [1]: a rectification module, a feature extraction module, a sequence module, and a decoder module.
First, a spatial transformer network (STN) [2] is utilized as a rectification module to ease the burden on the next stage. Second, feature extraction networks, which are implemented by deep learning networks such as a residual neural network (ResNet) [3], have achieved remarkable progress. Third, the sequence module typically adopts a recurrent neural network (RNN) [4] to model the order of the text. An RNN is a sequential model that can process sequential data; however, it is slow to train and struggles to capture long-range dependencies. Long short-term memory, which is a variant of RNN, is the most important solution for alleviating these limitations. Nonetheless, it cannot be computed in parallel and is even slower than a plain RNN [5]; hence, a self-attention module is introduced, which supports parallel computing and can model long-range dependency relationships. Finally, the decoder module has two options: the connectionist temporal classification decoder [4] or the attention-based decoder [6]. Although the combination of these modules has advanced STR, many challenges remain, such as irregular, seriously curved, and low-resolution text images. The text in the express sheet image shown in Fig. 1 is extremely blurred, and some Chinese characters cannot be recognized because of the motion blur introduced when the image was captured by the camera.
To meet the huge demand of the logistics industry, two aspects of text recognition technology need urgent attention: recognition accuracy and recognition speed. Currently, recognition accuracy in the logistics industry still needs to be improved. The challenge has two sources. On one hand, the logistics sheet is prone to frictional deformation during transportation, which yields a low-quality image. On the other hand, the images are captured by cameras and tend to be blurred and of low resolution after compression. To address this issue, we conduct research on both datasets and method. In terms of datasets, Chinese datasets for real scene text recognition remain insufficient. Most text recognizers are trained on benchmark datasets and compared by their accuracy on six public evaluation databases. However, these public training datasets do not contain Chinese characters. There are only 26 English letters, but more than 5000 Chinese characters are commonly used in Chinese scenes. Moreover, training images should be as close as possible to real scene images; that is, they should include blurred, missing-stroke, and motion-blurred images to meet performance requirements in Chinese scenes. Hence, we propose a novel Chinese dataset for recognizing low-quality images in Chinese scene text recognition. In terms of method, to improve the recognition accuracy of complex logistics images, we adopt the transformer model to improve global computation and information-carrying capabilities. The transformer proposed by Vaswani et al. [7] uses a self-attention network as its encoder and decoder module to extract the relationships between characters.
The major advantages of transformer-based methods over other models are their global computation and strong information-carrying capabilities; in recent research, transformer-based models have been applied to computer vision tasks [8][9][10][11]. However, the transformer [7] is composed of a stack of six identical layers in each of its encoder and decoder modules, which makes the model heavy and slow. To meet the speed requirement of 10 FPS (frames per second) in the logistics industry, we simplify the transformer architecture [7]: we remove the large encoder and use only the decoder part of the transformer, with only 3 identical layers. By contrast, [12,13] use 12 identical layers in the encoding part and 6 identical layers in the decoding part for text recognition. Although they achieved remarkable results on public datasets, they ignore the speed requirement of practical applications, and [12][13][14][15] all adopt the whole transformer network as the architecture of their text recognition methods. To compensate for the missing transformer encoder, we add two small modules. First, we design a residual crisscross attention in the semantic context module (SCM) to capture full-image dependencies and generate rich semantic context information. We use this residual design instead of recurrent crisscross attention [16], because a residual structure similar to ResNet [3] can effectively mitigate feature degradation and establish long dependencies across the text in a logistics sheet image. This matters especially for address information, which often includes the province, the city, the country, and the detailed home address, forming a very long text sequence. Second, we propose a smaller module, the global context module (GCM), to improve global computation and information-carrying capabilities.
To avoid missing characters and attention drift [14] when recognizing long text sequences in practical scenes, especially the large amount of address information in logistics sheet images, we add learnable character weight information to the self-attention module of the GCM, which emphasizes the character regions in the text sequence and prevents the features from weakening during the operation. Because transformer-based methods are not as powerful as CNNs at acquiring local information, we also utilize a ResNet-based extractor as part of the proposed method.
The major contributions of this study can be summarized in four aspects:
1. We construct a new Chinese dataset for text recognition. This dataset is described in detail in the section "Introduction of datasets", and it focuses on solving the problem of recognizing low-quality Chinese text images in the logistics industry.
2. We propose a novel end-to-end text recognition method that applies the idea of context modeling throughout the entire network and adopts a lightweight transformer-based decoder to accelerate training. The proposed method achieves competitive results on several benchmark datasets owing to its information-carrying capability.
3. To enhance semantic information and obtain richer semantic context information, we propose a semantic context module (SCM) with a residual module to capture full-image context information.
4. In place of the RNN structure, we propose a GCM that is designed to grasp long-range dependencies between pixels and capture essential information for global understanding. To avoid missing characters, learnable position information is added to the self-attention module, which emphasizes the text regions. The GCM connects the SCM and the transformer-based decoder, transmitting rich global context information to the next stage.

Related works
There are three main difficulties in current text recognition technology: the large number of curved text images in natural scenes, the large number of low-resolution text images, and the requirement of faster recognition speed in practical applications. Many related works have focused on these challenges.
For curved text in natural scenes, sending the entire text area to the recognizer yields poor results because of the large number of invalid areas. Shi et al. [17] proposed an automatic rectification module for curved text, in which the image is transformed into a regular and more readable image through a spatial transformation network. This transformation can rectify different types of irregular text to improve the recognition of curved text. To reduce the complexity of the rectification module, Shi et al. [18] further proposed a new rectification network, namely, a thin-plate spline transformer based on an attention mechanism, for recognizing irregular text. The model directly learns the control points of the curved text and then straightens the curved text into a horizontal line. In contrast to [17,18], which rectify the text regions only once, Zhan et al. [19] proposed an iterative rectification framework that repeatedly rectifies the text regions to improve text recognition performance. Although these methods made huge contributions to curved text recognition, they did not address low-quality text images.
To address low-quality text recognition, Zhang et al. [20] proposed a sequence-to-sequence domain adaptation network for recognizing blurred text images. Gated attention similarity units were used to adapt the distribution of attentional information to the corresponding sequence data, acquiring better feature information. Scene text methods seldom combine visual and semantic information, and most methods rarely use semantic information to assist in text recognition. Because an attention-based model performs weakly on low-resolution images whereas a segmentation model can make better use of visual features, Wan et al. [21] combined attention and segmentation mechanisms to obtain better results. Yu et al. [22] proposed a novel semantic reasoning network (SRN) to capture abundant semantic text information, which can effectively combine visual context information and semantic context information; the proposed method segmented characters one by one and then spliced them into horizontal states, achieving better performance on public datasets. Inspired by language models, which can effectively supervise word sequence generation, Qiao et al. [23] added a pretrained language model to supervise the encoding and decoding procedures, supervising the training of predicted semantic information through a pretrained word embedding model; this scheme is better suited to low-resolution text recognition. Zhang et al. [24] proposed a text recognition framework found by automatic search; inspired by neural architecture search, the base framework can be adjusted for different datasets. These models integrate multiple modules to handle low-resolution text and therefore tend to be large. Given the speed requirement of text recognition, some studies have also focused on recognition speed, as follows.
In terms of text recognition speed, Liu et al. [25] found that text features can be encoded into binary format without loss of semantic information, and that real-time inference can be achieved by binary compression with very small memory consumption; however, although the speed is better, its recognition performance on benchmark datasets is not ideal. Because transformer-based methods have the advantage of parallel computation in the field of natural language processing, Zhu et al. [15] directly combined a heavy backbone and the whole transformer network into a new text recognizer, and proposed a hierarchical attention mechanism with four self-attention blocks in the encoding part to describe context information, which is also a heavy model. To make text recognition technology more suitable for real-world applications, Li et al. [26] proposed a transformer-based model in which a self-attention mechanism is applied to text recognition. This model utilized locality-sensitive hashing instead of softmax regression to compress the model. With fewer parameters, the small model achieves better speed, but its recognition performance is not ideal compared with other text recognizers, and it is not effective for low-resolution text recognition. Recently, based on the transformer network, Lee et al. [12] proposed an adaptive 2D positional encoding to strengthen feature extraction and better handle irregular images. It achieved remarkable results on public datasets, especially on irregular datasets; however, it has 12 self-attention layers in its encoding part, while the transformer [7] has only 6. To enhance the performance of the feed-forward network (FFN), Kim et al. [13] modified the encoder part of the transformer by replacing the FFN with a squeeze-and-excitation FFN.
However, they also adopted the overall framework of the transformer, and the inference speed did not meet the requirement of logistics scenes. In general, with self-attention modules, transformer-based methods can achieve state-of-the-art results on text recognition compared with simple CNN-based text recognizers. However, although the transformer performs better, it is also a large model; hence, speed should be taken into consideration in logistics scenes. Text recognition technology is also increasingly tied to real industrial applications in special scenes. Ren et al. [14] applied the transformer network to text recognition for shopping receipts; to solve the problem of attention drift in transformer network, and they proposed

Introduction of datasets
To recognize express sheet information better, we construct a large-scale dataset that contains more than 5000 commonly used Chinese characters. The corpus contains Chinese address information of most cities and counties in 32 provinces in China. This information is collected in the business system of a Chinese express company. The dataset, which contains 5 million images, is randomly divided into two parts: 50,000 images are in the fixed testing sets and the remaining images are in the fixed training sets (Table 1). We construct the dataset by utilizing several functions, such as blur, image angle transformation, motion blur, and missing strokes. The principle is performed as follows.

Blurred images
Blurred images exist widely in real scenes, and many words are difficult to identify. We use different kernels to blur images randomly, and an example is provided in Fig. 2a.
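The random-kernel blur described above can be illustrated with a minimal NumPy sketch. The mean (box) kernel, the candidate kernel sizes, and the function names `box_blur` and `random_blur` are assumptions made for illustration; the kernels actually used to build the dataset are not specified in the text.

```python
import numpy as np

def box_blur(img, k):
    """Blur a 2D grayscale image with a k x k mean kernel (k odd)."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    # Sum the k*k shifted copies of the image, then average them.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def random_blur(img, rng, kernel_sizes=(3, 5, 7)):
    """Pick one of the (assumed) kernel sizes at random and blur with it."""
    k = int(rng.choice(kernel_sizes))
    return box_blur(img, k)

rng = np.random.default_rng(0)
sheet = rng.integers(0, 256, size=(32, 100)).astype(np.float64)
blurred = random_blur(sheet, rng)
```

Larger kernels remove more stroke detail, so sampling the size per image gives a range of blur severities in the training set.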

Image angle transformation
Image angle transformation consists of three algorithms, namely, random angle transformation, affine transformation, and perspective transformation, with angles ranging from 15° to 45°. A corresponding example is presented in Fig. 2b.

Motion blur
In a real scene, the quality of pictures captured by cameras can be degraded because of motion blur. The purpose of motion blur is to blur an image in the left and right directions. The middle column of a kernel is numerically processed and used as a new convolution kernel for operation. The effect is illustrated in Fig. 2c.
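A left-right motion blur of this kind can be sketched as follows. This is only an illustration under assumptions: the kernel here is a k x k matrix whose middle row is filled with 1/k, which smears pixels horizontally; whether the original kernel fills a row or a column, and which kernel size it uses, cannot be confirmed from the text. The names `horizontal_motion_kernel` and `motion_blur` are hypothetical.

```python
import numpy as np

def horizontal_motion_kernel(k):
    """k x k kernel with weight only on the middle row (assumed layout)."""
    kernel = np.zeros((k, k))
    kernel[k // 2, :] = 1.0 / k
    return kernel

def motion_blur(img, k=9):
    """Average each pixel with its k horizontal neighbours.

    Equivalent to filtering with horizontal_motion_kernel(k), using
    edge replication at the left and right borders.
    """
    pad = k // 2
    padded = np.pad(img.astype(np.float64), ((0, 0), (pad, pad)), mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dx in range(k):
        out += padded[:, dx:dx + img.shape[1]]
    return out / k
```

Because only one row of the kernel is nonzero, the loop over `dx` is all that is needed; the vertical direction is left untouched.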

Missing strokes
Missing strokes are common in real scenes, because the sheets are easily rubbed. This situation poses a huge challenge to text recognition, particularly for images drawn from more than 5000 commonly used Chinese characters. We adopt the following method in the current study: we randomly replace pixels in the text area with the background color, which is typically white, to achieve the effect of missing strokes. The effect is depicted in Fig. 2d. The detailed procedure for missing strokes is summarized in Algorithm 1.
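Since Algorithm 1 is not reproduced here, the following is only a rough sketch of the idea of erasing text pixels with the background color. The per-pixel erase probability `p`, the bounding-box format, and the function name `drop_strokes` are assumptions for illustration and may differ from the actual procedure.

```python
import numpy as np

def drop_strokes(img, box, rng, p=0.2, background=255):
    """Overwrite a random subset of pixels in the text area with background.

    img: 2D grayscale image; box: (y0, y1, x0, x1) text area (assumed format);
    p: probability that a pixel inside the box is erased (assumed parameter).
    """
    y0, y1, x0, x1 = box
    out = img.copy()
    region = out[y0:y1, x0:x1]           # a view, so edits land in `out`
    mask = rng.random(region.shape) < p  # pixels chosen for erasure
    region[mask] = background
    return out
```

Erasing independent pixels is the simplest variant; erasing small random patches instead would imitate rubbed-off stroke segments more closely.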

Approaches
The proposed model is divided into five parts: the rectification network, the CNN extractor, the SCM, the GCM, and the transformer-based decoding module. To overcome the influence of irregular images, we first resize the images to 32 × 100 and then resize them to 32 × 64 to find the control points in the rectification module. These points are used to make the text line horizontal. Furthermore, a CNN-based module is utilized to extract 2D features instead of the 1D features that are typically extracted by traditional recognizers. The output of the last layer of the CNN module is 4 × 25, where 25 is the width of the feature map; this width preserves more horizontal pixel information and is beneficial for long text image recognition. We also adopt the SCM to obtain richer semantic information and generate abundant semantic context features. The GCM is added after the SCM to construct long dependencies, serving as the sequence context modeling stage. Subsequently, we adopt a lightweight transformer decoding part with N = 3 blocks, where each block includes a masked multi-head attention mechanism, a multi-head attention mechanism, a feed-forward network (FFN), and layer normalization. The overall framework of the entire network is illustrated in Fig. 3.

Rectification network
The rectification network follows the framework in [2], which is divided into three parts: a localization network, a grid generator, and a sample generator. The localization network detects the control points in the image and outputs their locations. The grid generator calculates the mapping of each point relative to the control points and generates the coordinate positions of the points {P1 . . . Pn}. The sample generator samples the point positions calculated by the grid generator to generate the rectified image. Finally, the original image is rescaled to a fixed size.
(Table note: 's' denotes the stride of the first convolutional layer of each block.)

Backbone network and feature enhancement module
The recognition network is based on Yu et al. [22], which used ResNet50 [28] to extract text image features. The output shape of every layer is provided in Table 2. The shape of the last layer of the extraction module is 4 × 25 × C, where 4 and 25 are the height and width of the feature map, respectively. To obtain richer semantic information, we first propose the feature enhancement module to fuse high- and low-level semantic information. The features of layers 3, 4, and 5 extracted by the CNN are fused into one feature map.

SCM
The SCM is designed as a residual crisscross attention module, shown in Fig. 4, and it contains two convolution modules that enhance feature extraction. We use this residual design instead of recurrent crisscross attention [16], because a residual structure similar to ResNet [3] can effectively mitigate feature degradation and establish long dependencies across the text in a logistics sheet image. This matters especially for address information, which often includes the province, the city, the country, and the detailed home address, forming a very long text sequence.
The crisscross attention [16] proposed for semantic segmentation can capture context information in a more efficient manner, which is presented in Fig. 5.
Crisscross attention first maps X to Q and K, where {Q, K} ∈ R^{C×W×H} and C is the number of channels of the feature map. Then, attention map A is generated by the affinity operation, which is expressed as follows:

$$d_{i,j} = Q_j K_{i,j}^{\top} \qquad (1)$$

where j indexes each position of Q, and Q_j ∈ R^C; K_j is the set of features of K that lie in the same row or column as position j, and K_{i,j} is the i-th element of K_j; d_{i,j} ∈ D is the degree of correlation between Q_j and K_{i,j}. Softmax is then applied to D to compute attention map A.
Meanwhile, V is generated from X by the same operation as Q and K. Context information is then enriched via the aggregation operation. However, a single crisscross attention can only capture horizontal and vertical context information. To address this issue, we propose a residual crisscross attention module that can capture full-image context information to better distinguish between text and non-text areas, particularly for blurred and low-resolution images. Generally, the SCM receives X as input and outputs feature map X′, which is computed by combining X with CCA(X), where CCA(•) denotes the crisscross attention function and ⊕ denotes element-wise addition.
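As a rough illustration of the crisscross attention described above, the following NumPy sketch lets every position attend only to the positions in its own row and column. The projection matrices `wq`, `wk`, and `wv` stand in for the 1 × 1 convolutions and are hypothetical names. The plain sum in `residual_cca` is an assumption about how the residual connection might be wired; it omits the two convolution modules and is not the exact module of Fig. 4.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def crisscross_attention(x, wq, wk, wv):
    """x: (C, H, W); wq, wk: (Cq, C); wv: (C, C). Returns (C, H, W)."""
    _, h, w = x.shape
    q = np.einsum("dc,chw->dhw", wq, x)
    k = np.einsum("dc,chw->dhw", wk, x)
    v = np.einsum("dc,chw->dhw", wv, x)
    out = np.zeros_like(v)
    for i in range(h):
        for j in range(w):
            # Row i plus column j, with (i, j) itself counted once: H + W - 1 positions.
            rows = np.array([i] * w + [r for r in range(h) if r != i])
            cols = np.array(list(range(w)) + [j] * (h - 1))
            keys = k[:, rows, cols]            # (Cq, H + W - 1)
            vals = v[:, rows, cols]            # (C,  H + W - 1)
            d = q[:, i, j] @ keys              # affinity scores d_{i,j}
            out[:, i, j] = vals @ softmax(d)   # aggregation with attention map A
    return out

def residual_cca(x, wq, wk, wv):
    """Assumed residual form: element-wise addition of X and CCA(X)."""
    return x + crisscross_attention(x, wq, wk, wv)
```

A single pass reaches only the row and column of each position, which is why one crisscross attention alone cannot see the full image.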

GCM
The GCM includes a context modeling unit, a self-attention part, and a position encoding module, as shown in Fig. 6. Inspired by [30], we adopt the global context block as the context modeling unit, which can model long-range dependencies. We utilize element-wise addition as the connection, which broadcasts the attention values during addition and helps the model focus on crucial clues about unique character features. To handle long text sequences and attention drift, we add learnable character weight information to the self-attention module of the GCM, as follows.
After the global context block, we compress the height of the feature map to one and obtain a feature sequence that represents the text sequence. Traditional image algorithms process the image into a binarized feature map, so that each pixel value in the feature map lies in the [0,1] interval; the closer the value of X_{i,j} is to 1, the higher the probability that it is a text pixel, where X_{i,j} denotes the pixel value in the binarized map.
As shown in Fig. 7, we use a better-suited function, f(x), for binarization, whose values lie in the [0,1] interval. The function is differentiable, so its optimal parameters can be learned during training. A better threshold can therefore be learned, enabling the feature to be dynamically processed into a binarized feature map in which the text region is presented more clearly. The function is given by Eq. (3), where X_{i,j} denotes the pixel value in the feature map after the global context block. After Eq. (3), we get the binarized feature map θ = (θ_1, . . . , θ_L), where L represents the width of the feature map. The closer a value in the binarized feature map is to 1, the higher the probability that it is a text pixel; the binarized feature map thus plays the role of positional weights. Inside the self-attention module, we get the text sequence map h = (h_1, . . . , h_L) after the self-attention operation; meanwhile, we add the learnable position information to the feature map h by weighting h with θ, where * represents the Hadamard product. By assigning position weight information to the self-attention mechanism, the disappearance of characters during the operation can be avoided and the character area becomes more prominent; if a value in the binarized map is close to 1, it emphasizes that this position is a text region.
Fig. 7 Illustration of f(x), which is utilized for binarization processing; the closer the value is to 1, the higher the probability that it is a text pixel. We set the value of k to 80.
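To make the positional weighting concrete, the sketch below uses a steep sigmoid as the binarization function and then applies the resulting weights to the self-attention output with a Hadamard product. The sigmoid form 1 / (1 + exp(-k (x - t))) and the threshold value `t` are assumptions borrowed from common differentiable-binarization practice, not the paper's Eq. (3); only k = 80 comes from the Fig. 7 caption. In a trained model the threshold would be a learned parameter rather than a constant.

```python
import numpy as np

def soft_binarize(x, k=80.0, t=0.5):
    """ASSUMED binarization: steep sigmoid around threshold t, values in [0, 1]."""
    z = np.clip(k * (np.asarray(x, dtype=np.float64) - t), -60.0, 60.0)
    return 1.0 / (1.0 + np.exp(-z))

def apply_position_weights(h, theta):
    """Hadamard weighting of the sequence map h (L, D) by theta (L,)."""
    return h * theta[:, None]

# Usage: a height-compressed feature sequence of width L = 25 (hypothetical values).
rng = np.random.default_rng(0)
x_seq = rng.random(25)                 # stands in for the global context block output
theta = soft_binarize(x_seq)           # positional weights, close to 0 or 1
h = rng.normal(size=(25, 512))         # stands in for the self-attention output
h_weighted = apply_position_weights(h, theta)
```

With a large k, values on either side of the threshold are pushed towards 0 or 1 while the function stays differentiable, which is what allows the threshold to be trained.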
In contrast with the traditional RNN module, GCM can be trained in parallel and exhibits a strong long-range dependency capacity.

Decoder
Through the GCM, the features that we acquire carry better global information. Considering the good performance of the transformer in the field of natural language processing (NLP), we directly select transformer-based blocks as the decoder instead of a traditional attention-based decoder. Compared with the fundamental block in [7], we simplify the decoder part of the transformer to only three identical layers, which yields a lightweight decoder. Apart from the multi-head attention and FFN, a masked multi-head attention is also inserted to model dependencies between different decoding locations. In the decoder part, a residual connection [29] is applied around each sublayer, followed by layer normalization, as shown in Fig. 8. In the current study, we use H = 8 parallel attention heads in the multi-head attention mechanisms. The output of each sublayer can be summarized as LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sublayer itself, that is, the multi-head attention or the FFN.
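The sublayer rule LayerNorm(x + Sublayer(x)) can be sketched in a few lines of NumPy. The layer normalization below omits the learnable gain and bias for brevity, and the FFN weights and sizes are hypothetical; it illustrates the wiring rather than the trained decoder.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_sublayer(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward network with a ReLU in between."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(25, 16))                    # 25 positions, toy width 16
w1, b1 = rng.normal(size=(16, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, 16)), np.zeros(16)
y = residual_sublayer(x, lambda t: ffn(t, w1, b1, w2, b2))
```

The same wrapper would be applied to the masked multi-head attention and the multi-head attention sublayers in each of the three blocks.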

Positional encoding
The transformer dispenses with the RNN, whose greatest advantage is its implicit memory of the order of data in a time series. Without position embedding, exchanging the positions of two characters merely swaps the corresponding values of the attention map, so no sequence order information is retained. Hence, we introduce the following positional encoding method [7] to add position information. The formula is as follows:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$

where pos is the position of the token in the sequence, d_model = 512, and i = 0, 1, …, 255.
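The sinusoidal encoding of [7] can be computed directly from the formula. The sketch below is a minimal NumPy version; the function name `positional_encoding` and the example sequence length of 25 are illustrative choices.

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """Return a (max_len, d_model) table of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model / 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                     # even dimensions 2i
    pe[:, 1::2] = np.cos(angle)                     # odd dimensions 2i + 1
    return pe

pe = positional_encoding(25)   # one row per decoding position
```

Each row is added to the embedding at the same position, so two identical characters at different positions receive different inputs.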

Masked multi-head attention
Masked multi-head attention is used to model dependencies between different decoding locations. It ensures that the prediction at a given time step depends only on information from previous time steps. A mask is utilized to stop each position from attending to the positions after it.
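The effect of the mask can be seen in a single-head sketch of scaled dot-product attention. This is a simplified illustration (one head, no learned projections); the function names are hypothetical.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) mask: position t may attend to positions 0..t only."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: (n, d). Returns the (n, d) output and the (n, n) weights.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(causal_mask(n), scores, -np.inf)  # hide later positions
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Setting the masked scores to negative infinity makes their softmax weights exactly zero, so changing a later character cannot alter the prediction at an earlier step.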

Multi-head self-attention
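Analogous to the simultaneous use of multiple filters in a CNN, multi-head attention helps a network capture richer feature information and enables the model to attend to different aspects of the information. Multi-head attention can be summarized as [7]:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^{O}, \qquad \mathrm{head}_h = \mathrm{Attention}(Q W_h^{Q}, K W_h^{K}, V W_h^{V}),$$

where Attention is the scaled dot-product attention, which is fast to compute with matrix multiplication; the scaling keeps the dot products from growing too large. The scaled dot product accepts three inputs: queries q_i ∈ R^d, i ∈ [1, λ], and a set of key-value pairs of d-dimensional vectors {(k_j, v_j)}, j = 1, 2, ..., N (λ denotes the number of queries, and N indicates the number of key-value pairs). Hence, the scaled dot product can be described as follows:

$$\mathrm{Attention}(q_i, K, V) = \sum_{j=1}^{N} \alpha_{ij}\, v_j, \qquad \alpha_{ij} = \frac{\exp\left(q_i \cdot k_j / \sqrt{d}\right)}{\sum_{j'=1}^{N} \exp\left(q_i \cdot k_{j'} / \sqrt{d}\right)},$$

where α is the attention weight,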

Loss function
Finally, the cross-entropy loss between the ground truth label y and the recognition result s, computed over the decoding positions i, is adopted as the objective function of our model.
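A cross-entropy objective over decoding positions can be sketched as below. The sum over positions (rather than a mean), the small epsilon, and the function name `sequence_cross_entropy` are assumptions made for illustration; the exact normalization used in training is not stated in the text.

```python
import numpy as np

def sequence_cross_entropy(probs, labels, eps=1e-12):
    """Cross-entropy summed over decoding positions.

    probs: (T, num_classes) predicted character distributions s_i;
    labels: (T,) integer ground-truth character indices y_i.
    """
    picked = probs[np.arange(len(labels)), labels]   # probability of the true class
    return float(-np.log(picked + eps).sum())

# Usage: three decoding steps over a toy alphabet of four characters.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.8, 0.05, 0.05],
                  [0.25, 0.25, 0.25, 0.25]])
labels = np.array([0, 1, 2])
loss = sequence_cross_entropy(probs, labels)
```

The loss shrinks as the probability assigned to the correct character at each position approaches 1.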

Commonly used evaluation datasets
To evaluate the robustness and effectiveness of the proposed model, we analyze its performance on the following six public databases:
IIIT5K [31] includes 3000 images for testing. The images are street view and digital images.
SVT [32] includes 647 images for evaluation, many of which are blurred and of low resolution.
IC13 [33] inherits some of its images from IC03 [34]; we remove images containing non-alphanumeric characters and do not use a dictionary.
IC15 [35], which was collected with Google Glass, is the most complex dataset in recent years. Most images have varying levels of distortion and blurring.
SVT-P [36] contains 645 cropped images from Google Street View for evaluation; the images are distorted and taken from different perspectives.
CUTE80 [27] includes 288 cropped images for evaluation. Most of the images have a complex background, perspective distortion, and low resolution.

Data augmentation
To increase image diversity and achieve a low-quality image condition, we apply several methods to the original images, including random angle, affine, and perspective transformations, to realize image rotation and scaling. We also use two methods, namely, Gaussian blur and motion blur, to add noise to the original images and blur the images.
Fig. 9 The effectiveness of the rectification module in irregular images which are from CUTE80

Implementation details
Our method is trained on SynText [37] and Synth90K [38], with approximately 12 million images as the training set. The experiments are conducted on an NVIDIA V100 GPU with 32 GB of memory. We use Adam [39] as the optimizer, with a learning rate of 0.001. The final decay is 0.0005, and the batch size is 512.

Effectiveness of the rectification module
Considering that real scenes contain many irregular images, we select a representative dataset, CUTE80, as the testing set to verify the effect of the rectification module. As shown in Fig. 9, the irregular images in Examples 1-6 are effectively rectified into a horizontal state. For instance, "football" in Example 6 is rectified into a nearly horizontal state by the rectification module.
To verify the contribution of the rectification module to the recognition capability of the overall network, we compare the model with and without the rectification module. The performance comparison results are presented in Table 3. After adding the rectification module, accuracy improves by 0.7%, 0.5%, 1%, 1.3%, and 1.4% on the SVT, IC13, IC15, SVTP, and CUTE80 datasets, respectively. The data show that the rectification module has a significant effect on the recognition of curved and irregular text.
We also perform an experiment to observe the features of the first layer in the CNN extractor module. The heat map demonstrates that the rectification module has an evident effect on irregular images. An example is presented in Fig. 10; the raw image is Example 6 in Fig. 9.

Effectiveness of the SCM
To verify the role of the proposed SCM, we compare two models, with and without the SCM. We select SVT, SVTP, and CUTE80 as the testing set. Among them, the SVT dataset contains many severely blurred images, while SVTP has many low-resolution images. Moreover, CUTE80 contains many irregular images, some of which are also of low resolution. The experiment shows that adding the SCM effectively improves the recognition of low-quality images, achieving 0.4%, 3.1%, and 3.5% improvement on the SVT, SVTP, and CUTE80 datasets, respectively, as indicated in Table 4. The results demonstrate that the SCM can improve the performance of the entire network by enriching semantic information. We select some challenging images to evaluate the effect of the SCM. The feature heat maps generated with and without the SCM are shown in Fig. 11. Both heat maps are overlaid with the feature maps after the first layer in the CNN extractor to show text information. Evidently, the attention features are more robustly concentrated on the text area with the SCM. The images are clearer and carry more abundant text information than without the SCM, particularly when the image is irregular and blurred, such as "CLUB" in Example 1, Fig. 11a, and "CUISINE" in Example 3, Fig. 11c. This indicates that the SCM helps to better distinguish between text and non-text areas.
Fig. 11 Feature heat maps generated with SCM and without SCM

Effectiveness of the GCM
To verify the effectiveness of the GCM in sequence context modeling, we examine the effects of adding and removing the GCM. The number of fundamental blocks in the self-attention module is also examined. We first remove the GCM. The accuracy indicated in Table 5 shows that the performance deteriorates when the GCM is removed. Moreover, the data show that the model reaches its optimal state when the number of blocks is one. Considering the training speed of the entire network, we set the number of heads H in multi-head attention to 8, which is the best choice. When the number of blocks is increased to three, the results are relatively close to those obtained with one block. Hence, increasing the number of blocks brings no clear benefit. Compared with the model without the GCM, the recognition rates improve by 2%, 0.4%, 1.5%, and 2.5% on the IC15, CUTE80, IC13, and SVTP datasets, respectively. This finding implies that adopting the GCM in sequence context modeling helps improve the performance of the model by enhancing the capability for long-range dependencies.
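To make the two hyperparameters concrete, the following NumPy sketch stacks a configurable number of generic multi-head self-attention blocks with residual connections. It is a textbook formulation with random weights, not the authors' GCM; the sequence length (25), feature width (512), and all function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    """x: (T, D) sequence; wq, wk, wv, wo: (D, D) projection matrices."""
    t, d = x.shape
    dh = d // num_heads

    def split(m):
        # (T, D) -> (num_heads, T, dh)
        return m.reshape(t, num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)  # (H, T, T)
    attn = softmax(scores, axis=-1)  # every position attends to all others
    out = (attn @ v).transpose(1, 0, 2).reshape(t, d)  # merge heads
    return out @ wo

def stack_blocks(x, num_blocks, num_heads, rng):
    d = x.shape[1]
    for _ in range(num_blocks):
        # Random weights stand in for trained parameters.
        wq, wk, wv, wo = (rng.standard_normal((d, d)) / np.sqrt(d)
                          for _ in range(4))
        x = x + multi_head_self_attention(x, wq, wk, wv, wo, num_heads)
    return x

rng = np.random.default_rng(0)
seq = rng.standard_normal((25, 512))  # hypothetical feature sequence
out = stack_blocks(seq, num_blocks=1, num_heads=8, rng=rng)
print(out.shape)
```

Because each attention row spans the whole sequence, a single block already connects distant positions, which is consistent with the observation that extra blocks add cost without a clear accuracy gain.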

Influence of the number of fundamental blocks in the decoder
To evaluate the hyperparameters of the decoder, we test its accuracy on IC13 and CUTE80. As indicated in Table 6, we set the number of blocks to 1, 3, and 6, and the number of heads H to 8 and 16. Evidently, the model reaches its best state when the number of blocks is three and H is eight. When we increase the number of fundamental blocks further, the results become worse and the training speed becomes slower, implying that a higher number is not necessarily better. This finding is consistent with the experimental results in [7], and a smaller number of blocks accelerates training.
Table 5 note: Bold marks the accuracy obtained when the number of blocks is set to 1 and H is set to 8, which is the optimal setting in our method; "n-GCM" denotes a GCM with n fundamental blocks in the self-attention module.
Table 6 note: Bold marks the accuracy obtained when the number of blocks is set to 3 and H is set to 8, which is the optimal setting in our method.

Comparisons in our proposed dataset
Most text recognizers are trained on benchmark datasets, such as SynText, Synth90K, and SynthAdd, and are then compared by accuracy on six public evaluation datasets. However, these public training datasets do not contain Chinese characters. Hence, Chinese scenes require a Chinese dataset, and we use our proposed dataset as the training set to compare the ability of text recognizers.
To demonstrate the performance of our proposed model on our proposed dataset, we compare it with Aster [18] and DAN [42]. All models are trained on the training set of our proposed dataset and tested on its testing set. The results are presented in Table 7, where Data-Aug denotes the data augmentation function.

As shown in Table 7, the comparison of Aster and our model shows that our proposed model produces better recognition results, with a 6.9% improvement on the proposed dataset. Our model also performs better than DAN, with a 2.8% improvement.
Fig. 12 The comparison between DAN and our model in real express sheet images; the left side shows the recognition results of DAN, and the right side shows the recognition results of our proposed model
To explore the capability of our proposed model in recognizing real Chinese express sheet images, we compare it with DAN. Both models are first trained on our proposed dataset with the same settings and then tested on real express sheet images. The recognized text images are cropped from the real express sheet images. In Fig. 12, the left side shows the recognition results of DAN, while the right side shows those of our proposed model. As shown in Fig. 12, DAN erroneously recognizes the characters in red font, whereas our model gives the correct results in yellow font. Our proposed method thus provides better results on these challenging express sheet images.
Fig. 13 The sample of express sheet image, in which we mosaicked sensitive information such as name and telephone number to prevent disclosure of private customer information

Recognition of real logistics sheet images
Each express package carries an express sheet, which includes important shipping and customer information, such as sender information, recipient information, and the shipping destination code, as shown in Fig. 13. Sender information and recipient information both comprise an address, a name, and a telephone number. This information not only plays a significant role in intelligent sorting in distribution centers but also allows customer information to be extracted quickly in end distribution.
We first use a text detector to locate the key information to be recognized on the express sheet images, and then use our proposed text recognizer to predict the results.
We choose two difficult scenes to verify the performance of our model. In both, our proposed model is trained on our proposed Chinese dataset and tested on real express sheet images.
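The two-stage flow described above (detect regions, then recognize each cropped region) can be sketched as follows. The detector and recognizer are passed in as plain callables because their interfaces are not specified in this paper; `read_express_sheet`, `dummy_detect`, `dummy_recognize`, and the box format are hypothetical names and conventions used only for illustration.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Image = List[List[int]]          # row-major grayscale image

def crop(image: Image, box: Box) -> Image:
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]

def read_express_sheet(
    image: Image,
    detect: Callable[[Image], List[Box]],
    recognize: Callable[[Image], str],
) -> List[Tuple[Box, str]]:
    """Run the detector once, then recognize every cropped text region."""
    return [(box, recognize(crop(image, box))) for box in detect(image)]

# Stand-ins for the real models (assumptions, not the authors' code).
def dummy_detect(image: Image) -> List[Box]:
    return [(0, 0, 4, 2), (2, 1, 6, 3)]

def dummy_recognize(region: Image) -> str:
    return f"{len(region)}x{len(region[0])}"  # reports the crop size

sheet = [[0] * 8 for _ in range(4)]
for box, text in read_express_sheet(sheet, dummy_detect, dummy_recognize):
    print(box, text)
```

Keeping the two stages behind separate callables means either model can be replaced without touching the other.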

Curved express sheet recognition in end distribution
To test the recognition performance of our model on curved text, we select the express sheet image shown in Fig. 14; the left part is the express sheet image, and the right part shows the recognition results in red boxes. We first detect each region to be recognized with a text detector (the detected regions are marked with blue boxes) and then input each region to our proposed text recognizer. Figure 14 shows that our model performs well for curved text, as well as for multi-scale, long, and short text. Although the sender information is blurred and small, our proposed model still recognizes it accurately. The recognized text contains English characters, numbers, and Chinese characters. This finding shows that our model, with the rectification module, can handle curved express sheet images.

Blurred express sheet recognition in end distribution
We likewise select a blurred, low-quality express sheet image to test our model, shown in Fig. 15; the upper part is the express sheet image, and the lower part is the corresponding result. From Fig. 15, we can observe that our model misidentifies two characters, marked with small red boxes in the recognition part, because they are too blurred. Despite that, our proposed model still handles the remaining blurred characters, including English characters, numbers, and Chinese characters. Our experimental analysis shows that, unless a character is extremely blurred, our proposed model ensures recognition accuracy; this indicates that context modeling plays an important role in enhancing semantic information and the capability for long-range dependencies.
Fig. 14 The sample of curved express sheet image and its recognition result. The program was run in Python. We mosaicked sensitive information such as name and telephone number. The left part is the express sheet image, in which each recognition region is detected in a blue box; the right part is the recognition result in a red box
Fig. 15 The sample of blurred express sheet image and its recognition result. We mosaicked sensitive information such as name and telephone number
Fig. 16 The raw logistics sheet image captured from camera in distribution center

Experiments on multi-challenging express sheet images from distribution center
At present, over 100 million express packages are handled every day in China. With the development of the e-commerce industry, the logistics industry still has a promising future, but artificial intelligence technology should be introduced to improve automatic sorting and package delivery. The logistics industry adds text recognition technology to the automatic sorting line, replacing manual labor to realize automatic distribution. To test the effectiveness of our proposed model, we select challenging images from the distribution center. The images from end distribution, such as those in Figs. 14 and 15, were collected with a personal phone, whereas all express sheet images in the distribution center are captured by a camera, as in Fig. 16, and we first crop the sheet region with our detector. Generally, we divided these images into the following six scenes:
Fig. 17 The recognition of regular image scene; the left side is the detected logistics sheet image, and the right side is the recognition result corresponding to the text area one-to-one

Regular scenes
Images of this kind are clearly visible, as shown in Fig. 17: the whole image is undamaged, and the text regions are very clear. To recognize these images, we first use a text detector to detect all the key information in the image; in Fig. 17, the area inside each blue box is a detected text region. We then crop each text region and send it to our model. In the recognition output shown in Fig. 17, the left side is the detected logistics sheet image, and the right side is the recognition result, which corresponds one-to-one to the text areas. Almost 100% accuracy can be achieved.

Images with black background
As shown in Fig. 18, this kind of image is collected by the camera in the distribution center. Because the lighting is relatively dim, the background of the text image is very dark. The recognized text regions include long and short text sequences as well as English and Chinese characters. Such images are hard to read with the naked eye, but our model gives almost entirely correct recognition results, indicating that it can adapt to changes in lighting and can handle some extreme backgrounds.

Blurred images
A logistics package shipment usually takes 2-3 days to complete, and the logistics sheets attached to the package are prone to friction, which blurs the strokes in the text area; an example is shown in Fig. 19. We use our proposed Chinese dataset, which contains text with blurred strokes, as the training dataset. Although the image is blurred, our model can handle this scene and still identify the key information, including the recipient's address, name, and phone number and the sender's address, name, and phone number. This shows that our proposed model effectively enhances the ability to carry semantic information and models the long-range dependencies between characters. Although the recipient's phone number is hard to distinguish, careful inspection confirms that our model predicts it correctly. After zooming in on the image, we can see that our model still achieves an accuracy close to 100% on the recipient's information. The sender region of this image is very small and has super low resolution, especially the sender's address, which is marked with the red box; only 60% of the sender's address information is recognized correctly. This indicates that our model still faces a challenge when the text image is too blurred and the text region is too small, that is, when the text is in a super low-resolution condition.
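Per-region rates such as the 60% quoted above can be computed in several ways, and the paper does not state which definition it uses. As one common option, the sketch below scores a prediction by one minus the edit distance divided by the ground-truth length; the function names and the example strings are assumptions, not data from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def char_accuracy(ground_truth: str, prediction: str) -> float:
    """1 - edit_distance / len(ground_truth), clipped at 0."""
    if not ground_truth:
        return 1.0 if not prediction else 0.0
    dist = edit_distance(ground_truth, prediction)
    return max(0.0, 1.0 - dist / len(ground_truth))

# Made-up strings: two of ten characters are substituted.
print(char_accuracy("ABCDEFGHIJ", "ABCDXFGHYJ"))
```

An edit-distance score tolerates inserted or dropped characters, which a position-by-position comparison would penalize along the rest of the string.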

Twisted images
The logistics images in the distribution center may be captured at any angle; when the package processing volume is too large, operator negligence easily leads to image distortion, which makes recognition very difficult. As shown in Fig. 21, the image is distorted and its overall resolution is low, so the entire sender information, marked with the red rectangular box, fails to be recognized. Even after zooming in on this image, the human eye cannot distinguish the sender regions, because the image quality is too poor, the sender text area is too small, and the characters are connected. In the recipient area, which is also distorted, our model misidentifies only two characters, each marked with a small red box. From this analysis, we conclude that the recipient area is recognized better because it is larger than the sender area; if a text region is both blurred and small, it is very hard to recognize.

Degraded image
This is the most challenging type of logistics image. As shown in Fig. 22, the text regions are close to missing: the sender's text regions are missing, the recipient's region is in a super low-resolution state, and the overall recognition rate is only 50%. At present, such images are regarded as nonstandard images in actual operation, because they are difficult to recognize and handle.

Comparisons of metrics between our proposed datasets with public datasets
We divide these public datasets into irregular text datasets and regular text datasets, following [42,44,45]. To better compare our proposed dataset with these public datasets, we carefully analyzed the distribution of each dataset and randomly selected 1200 images from our dataset for analysis. For the regular datasets, we selected IC13, in which most of the images are relatively clear and few are curved; generally, most regular datasets are relatively easy to recognize. For the irregular datasets, we selected CUTE80, IC15, and SVTP for comparison.
To verify the rationality and validity of our proposed dataset, we quantitatively analyze the distribution of curved and blurred images in each dataset, because curved and blurred images are the most challenging cases in text recognition. Combining this distribution with the performance of our proposed method on each dataset, we then compare our dataset with the public datasets.
First of all, these public datasets contain only English characters. We need to recognize Chinese logistics sheet images, so we propose this Chinese dataset. As shown in Table 8, we define the following metrics for analysis. Total images is the number of images in each dataset. Curved images and blurred images are the numbers of curved and blurred images in each dataset, respectively. Curved and blurred images is the number of curved images that are also blurred. Blurred ratio is the ratio of blurred images to total images, and curved ratio is the ratio of curved images to total images. Accuracy on our proposed method is the accuracy that our proposed method achieves on the dataset.
First, we analyze the distribution of curved images. There are 113 curved images among our 1200 images, and our dataset's curved ratio is 9.00%. IC13, the regular dataset among the public datasets, has only four curved images; its curved ratio of 0.39% is the lowest, and it has the highest accuracy on our proposed method. CUTE80 has the highest curved ratio, with almost half of its images curved; however, its accuracy on our proposed method is second only to that of IC13, which is an interesting finding.
Second, we analyze the distribution of blurred images. There are 108 blurred images among our 1200 images. Interestingly, IC15, an irregular dataset, has the highest blurred ratio of all these datasets, and its accuracy on our proposed method is the lowest. Our analysis indicates that IC15 is the most challenging of the public datasets, because most of its images are blurred or low resolution. Although CUTE80 has the highest curved ratio, this has only a small effect on its accuracy on our proposed method. In general, our proposed dataset is similar to SVTP: the curved and blurred ratios are close, and the accuracy of our model on the two datasets is also similar. This prompted us to consider how to improve text recognition ability and which factors most strongly determine the performance of text recognizers. We draw the following conclusions. First, the blurred ratio is the key factor that deserves attention, because it affects accuracy more. Second, although nearly half of CUTE80's images are curved, its accuracy on our proposed method is the second best, which indicates that our rectification module can effectively rectify curved images into a horizontal state. In addition, artistic fonts are another difficulty that deserves attention.
Table 8 note: Bold emphasizes the accuracy values obtained on the different datasets so that they can be compared clearly.
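The ratio metrics defined for Table 8 are simple quotients, and a short sketch makes them easy to check. Only the counts quoted in the text are used (113 curved and 108 blurred images out of 1200); the function name is an assumption. Computed this way, 113/1200 is about 9.42% and 108/1200 is exactly 9.00%, so the ratios quoted in the text are worth checking against Table 8, which remains authoritative.

```python
def ratio_percent(count: int, total: int) -> float:
    """Share of `count` in `total`, expressed as a percentage."""
    return 100.0 * count / total

TOTAL_IMAGES = 1200   # images sampled from the proposed dataset
CURVED_IMAGES = 113   # count quoted in the text
BLURRED_IMAGES = 108  # count quoted in the text

print(f"curved ratio:  {ratio_percent(CURVED_IMAGES, TOTAL_IMAGES):.2f}%")
print(f"blurred ratio: {ratio_percent(BLURRED_IMAGES, TOTAL_IMAGES):.2f}%")
```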

Training speed analysis
In this section, we compare training speed. We train Aster and our proposed model on two NVIDIA V100 GPUs with 32 GB of memory each. We use SynText and Synth90K as training datasets, and the batch size is set to 512. As indicated in Table 9, Aster requires 80.5 Mb of storage, and its training speed is 0.53 s per batch. The training speed of our proposed model is approximately 0.22 s per batch, which is faster than that of Aster owing to the parallel computation in our proposed model. Although our proposed model requires more storage, it improves low-quality text recognition performance. Some self-attention models may require more storage to gain the capability needed to meet real requirements, as do some methods in the field of NLP. Moreover, we select images from CUTE80 to test the inference speed, as shown in Table 9. The inference speed of our proposed model is 10.1 ms per image, which is also better than that of Aster.
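The timings above translate directly into a speedup factor and throughput figures. The sketch below does only this arithmetic with the numbers quoted in the text; the variable names are assumptions, and no figure beyond those quoted values is used.

```python
BATCH_SIZE = 512             # images per training batch
ASTER_S_PER_BATCH = 0.53     # Aster training time per batch (s)
OURS_S_PER_BATCH = 0.22      # proposed model training time per batch (s)
OURS_MS_PER_IMAGE = 10.1     # proposed model inference time per image (ms)

speedup = ASTER_S_PER_BATCH / OURS_S_PER_BATCH
aster_train_ips = BATCH_SIZE / ASTER_S_PER_BATCH  # training images per second
ours_train_ips = BATCH_SIZE / OURS_S_PER_BATCH
ours_infer_ips = 1000.0 / OURS_MS_PER_IMAGE       # inference images per second

print(f"training speedup over Aster: {speedup:.2f}x")
print(f"training throughput, Aster:  {aster_train_ips:.0f} images/s")
print(f"training throughput, ours:   {ours_train_ips:.0f} images/s")
print(f"inference throughput, ours:  {ours_infer_ips:.1f} images/s")
```

Under these figures the proposed model trains roughly 2.4 times faster per batch than Aster and processes about 99 images per second at inference.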

Comparisons with state-of-the-art methods
The performance comparison between several strong previous algorithms and our proposed method on benchmark datasets is provided in Table 10. With the help of the rectification module, SCM, GCM, and the transformer-based decoder, our proposed text recognizer achieves higher recognition results on three datasets, namely, the two irregular datasets IC15 and SVTP and the regular dataset IC13. This finding indicates that the rectification module effectively rectifies the shape of irregular text. Moreover, we assert that context modeling plays an important role in recognizing low-resolution images, such as the images in SVTP. Meanwhile, the visual features extracted by the CNN are important complements to the global semantic information. Compared with the second-best results in Table 10, our method achieves 2%, 0.7%, and 3.1% improvement on the IC13, IC15, and SVTP datasets, respectively.

Conclusions
In this study, we assert that context modeling plays a significant role in a robust and efficient text recognizer. We propose a novel model that combines a CNN and a lightweight transformer-based decoder for text recognition. This model can be trained in parallel and exhibits the capability for global computation. We use the SCM to improve full-image information-carrying capability, while the GCM is integrated to enhance long-range dependencies. Moreover, a Chinese text dataset is proposed for Chinese character recognition, particularly in express sheet images. In addition, the function that constructs our proposed dataset may provide inspiration for other text recognition scenes, given that natural scenes are complex. The model performs well on five publicly available datasets, particularly in irregular and low-resolution cases. Inspired by the performance of language models in the field of NLP, we will add a language model for multi-modal learning to correct excessively blurred characters in the future.
Table 10 note: Bold marks the best accuracy values obtained by our method on IC13, IC15, and SVTP in comparison with previous methods.