Introduction

This paper introduces a new text recognition technique for the logistics industry. Logistics information guides package transportation and records the key details needed to contact customers. In the past 10 years, thanks to the improvement of internet technology, China's e-commerce industry has developed rapidly, which has also driven the growth of the logistics industry. The huge domestic purchase demand generates more than 100 million express packages every day, and the logistics industry serves all walks of life and promotes economic development. How can such a huge number of express packages be handled every day? In addition to various automation technologies, scene text recognition (STR) plays an important role: STR not only enables intelligent sorting in distribution centers but also allows customer information to be extracted quickly during end distribution, replacing a large amount of manual work and saving companies considerable cost as labor becomes increasingly expensive. In the distribution center, each express package obtains its destination code through STR and is then automatically conveyed to its concentration area, waiting to be transported to its destination. In our survey, although text recognition has been applied in many fields, such as banknote recognition, license plate recognition, and train ticket recognition, little research has addressed the logistics field, where images of express sheets captured by cameras are mostly curved, distorted, and of low resolution. We aim to improve low-quality text recognition for the logistics industry, especially for curved and low-resolution images.

Generally, an STR model consists of four parts [1]: the rectification, feature extraction, sequence, and decoder modules. A spatial transformer network (STN) [2] is utilized as the rectification module to reduce the burden on later stages. Feature extraction networks implemented with deep models, such as the residual neural network (ResNet) [3], have achieved remarkable progress. The sequence module typically adopts a recurrent neural network (RNN) [4] to model the order of the text. An RNN can process sequential data; however, it is limited by slow training and a lack of long-range dependencies. Long short-term memory (LSTM), a variant of the RNN, is the most common remedy for these limitations; nonetheless, it cannot be parallelized and is even slower than a plain RNN [5]. Hence, a self-attention module is introduced to realize parallel computing, given its capability to model long-range dependencies. Finally, the decoder module has two options: the connectionist temporal classification decoder [4] or the attention-based decoder [6]. Although the combination of these modules has pushed the development of STR, many challenges remain, such as irregular, severely curved, and low-resolution text images. The text in the express sheet image shown in Fig. 1 is extremely blurred, and some Chinese characters cannot be recognized because of motion blur introduced by the camera.

Fig. 1

Express sheet images from real scenes, with sensitive information mosaicked

To meet the huge demand of the logistics industry, two aspects of text recognition technology need urgent attention: recognition accuracy and recognition speed. Currently, recognition accuracy in the logistics industry still needs to be improved. The challenge comes from two sources: on one hand, the logistics sheet is prone to frictional deformation during transportation and thus becomes a low-quality image; on the other hand, the images captured by cameras tend to become blurred and low resolution after compression. To address this issue, we conduct research on both datasets and methods. In terms of datasets, Chinese datasets for real scene text recognition remain insufficient. Most text recognizers are trained on benchmark datasets and compared in terms of accuracy on six public evaluation databases; however, these public training datasets contain no Chinese characters. English has only 26 letters, whereas more than 5000 Chinese characters are commonly used in Chinese scenes. Moreover, the training images should be as close as possible to real scene images, that is, they should include blurred, missing-stroke, and motion-blurred images to meet performance requirements in Chinese scenes. Hence, we propose a novel Chinese dataset for recognizing low-quality images in Chinese scene text recognition. In terms of methods, to improve the recognition accuracy on complex logistics images, we adopt the transformer model to improve global computation and information-carrying capability. The transformer proposed by Vaswani et al. [7] uses self-attention in both its encoder and decoder to model the relationships between characters; the major advantages of transformer-based methods over other models are their global computation and excellent information-carrying capabilities, and in recent research, transformer-based models have been applied to computer vision tasks [8,9,10,11]. However, the transformer [7] stacks six identical layers in each of its encoder and decoder, which makes the model heavy and slow. To meet the speed requirement of 10 FPS (frames per second) in the logistics industry, we simplify the architecture of the transformer [7]: we remove the large encoder and keep only the decoder part, with only three identical layers. In contrast, [12, 13] use 12 identical layers in the encoder and 6 in the decoder for text recognition; although they achieve remarkable results on public datasets, they ignore the speed requirement of practical applications, and [12,13,14,15] all adopt the whole transformer network as the architecture of their methods. Without the transformer encoder, we add two small modules as a complement. First, we design a residual crisscross attention in the semantic context module (SCM) to capture full-image dependencies and generate rich semantic context information. We abandon the recurrent crisscross attention of [16] because a residual structure, similar to that in ResNet [3], can effectively alleviate feature degradation and establish long-range dependencies for the text in logistics sheet images, especially the address information, which often contains the province, city, county, and a detailed home address and thus forms a very long text sequence.
Second, we propose a small global context module (GCM) to improve global computation and information-carrying capability. To avoid missing characters and attention drift [14] when recognizing long text sequences in practical scenes, especially the large amount of address information on logistics sheets, we add learnable character weight information to the self-attention module of the GCM, which emphasizes the character regions in the text sequence and prevents the features from weakening during computation. Considering that transformer-based methods are not as powerful as CNNs at acquiring local information, we utilize a ResNet-based extractor as part of the proposed method in the current study.

The major contributions of this study can be summarized in four aspects:

  1.

    We construct a new Chinese dataset for text recognition. This dataset, described in detail in the section “Introduction of datasets”, focuses on recognizing low-quality Chinese text images in the logistics industry.

  2.

    We propose a novel end-to-end text recognition method that adds the idea of context modeling throughout the entire network, and adopts a lightweight transformer-based decoder to accelerate training speed. The proposed method achieves competitive results on several benchmark datasets due to its information-carrying capability.

  3.

    To enhance semantic information and obtain richer semantic context features, we propose a semantic context module (SCM) with a residual structure to capture full-image context information.

  4.

    Replacing the RNN structure, we propose a GCM designed to grasp long-range dependencies between pixels and capture essential information for global understanding. To avoid missing characters, learnable position information is added to the self-attention module, which emphasizes the text regions. The GCM connects the SCM and the transformer-based decoder, transmitting rich global context information to the next stage.

Related works

Current text recognition technology faces three main difficulties: the large number of curved text images in natural scenes, the large number of low-resolution text images, and the requirement for faster recognition speed in practical applications. Many related works have focused on these challenges.

For curved text in natural scenes, feeding the entire text area directly to a recognizer yields poor results because of the large number of invalid regions. Shi et al. [17] proposed an automatic rectification module to handle curved text, in which the image is transformed into a regular, more readable image through a spatial transformation network; this transformation can rectify different types of irregular text and improves the recognition of curved text. To reduce the complexity of the rectification module, Shi et al. [18] further proposed a new rectification network, a thin-plate spline transformation based on an attention mechanism, for recognizing irregular text; the model directly learns the control points of the curved text and then straightens it into a horizontal state. In contrast to [17, 18], which rectify the text regions only once, Zhan et al. [19] proposed an iterative rectification framework that rectifies the text regions repeatedly to improve recognition performance. Although these methods made large contributions to curved text recognition, they do not address low-quality text images.

To address low-quality text recognition, Zhang et al. [20] proposed a sequence-to-sequence domain adaptation network for recognizing blurred text images, in which gated attention similarity units adapt the distribution of attentional information to the corresponding sequence data and thus acquire better features. Given that scene text methods seldom combine visual and semantic information, and most rarely use semantic information to assist recognition, Wan et al. [21] combined attention and segmentation mechanisms to obtain better results: an attention-based model performs weakly on low-resolution images, whereas a segmentation model makes better use of visual features. Yu et al. [22] proposed a semantic reasoning network (SRN) to capture abundant semantic text information, which effectively combines visual and semantic context information; the method segments characters one by one and then splices them into a horizontal state, achieving better performance on public datasets. Inspired by language models, which can effectively supervise word sequence generation, Qiao et al. [23] added a pretrained language model to supervise the encoding and decoding procedures, using a pretrained word embedding model to supervise the training of the predicted semantic information; this scheme is well suited to low-resolution text recognition. Zhang et al. [24] used automatic search to design a text recognition framework; inspired by neural architecture search, the base framework can be adjusted for different datasets. These models integrate multiple modules and tend to be large in order to handle low-resolution text; meanwhile, considering the speed requirement of text recognition, some studies have also focused on recognition speed, as follows.

In terms of recognition speed, Liu et al. [25] found that text features can be encoded into a binary format without loss of semantic information; real-time inference can be achieved through binary compression with very small memory consumption, but although the speed is excellent, the recognition performance on benchmark datasets is not ideal. Because transformer-based methods offer parallel computation in the field of natural language processing, Zhu et al. [15] directly combined a heavy backbone with the whole transformer network as a new text recognizer and proposed a hierarchical attention mechanism comprising four self-attention blocks in the encoder to describe context information; this is also a heavy model. To make text recognition more suitable for real-world applications, Li et al. [26] proposed a transformer-based model in which a self-attention mechanism is applied to text recognition; the model uses locality-sensitive hashing instead of softmax to compress the model. With fewer parameters, the small model achieves better speed, but its recognition performance is not ideal compared with other text recognizers, and it is not effective for low-resolution text. Recently, based on the transformer network, Lee et al. [12] proposed an adaptive 2D positional encoding to strengthen feature extraction and handle irregular images, achieving remarkable results on public datasets, especially irregular ones; however, the method uses 12 self-attention layers in its encoder, whereas the transformer [7] uses only 6. To enhance the feed-forward network (FFN), Kim et al. [13] modified the encoder of the transformer by replacing the FFN with a squeeze-and-excitation FFN; however, they also adopt the overall transformer framework, and the inference speed still fails to meet the requirements of logistics scenes. In general, with self-attention modules, transformer-based methods achieve state-of-the-art text recognition compared with simple CNN-based recognizers; however, the transformer is also a large model, so speed must be considered in logistics scenes. Text recognition is increasingly tied to real industrial applications: Ren et al. [14] applied the transformer to the shopping receipt scene and, to address attention drift in the transformer network, proposed a transformer-based decoupled attention network that decouples the prediction processes in the attention mechanism. Unfortunately, none of these models addresses Chinese scenes, especially the complex and challenging logistics scene. Moreover, transformer-based recognizers can handle some challenging low-quality text images, but their speed cannot meet practical requirements, and they have not attempted to simplify the whole transformer network to accelerate recognition. Therefore, in this study, we propose a text recognizer that simplifies the transformer network and adds two small modules as a complement, further enhancing context modeling and global long-range dependencies for low-quality text images.

Introduction of datasets

To recognize express sheet information better, we construct a large-scale dataset that covers more than 5000 commonly used Chinese characters. The corpus contains Chinese address information of most cities and counties in 32 provinces of China, collected from the business system of a Chinese express company. The dataset, which contains 5 million images, is randomly divided into two parts: 50,000 images form the fixed testing set and the remaining images form the fixed training set (Table 1). We construct the dataset with several functions, such as blur, image angle transformation, motion blur, and missing strokes. These functions are described below.

Table 1 Statistics of the proposed Chinese dataset (CD)

Details of these functions

Blurred images

Blurred images exist widely in real scenes, and many words are difficult to identify. We use different kernels to blur images randomly, and an example is provided in Fig. 2a.
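As an illustration, the following sketch (assuming OpenCV; the kernel sizes are our own assumption, since the paper does not list them) applies a Gaussian blur with a randomly chosen kernel:

```python
import cv2
import numpy as np

def random_blur(image: np.ndarray) -> np.ndarray:
    """Blur an image with a randomly chosen odd Gaussian kernel (illustrative only)."""
    k = int(np.random.choice([3, 5, 7]))       # hypothetical kernel sizes
    return cv2.GaussianBlur(image, (k, k), 0)  # sigma derived from the kernel size
```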

Fig. 2

Examples of our proposed Chinese text images

Image angle transformation

Image angle transformation consists of three algorithms, namely, random angle transformation, affine transformation, and perspective transformation, with angles ranging from 15° to 45°. A corresponding example is presented in Fig. 2b.
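The sketch below illustrates how such transformations could be implemented with OpenCV; the 15°–45° rotation range follows the text, while the corner jitter used for the perspective warp is an assumption:

```python
import cv2
import numpy as np

def random_rotate(image: np.ndarray) -> np.ndarray:
    """Rotate by a random angle in the 15-45 degree range described above."""
    h, w = image.shape[:2]
    angle = np.random.uniform(15, 45)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, M, (w, h), borderValue=(255, 255, 255))

def random_perspective(image: np.ndarray, jitter: float = 0.1) -> np.ndarray:
    """Perspective transformation with randomly jittered corners (jitter is an assumption)."""
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    offset = np.random.uniform(-jitter, jitter, (4, 2)) * [w, h]
    dst = (src + offset).astype(np.float32)
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, M, (w, h), borderValue=(255, 255, 255))
```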

Motion blur

In a real scene, the quality of pictures captured by cameras can be degraded by motion blur. The purpose of this function is to blur an image in the left and right directions: the middle column of a kernel is numerically processed and the result is used as the convolution kernel. The effect is illustrated in Fig. 2c.
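A minimal sketch of this operation is shown below; the kernel size is an assumption, and one line of the kernel is filled with normalized weights (here producing a left-right smear) before being convolved with the image:

```python
import cv2
import numpy as np

def motion_blur(image: np.ndarray, size: int = 9) -> np.ndarray:
    """Left-right motion blur: one line of the kernel holds normalized weights
    and the image is convolved with it (kernel size is illustrative)."""
    kernel = np.zeros((size, size), dtype=np.float32)
    kernel[size // 2, :] = 1.0 / size   # averaging along the horizontal direction
    return cv2.filter2D(image, -1, kernel)
```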

Missing strokes

Many cases of missing strokes occur in real scenes because the sheets are easily rubbed. This poses huge challenges to text recognition, particularly with more than 5000 commonly used Chinese characters. In the current study, we randomly replace pixels within the text area with the background color, which is typically white, so that strokes appear partially erased. The effect is depicted in Fig. 2d. The detailed procedure for missing strokes is summarized in Algorithm 1.

Algorithm 1
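Since the exact pseudocode of Algorithm 1 is not reproduced here, the following sketch conveys the idea under our own assumptions about the threshold and drop ratio: a random subset of dark (stroke) pixels inside the text area is overwritten with the white background color.

```python
import numpy as np

def missing_strokes(image: np.ndarray, drop_ratio: float = 0.15,
                    bg_color: int = 255, stroke_threshold: int = 128) -> np.ndarray:
    """Erase a random fraction of stroke pixels to simulate missing strokes.
    drop_ratio and stroke_threshold are illustrative assumptions."""
    out = image.copy()
    gray = out if out.ndim == 2 else out.mean(axis=2)
    ys, xs = np.where(gray < stroke_threshold)      # candidate stroke pixels
    erase = np.random.rand(len(ys)) < drop_ratio    # subset to overwrite
    out[ys[erase], xs[erase]] = bg_color
    return out
```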

Approaches

The proposed model is divided into five parts: the rectification network, the CNN extractor, the SCM, the GCM, and the transformer-based decoding module. To overcome the influence of irregular images, we first resize the images to \(32\times 100\); the image is then resized to \(32\times 64\) to find the control points in the rectification module. These points are used to make the text line horizontal. A CNN-based module is then utilized to extract 2D features instead of the 1D features typically extracted by traditional recognizers. The output of the last layer of the CNN module is \(4\times 25\), where 25 is the width of the feature map; this width preserves more horizontal pixel information and benefits long text image recognition. We adopt the SCM to obtain richer semantic information and generate abundant semantic context features. The GCM is added after the SCM to build long-range dependencies, serving as the sequence context modeling stage. Finally, we adopt a lightweight transformer decoding part with N = 3 blocks, and each block includes a masked multi-head attention mechanism, a multi-head attention mechanism, a feed-forward network (FFN), and layer normalization. The overall framework of the entire network is illustrated in Fig. 3.

Fig. 3

The overall framework of our model
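To make the data flow concrete, the following high-level sketch strings the five stages together; every sub-module name here is a placeholder for the components described in the subsections below, not the authors' actual implementation.

```python
import torch.nn as nn

class LogisticsRecognizer(nn.Module):
    """Illustrative five-stage pipeline: rectifier -> ResNet extractor -> SCM -> GCM
    -> lightweight transformer decoder (N = 3 blocks)."""
    def __init__(self, rectifier, backbone, scm, gcm, decoder):
        super().__init__()
        self.rectifier, self.backbone = rectifier, backbone
        self.scm, self.gcm, self.decoder = scm, gcm, decoder

    def forward(self, images, targets=None):
        x = self.rectifier(images)           # straighten curved text lines
        feats = self.backbone(x)             # 2-D feature map (height 4, width 25)
        feats = self.scm(feats)              # semantic context (residual criss-cross attention)
        feats = self.gcm(feats)              # global context / sequence modeling
        return self.decoder(feats, targets)  # character predictions
```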

Rectification network

The rectification network follows the framework in [2], which is divided into three parts: a localization network, a grid generator, and a sample generator. The localization network detects the control points in the image and outputs their locations. The grid generator calculates the mapping relationship of each point that corresponds to the control points and generates the coordinate positions of the points \(\left\{P_1,\dots,P_n\right\}\). The sample generator samples the point positions calculated by the grid generator to generate the rectified image. Finally, the original image is rescaled to a fixed size.

Backbone network and feature enhancement module

The recognition network is based on Yu et al. [22], which used ResNet50 [28] to extract text image features. The output shape of every layer is provided in Table 2. The shape of the last layer of the extraction module is \(4\times 25\times C\), where 4 and 25 are the height and width of the output feature map, respectively, and C is the number of channels.

Table 2 Structure of CNN blocks

To obtain richer semantic information, we first propose the feature enhancement module to fuse high- and low-level semantic information. The features of layers 3, 4, and 5 extracted by the CNN network are fused into one feature map.

SCM

The SCM is designed as a residual crisscross attention module, shown in Fig. 4, and it contains two convolution modules that enhance feature extraction. We abandon the recurrent crisscross attention of [16] because a residual structure, similar to that in ResNet [3], can effectively alleviate feature degradation and establish long-range dependencies for the text in logistics sheet images, especially the address information, which often contains the province, city, county, and a detailed home address and thus forms a very long text sequence.

Fig. 4

The semantic context module. CCA denotes the crisscross attention

The crisscross attention [16], originally proposed for semantic segmentation, captures context information in an efficient manner, as presented in Fig. 5.

Fig. 5

The crisscross attention module

Within the crisscross attention, the input X is first projected to produce Q and K, where \(\{Q,K\}\in {R}^{C'\times W\times H}\) and \(C'\) is the number of channels of the feature map. Then, attention map A is generated by an affinity operation, which is expressed as follows:

$$ d_{i,j} = Q_{j} K_{i,j}^{T} , $$
(1)

where j represents each position of Q, and \({Q}_{j}\in {R}^{C'}\); \({K}_{i,j}\) is the i-th element of \({K}_{j}\); and \({d}_{i,j}\in D\) is the degree of correlation between \({Q}_{j}\) and \({K}_{i,j}\). Softmax is then applied to D to compute attention map A.

Meanwhile, V is generated from X by the same operation as Q and K, and context information is enriched through an aggregation operation. However, a single crisscross attention can only capture horizontal and vertical context information. To address this issue, we propose a residual crisscross attention module that captures full-image context information to better distinguish between text and non-text areas, particularly for blurred and low-resolution images. The SCM receives X as input and outputs feature map Xʹʹ, which is computed as follows:

$$ X^{\prime\prime} = CCA(X \oplus CCA(X)) \oplus X, $$
(2)

where CCA (\(\bullet \)) denotes the crisscross attention function; and \(\oplus \) denotes element-wise addition.
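Expressed in code, Eq. (2) is a simple wrapper around a criss-cross attention block; the sketch below assumes such a block (`cca`) is available (e.g., from the implementation of [16]) rather than re-deriving it:

```python
import torch
import torch.nn as nn

class ResidualCrissCross(nn.Module):
    """Residual criss-cross attention of Eq. (2): X'' = CCA(X + CCA(X)) + X."""
    def __init__(self, cca: nn.Module):
        super().__init__()
        self.cca = cca  # any criss-cross attention module with matching in/out shape

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.cca(x + self.cca(x)) + x
```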

GCM

The GCM includes a context modeling unit, a self-attention part, and a position encoding module, as shown in Fig. 6. Inspired by [30], we adopt the global context block as the context modeling unit, which can model long-range dependencies. We utilize element-wise addition as the connection, which broadcasts the attention values during addition and focuses the model on crucial clues about unique character features. To handle long text sequences and avoid attention drift, we add learnable character weight information to the self-attention module of the GCM, as follows.

Fig. 6

Framework of Global Context Module. \(\oplus \) denotes element-wise addition

After the global context block, we compress the height of the feature map to one and obtain a feature sequence that represents the text sequence. Traditional image algorithms binarize the image into a feature map whose pixel values lie in the [0,1] interval; the closer the value \({X}_{i,j}\) is to 1, the higher the probability that the pixel belongs to text, where \({X}_{i,j}\) denotes the pixel value in the binarized map.

As shown in Fig. 7, we use a smoother function \(f\left(x\right)\) for binarization, whose values lie in the [0,1] interval. Because the function is differentiable, the network parameters can be optimized during training: a better threshold is learned, so the feature map is dynamically converted into a binarized map in which the text regions stand out more clearly. The function is as follows:

$$ f(X) = \frac{1}{{1 + e^{{ - k \cdot X_{i,j} }} }}, $$
(3)

where \({X}_{i,j}\) denotes the pixel value in feature map after global context block.

After Eq. (3), we obtain the binarized feature map \(\Theta =({\theta }_{1},\dots ,{\theta }_{L})\), where L represents the width of the feature map; the closer a value in \(\Theta \) is to 1, the higher the probability that the pixel belongs to text, so the binarized feature map plays the role of positional weights. Inside the self-attention module, we obtain the text sequence map \(h= \left({h}_{1},\dots ,{h}_{L}\right)\) after the self-attention operation; we then apply the learnable position information to feature map \(h\) as follows:

$$ H = h * \Theta , $$
(4)

where \(*\) represents the Hadamard product.
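Equations (3) and (4) amount to a steep, differentiable sigmoid followed by an element-wise product; a minimal sketch (using k = 80 as stated in Fig. 7) is:

```python
import torch

def binarize(feature: torch.Tensor, k: float = 80.0) -> torch.Tensor:
    """Eq. (3): differentiable binarization via a steep sigmoid (k = 80 per Fig. 7)."""
    return torch.sigmoid(k * feature)

def weight_by_position(h: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Eq. (4): Hadamard product of the self-attention output h and the weights theta."""
    return h * theta
```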

Fig. 7

Illustration of f(x), which is utilized for binarization; the closer the value is to 1, the higher the probability that the pixel belongs to text. We set k to 80

By assigning position weights to the self-attention mechanism, characters are prevented from disappearing during the computation and the character areas become more prominent; a binarized value close to 1 indicates that the position is a text region.

In contrast with the traditional RNN module, GCM can be trained in parallel and exhibits a strong long-range dependency capacity.

Decoder

Through the GCM, the acquired features carry better global information. Considering its good performance in the field of natural language processing (NLP), we directly select transformer-based blocks as the decoder instead of a traditional attention-based decoder. Compared with the fundamental block in [7], we simplify the decoder part of the transformer to only three identical layers, yielding a lightweight decoder. Apart from the multi-head attention and FFN, a masked multi-head attention is inserted to model dependencies between different decoding locations. In the decoder, residual connections [29] around each sublayer are also used before layer normalization, as shown in Fig. 8. In the current study, we use H = 8 parallel attention heads in the multi-head attention mechanisms. The output of each sublayer can be summarized as LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the multi-head attention or the FFN.
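The LayerNorm(x + Sublayer(x)) pattern described above can be sketched as follows (d_model = 512 is assumed, consistent with the positional-encoding settings later in this section):

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        return self.norm(x + sublayer(x))
```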

Fig. 8

The transformer-based decoder module

Components of fundamental blocks

Positional encoding

The transformer dispenses with the RNN, whose greatest advantage is its memory of data in a time series. Without position embeddings, swapping two character positions merely swaps the corresponding values in the attention map, so no sequence order information is retained. Hence, we introduce the following positional encoding [7] to add relative position information. The formulas are as follows:

$$ {\text{PE}}_{{({\text{pos}},2i)}} = \sin \left( {{\text{pos}}/10000^{{2i/d_{{{\text{model}}}} }} } \right), $$
(5)
$$ {\text{PE}}_{{({\text{pos}},2i + 1)}} = \cos \left( {{\text{pos}}/10000^{{2i/d_{{{\text{model}}}} }} } \right), $$
(6)

where pos is the position of the token in the sequence, \(d_{\text{model}} = 512\), and \(i = 0, 1, \ldots, 255\).
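A direct implementation of Eqs. (5) and (6) with these settings might look like the following sketch:

```python
import torch

def positional_encoding(max_len: int, d_model: int = 512) -> torch.Tensor:
    """Sinusoidal positional encoding of Eqs. (5)-(6); returns a (max_len, d_model) table."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions 2i
    div = torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos / div)   # odd dimensions
    return pe
```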

Masked multi-head attention

Masked multi-head attention is used to model dependencies between different decoding locations. It ensures that the prediction at a time step depends only on information from previous time steps: a mask prevents each position from attending to positions after it.
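In practice, this is typically realized with a lower-triangular (causal) mask that blocks attention to future positions, for example:

```python
import torch

def causal_mask(length: int) -> torch.Tensor:
    """Boolean mask in which position t may attend only to positions <= t."""
    return torch.tril(torch.ones(length, length)).bool()

# Usage sketch: blocked positions are set to -inf before the softmax.
# scores = scores.masked_fill(~causal_mask(scores.size(-1)), float("-inf"))
```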

Multi-head self-attention

Analogous to the simultaneous use of multiple filters in a CNN, multi-head attention helps the network capture richer feature information and pay attention to different aspects of the input. Multi-head attention can be summarized as [7]:

$$ {\text{MHA}}(Q,K,V) = \left[ {\text{head}}_{1}, \ldots, {\text{head}}_{h} \right] W^{O} \in {\mathbb{R}}^{\lambda \times d} , $$
(7)

where \({\mathrm{head}}_{i}=\mathrm{Attention}\left(Q{W}_{i}^{q}, K{W}_{i}^{k}, V{W}_{i}^{v}\right)\in {\mathbb{R}}^{\lambda \times \frac{d}{H}}\), and \(\mathrm{MHA}\left(\bullet \right)\) refers to the multi-head attention operation. In addition, \({W}_{i}^{q}\in {\mathbb{R}}^{d\times \frac{d}{H}}\), \({W}_{i}^{k}\in {\mathbb{R}}^{d\times \frac{d}{H}}\), \({W}_{i}^{v}\in {\mathbb{R}}^{d\times \frac{d}{H}}\), and \({W}^{O}\in {\mathbb{R}}^{d\times d}\). H refers to the number of attention heads; in the current study, we use H = 8.
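For reference, PyTorch's built-in multi-head attention can stand in for Eq. (7) with the settings used here (d = 512, H = 8); this is only a stand-in, not the authors' exact module:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 25, 512)   # (batch, sequence length, d_model) dummy features
out, attn = mha(x, x, x)      # self-attention: query = key = value
```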

Attention is calculated using the scaled dot product, which is fast and reduces computational cost after scaling. The scaled dot product takes as input the queries \({q}_{i}\in {\mathbb{R}}^{d}\), \(i\in [1, \lambda ]\), and a set of key-value pairs of d-dimensional vectors \(\{({k}_{i},{v}_{i})\}_{i=1,2,...,N}\), where \(\lambda \) denotes the number of queries and N indicates the number of key-value pairs. Hence, the scaled dot product can be described as follows:

$$ {\text{Attention}}(q,K,V) = \sum_{i=1}^{N} \alpha_{i} v_{i} \in {\mathbb{R}}^{d} , $$
(8)
$$ \alpha = {\text{softmax}}\left( \frac{\langle q,k_{1} \rangle}{\sqrt d }, \frac{\langle q,k_{2} \rangle}{\sqrt d }, \ldots, \frac{\langle q,k_{N} \rangle}{\sqrt d } \right), $$
(9)

where \(\alpha \) is the attention weight, \(K=\left[{k}_{1},{k}_{2},...,{k}_{N}\right]\in {\mathbb{R}}^{N\times d}\), \(V=\left[{v}_{1},{v}_{2},...,{v}_{N}\right]\in {\mathbb{R}}^{N\times d}\), and \(Q=\left[{q}_{1},{q}_{2},...,{q}_{\lambda }\right]\in {\mathbb{R}}^{\lambda \times d}\).
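Equations (8) and (9) correspond to the standard scaled dot-product attention, sketched below:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Eqs. (8)-(9): q is (lambda, d), k and v are (N, d); returns (lambda, d)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # <q, k_i> / sqrt(d)
    alpha = F.softmax(scores, dim=-1)             # attention weights (Eq. 9)
    return alpha @ v                              # weighted sum of values (Eq. 8)
```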

Loss function

Finally, the cross-entropy loss function is adopted as the objective function of our model:

$$ L = - \frac{1}{N}\left( {\sum\limits_{i = 1}^{N} {y_{i} \log (s_{i} )} } \right), $$
(10)

where i refers to the index of the decoding position, y denotes the ground truth label, and s denotes the recognition result.
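Eq. (10) is the averaged cross-entropy over decoding positions; in PyTorch it reduces to a single call, assuming logits of shape (N, num_classes) and integer targets of shape (N,):

```python
import torch
import torch.nn.functional as F

def recognition_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Eq. (10): mean cross-entropy over the N decoding positions
    (the softmax is applied internally by F.cross_entropy)."""
    return F.cross_entropy(logits, targets)
```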

Datasets and experimental details

Commonly used evaluation datasets

To evaluate the robustness and effectiveness of the proposed model, we analyze its performance on the following six public databases:

IIIT5K [31], which includes 3000 images for testing; the images are street view and digital images.

SVT [32], which includes 647 evaluation images collected from Google Street View, many of which are blurred and of low resolution.

IC13 [33], some of whose images are inherited from IC03 [34]; we remove non-alphanumeric images and evaluate without a lexicon.

IC15 [35], collected with Google Glass, is one of the most challenging datasets of recent years. Most images exhibit varying levels of distortion and blurring.

SVT-P [36], which contains 645 cropped images from Google Street View with various perspective distortions; it is used for evaluation.

CUTE80 [27], which includes 288 cropped images for evaluation. Most of the images have complex backgrounds, perspective distortion, and low resolution.

Data augmentation

To increase image diversity and simulate low-quality conditions, we apply several operations to the original images, including random angle, affine, and perspective transformations, to realize image rotation and scaling. We also use two methods, namely Gaussian blur and motion blur, to add noise to and blur the original images.

Implementation details

Our method is trained on SynText [37] and Synth90K [38], with approximately 12 million images as the training set. The experiments are conducted on an NVIDIA V100 GPU with 32 GB of memory. We use Adam [39] as the optimizer with a learning rate of 0.001, the final decay is 0.0005, and the batch size is 512.

Ablation study

Effectiveness of the rectification module

Considering that real scenes contain many irregular images, we select a representative dataset, CUTE80, as the testing set to verify the effect of the rectification module. As shown in Fig. 9, examples 1–6 are effectively rectified into a horizontal state; for instance, the word “football” in example 6 is almost completely straightened by the rectification module.

Fig. 9

The effectiveness of the rectification module in irregular images which are from CUTE80

To verify the role of the rectification module in the recognition capability of the overall network, we compare the model with and without the rectification module. The performance comparison results are presented in Table 3. After adding the rectification module, accuracy improves by 0.7%, 0.5%, 1%, 1.3%, and 1.4% on the SVT, IC13, IC15, SVTP, and CUTE80 datasets, respectively. These data show that the rectification module has a significant effect on the recognition of curved and irregular text.

Table 3 Comparing the performance of our model among several benchmarks with or without rectification module

We also perform an experiment to observe the features of the first layer in the CNN extractor module. The heat map likewise demonstrates that the rectification module has an evident effect on irregular images. An example is presented in Fig. 10; the raw image is Example 6 in Fig. 9.

Fig. 10

The partial channel characteristic heat map generated after the second layer in CNN extractor

Effectiveness of the SCM

To verify the role of the proposed SCM, we compare two models formed by adding and removing the SCM. We select SVT, SVTP, and CUTE80 as the testing sets: SVT contains many severely blurred images, SVTP has many low-resolution images, and CUTE80 contains many irregular images, some of which are also of low resolution. The experiments show that adding the SCM effectively improves the recognition of low-quality images, achieving 0.4%, 3.1%, and 3.5% improvements on the SVT, SVTP, and CUTE80 datasets, respectively, as indicated in Table 4. The results demonstrate that the SCM enriches semantic information and improves the performance of the entire network.

Table 4 Comparing performance of our model among several benchmarks with or without SCM

We select some challenging images to evaluate the effect of the SCM. The feature heat maps generated with and without the SCM are shown in Fig. 11. Both heat maps are overlaid on the feature maps after the first layer in the CNN extractor to show text information. Evidently, with the SCM, the attention features are more strongly concentrated on the text area; the images are clearer and carry more abundant text information than without the SCM, particularly when the image is irregular and blurred, such as “CLUB” in Example 1 (Fig. 11a) and “CUISINE” in Example 3 (Fig. 11c). This indicates that the SCM helps distinguish between text and non-text areas.

Fig. 11

Feature heat maps generated with SCM and without SCM

Effectiveness of the GCM

To verify the effectiveness of the GCM in sequence context modeling, we examine the effect of adding and removing the GCM, as well as the number of fundamental blocks in the self-attention module. We first remove the GCM; the accuracies in Table 5 show that performance deteriorates when the GCM is removed. The data also show that the model reaches its optimal state when the number of blocks is one, with the multi-head hyperparameter H set to 8, which is the best choice when the training speed of the entire network is also considered. When the number of blocks is increased to three, the results are close to those with one block; hence, increasing the number of blocks makes little difference. The recognition rates improve by 2%, 0.4%, 1.5%, and 2.5% on the IC15, CUTE80, IC13, and SVTP datasets, respectively, compared with the model without the GCM. This finding implies that adopting the GCM for sequence context modeling improves the performance of the model and enhances its capability for long-range dependencies.

Table 5 Performance of GCM among various widely used benchmarks on different parameters

Influence of the number of fundamental blocks in the decoder

To evaluate the hyperparameters in the decoder part, we test the accuracy on IC13 and CUTE80. As indicated in Table 6, we set the number of blocks to 1, 3, and 6 and the number of heads H to 8 and 16. The model reaches its best state with three blocks and H = 8. Increasing the number of fundamental blocks further worsens the results and slows training, implying that more blocks are not necessarily better. This finding is consistent with the experimental results in [7], and a smaller number of blocks accelerates training.

Table 6 Performance of our model among various widely used benchmarks on different parameters in decoding part

Comparisons in our proposed dataset

Most text recognizers are trained on benchmark datasets, such as SynText, Synth90K, and SynthAdd, and compared in terms of accuracy on six public evaluation databases. However, these public training datasets contain no Chinese characters. Hence, we must utilize Chinese datasets for Chinese scenes, and we use our proposed dataset as the training set to compare the capabilities of text recognizers.

To demonstrate the performance of our proposed model on our proposed dataset, we compare it with Aster [18] and DAN [42]. Both were trained on the training set of our proposed dataset and tested on its testing set. The results are presented in Table 7, where Data-Aug denotes the data augmentation function.

Table 7 Comparison of several methods in our proposed dataset

As shown in Table 7, our proposed model outperforms Aster by 6.9% and DAN by 2.8% on the proposed dataset.

To explore the capability of our proposed model in recognizing real Chinese express sheet images, we compare it with DAN: both models are first trained on our proposed dataset with the same settings and then tested on real express sheet images, from which the recognized text images are cropped. As shown in Fig. 12, the left side shows the recognition results of DAN and the right side those of our model; DAN erroneously recognizes the characters marked in red, whereas our model gives the correct results marked in yellow. Our proposed method thus handles these challenging express sheet images better.

Fig. 12

Comparison between DAN and our model on real express sheet images; the left side shows the recognition results of DAN and the right side those of our proposed model

Recognition of real logistics sheet images

Each express package carries an express sheet, which includes important shipping and customer information, such as the sender information, recipient information, and shipping destination code, as shown in Fig. 13. Both the sender and recipient information include an address, a name, and a telephone number. This information not only enables intelligent sorting in distribution centers but also allows customer information to be extracted quickly during end distribution.

Fig. 13

A sample express sheet image in which sensitive information, such as names and telephone numbers, is mosaicked to prevent the disclosure of private customer information

We first use a text detector to locate the key information to be recognized on the express sheet images, and then use our proposed text recognizer to predict the results. We choose two difficult scenes to verify the performance of our model; the model was trained on our proposed Chinese dataset and tested on real express sheet images.

Curved express sheet recognition in end distribution

To test the recognition performance of our model on curved text, we selected an express sheet image, shown in Fig. 14; the left part is the express sheet image and the right part shows the recognition results in the red box. We first detect each region of interest with a text detector (the blue boxes) and then feed each region to our proposed recognizer. Figure 14 shows that our model performs well on curved text, as well as on multi-scale, long, and short text. Although the sender information is blurred and small, our model still recognizes it accurately. The recognized text contains English characters, numbers, and Chinese characters. This shows that our model, with its rectification module, can handle curved express sheet images.

Fig. 14

A sample curved express sheet image and its recognition result. The program was implemented in Python. Sensitive information such as names and telephone numbers is mosaicked. The left part is the express sheet image, with each detected region in a blue box; the right part shows the recognition results in the red box

Blurred express sheet recognition in end distribution

We likewise selected a blurred, low-quality express sheet image to test our model, shown in Fig. 15; the upper part is the express sheet image and the lower part is the corresponding result. Our model misidentifies two characters (the small red boxes in the recognition result) because they are too blurred. Despite this, it still handles the remaining blurred characters, including English characters, numbers, and Chinese characters. Our analysis shows that, unless a character is extremely blurred, the proposed model maintains recognition accuracy, indicating that context modeling plays an important role in enhancing semantic information and the capability for long-range dependencies.

Fig. 15

A sample blurred express sheet image and its recognition result. Sensitive information such as names and telephone numbers is mosaicked

Experiments on multi-challenging express sheet images from distribution center

At present, over 100 million express packages are handled every day in China. With the development of the e-commerce industry, the logistics industry still has a promising future, but artificial intelligence is needed to improve automatic sorting and delivery. The logistics industry integrates text recognition into the automatic sorting line, replacing manual labor to realize automatic distribution. To test the effectiveness of our proposed model, we select challenging images from the distribution center. The images from end distribution (e.g., Figs. 14, 15) were collected with a personal phone, whereas all express sheet images in the distribution center are captured by fixed cameras, as in Fig. 16; we first crop the sheet region with our detector. We divide these images into the following six scenes:

Fig. 16

A raw logistics sheet image captured by a camera in the distribution center

  1.

    Regular scenes

This kind of image, shown in Fig. 17, is clearly legible: the sheet is undamaged and the text regions are sharp. We first use the text detector to locate the key information (the blue boxes in Fig. 17), crop each text region, and send it to our model. In Fig. 17, the left side is the detected logistics sheet image and the right side shows the recognition results, in one-to-one correspondence with the text areas. Almost 100% accuracy is achieved.

Fig. 17

The recognition of regular image scene, the left side is the detected logistics sheet image, and the right side is the recognition result corresponding to the text area one-to-one

  2.

    Images with black background

As shown in Fig. 18, this kind of image is collected by the camera in the distribution center. Because the lighting is relatively dim, the background of the text image is very dark. The recognized text regions include long and short sequences with both English and Chinese characters. We can hardly distinguish this kind of image with our eyes, yet our model returns almost all correct results, indicating that it can adapt to lighting changes and handle some extreme backgrounds.

Fig. 18

The recognition of image with black background

  3.

    Blurred images

A logistics shipment usually takes 2–3 days, during which the sheet attached to the package is prone to friction, resulting in blurred strokes in the text area; an example is shown in Fig. 19. We use our proposed Chinese dataset, which contains stroke-blurred text, as the training set. Although the image is blurred, our model handles this scene and still identifies the key information, including the recipient's address, name, and phone number and the sender's address, name, and phone number. This shows that our model effectively enhances the ability to carry semantic information and models long-range dependencies between characters.

Fig. 19

The recognition of blurred image

  4.

    Dirty text images

Generally, images in the logistics industry are complex and diverse. In the image shown in Fig. 20, the text areas contain many stains and some regions are very blurred. Although the recipient's phone number is hard to distinguish, our model predicts it correctly. After zooming in, the accuracy on the recipient's information is still close to 100%. However, the sender region of this image is very small and of extremely low resolution, especially the sender's address marked inside the red box: only 60% of the sender's address is recognized correctly. This indicates that our model still struggles when the text is both too blurred and too small, that is, in an extremely low-resolution condition.

Fig. 20

The recognition of dirty and super blurred images with small text regions

  5.

    Twisted images

Logistics images in the distribution center may be collected at any angle; when the package volume is large, operator negligence easily produces distorted images, which greatly increases recognition difficulty. As shown in Fig. 21, the image is distorted and of low resolution overall, so the entire sender information fails to be recognized (marked with the red rectangular box). Even after zooming in, our eyes cannot distinguish the sender region, because the image quality is too poor, the sender text area is too small, and the strokes run together. In the recipient area, which is also distorted, our model misidentifies only two characters (marked with small red boxes). We conclude that the recipient area is larger than the sender area; when a text region is both blurred and small, it is very difficult to recognize.

Fig. 21

The recognition of twisted image

  6.

    Degraded image

This is the most challenging type of logistics image. As shown in Fig. 22, the text regions are close to missing: the sender's text regions are absent, the recipient's region is of extremely low resolution, and the overall recognition rate is only 50%. In practice, such images are treated as non-standard, because they are difficult to recognize and handle.

Fig. 22

The recognition of degraded image

Comparisons of metrics between our proposed datasets with public datasets

We divide the public datasets into irregular and regular text datasets, following [42, 44, 45]. To better compare our proposed dataset with these public datasets, we carefully analyze the distribution of each dataset and randomly select 1200 images from our dataset for analysis. For the regular dataset, we select IC13, most of whose images are relatively clear and which contains few curved images; in general, most regular datasets are relatively easy to recognize. For the irregular datasets, we select CUTE80, IC15, and SVTP for comparison.

To verify the rationality and validity of our proposed dataset, we quantitatively analyze the distribution of curved and blurred images in each dataset, since curved and blurred images pose the greatest challenges in text recognition. Combining this with the performance of each dataset on our proposed method, we then compare our dataset with the public datasets.

First of all, these public datasets contain only English characters, whereas we need to recognize Chinese logistics sheet images, which is why we propose this Chinese dataset. As shown in Table 8, we use the following metrics for analysis: total images is the number of images in each dataset; curved images is the number of curved images; blurred images is the number of blurred images; curved and blurred images counts the curved images that are also blurred; blurred ratio is the ratio of blurred images to total images; curved ratio is the ratio of curved images to total images; and accuracy on our proposed method is the accuracy achieved on that dataset by our method.

Table 8 Comparisons of metrics between our proposed datasets with public datasets

First, we analyze the distribution of curved images. There are 113 curved images among our 1200 sampled images, and the curved ratio of our dataset is 9.00%. IC13, the regular public dataset, has only four curved images; its curved ratio of 0.39% is the lowest, and it has the highest accuracy on our proposed method. CUTE80 has the highest curved ratio, with almost half of its images curved; nevertheless, its accuracy on our proposed method is second only to IC13, which is an interesting finding.

Second, we analyze the distribution of blurred images: there are 108 blurred images among our 1200 sampled images. Interestingly, IC15, an irregular dataset, has the highest blurred ratio of all these datasets, and its accuracy on our proposed method is the lowest; indeed, IC15 is the most challenging of the public datasets, as most of its images are blurred or of low resolution. Although CUTE80 has the highest curved ratio, this has little effect on its accuracy on our proposed method.

In general, our proposed dataset is similar to SVTP: the curved and blurred ratios are close, and the accuracies on our model are also similar. This prompted us to consider how to improve text recognition and which factors most determine recognizer performance. We draw the following conclusions. First, the blurred ratio is the key factor to watch, because it affects accuracy the most. Although nearly half of CUTE80's images are curved, its accuracy on our proposed method is the second best, indicating that our rectification module effectively rectifies curved images into a horizontal state. In addition, artistic fonts are another difficult point that deserves attention.

Training speed analysis

In this section, we compare training speed. We train Aster and our proposed model on two NVIDIA V100 GPUs with 32 GB of memory each, using SynText and Synth90K as training datasets with a batch size of 512. As indicated in Table 9, Aster requires 80.5 MB of storage, and its training speed is 0.53 s per batch; our proposed model takes approximately 0.22 s per batch, which is faster than Aster owing to parallel computation. Although our proposed model requires more storage, it improves low-quality text recognition performance; some self-attention models may require more memory in exchange for more powerful capabilities, as in some NLP methods. Moreover, we select images from CUTE80 to test the inference speed, as shown in Table 9. The inference speed of our proposed model is 10.1 ms per image, which is also better than that of Aster.

Table 9 Comparisons of different models in model efficiency

Comparisons with state-of-the-art methods

Table 10 compares several previous excellent algorithms with our proposed method on the benchmark datasets. With the help of the rectification module, the SCM, the GCM, and the transformer-based decoder, our proposed text recognizer achieves higher recognition results on three datasets: the two irregular datasets IC15 and SVTP and the regular dataset IC13. This finding indicates that the rectification module effectively rectifies the shape of irregular text. Moreover, we assert that context modeling plays an important role in recognizing low-resolution images, such as those in SVTP. Meanwhile, the visual features extracted by the CNN are an important complement to the global semantic information, yielding 2%, 0.7%, and 3.1% improvements on the IC13, IC15, and SVTP datasets, respectively, compared with the second-best results in Table 10.

Table 10 Comparison of our model among various widely used benchmarks with some state-of-art methods

Conclusions

In this study, we assert that context modeling plays a significant role in a robust and efficient text recognizer. We propose a novel model that combines a CNN and a lightweight transformer-based decoder for text recognition; the model can be trained in parallel and exhibits strong global computation capability. We use the SCM to improve the full-image information-carrying capability, while the GCM is integrated to enhance long-range dependencies. Moreover, a Chinese text dataset is proposed for Chinese character recognition, particularly in express sheet images. In addition, the functions used to construct the dataset may provide inspiration for other text recognition scenes, given the complexity of natural scenes. The model performs well on five publicly available datasets, particularly in irregular and low-resolution cases. Inspired by the performance of language models in NLP, we will add a language model for multi-modal learning to correct extremely blurred characters in the future.