FCN-biLSTM Based VAT Invoice Recognition and Processing

Jiang, Fei; Chen, Huan; Zhang, Liang-Jie

doi:10.1007/978-3-319-94340-4_11

Conference paper
First Online: 16 June 2018

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10973)

Abstract

The Financial Sharing Centres of large and medium-sized enterprises handle a large number of VAT invoices every day, but these invoices are often processed manually, which is inefficient. Such centres need automated, unsupervised processing systems for VAT invoices to reduce costs and improve their financial management capability. In this paper, we develop FCN-biLSTM networks that process and recognize invoices automatically. Based on the characteristics of invoices, we propose methods that extract text lines using invoice layout information and text characteristics, which achieves higher accuracy. By combining previous text detection methods with an attention-based biLSTM sequence learning structure for text recognition, we develop an automatic VAT invoice recognition and processing system. The system has achieved impressive performance in actual enterprise projects.

1 Introduction

The Chinese domestic value-added tax (VAT) invoice is an important accounting and billing document that also serves as a corporate tax certificate, and it is widely used in dealings among enterprises. Its format is strictly controlled by the State Administration of Taxation. The Financial Sharing Centres of large and medium-sized enterprises need to handle a large number of VAT invoices every day, but these invoices are often processed manually, which is inefficient. Such centres need automated, unsupervised processing systems for VAT invoices to reduce costs and improve their financial management capability [1]. Some projects of this kind have already been built or are being planned. Ongoing enterprise ERP deployments provide a good infrastructure for them, and image processing technologies such as text detection and text recognition are approaching commercial feasibility. With some extra effort, automated VAT invoice image recognition and processing can be turned into reality.

Owing to the large variability of text patterns and highly complicated backgrounds, recognizing and processing photographed VAT invoice images is much more challenging than handling scanned ones. An overview of the network architecture is presented in Fig. 1. It consists of a number of convolutional layers; branches that use corner points of text bounding boxes, text segmentation maps, and layout information to regress the text box locations; an encoder that embeds proposals of varying sizes into fixed-length vectors; and an attention-based Long Short-Term Memory (LSTM) decoder for word recognition. Using this framework, we build and implement an automatic VAT invoice recognition and processing system.

We validate the effectiveness of our method on our accumulated VAT invoice image datasets in the enterprise financial management scenario. The results show the advantages of the proposed algorithm in accuracy and applicability.

The contributions of this paper are three-fold: (1) We propose a unified framework for processing and recognizing VAT invoices, which can be trained and evaluated end-to-end. (2) Our method simultaneously handles the challenges of multi-oriented text in VAT invoice images, such as rotation, varying aspect ratios, and closely spaced instances. (3) We take invoice layout information into consideration and use rules to regress and constrain the text bounding boxes.

2 Related Work

An automatic VAT invoice recognition and processing system essentially involves two tasks: text detection and word recognition. In this section, we briefly review related work on text detection, word recognition, and text spotting systems for VAT invoices that combine both.

Text detection algorithms have developed rapidly in recent years. They can be roughly classified into two categories: horizontal text detection and skewed text detection. For horizontal text detection, a number of approaches detect words directly in images using DNN-based techniques, in a manner similar to object detection. Tian et al. [2] develop a vertical anchor mechanism and propose the Connectionist Text Proposal Network (CTPN) to accurately localize text lines in images (ECCV 2016). The latest approaches to skewed text detection are SegLink [3] and the corner localization and region segmentation method of Lyu et al. [4]. SegLink [3] predicts text segments and their linkage in an SSD-style network and links the segments into text boxes in order to handle long oriented text in natural scenes. Lyu et al. [4] propose to detect scene text by localizing the corner points of text bounding boxes and segmenting text regions in relative positions.

Word recognition has not made much progress in the last two years. There are two main methods. The first, proposed by Shi et al. [5], is a neural network architecture that integrates feature extraction, sequence modeling and transcription into a unified framework. The second, presented by Lee et al. [6], uses recursive recurrent neural networks with attention modeling for lexicon-free optical character recognition in natural scene images.

Text spotting needs to handle both text detection and word recognition. Li et al. [7] propose a unified network that simultaneously localizes and recognizes text in a single forward pass, avoiding intermediate processes such as image cropping, feature re-calculation, word separation, and character grouping. For the specific application scenario of invoices, Xie [1] uses a range of traditional image processing techniques to develop an automatic invoice recognition and processing system.

3 Approach

3.1 Overall Architecture

The whole system architecture is illustrated in Fig. 1. It includes two parts: a text detection network (TDN) and a text recognition network (TRN). The text detection network aims to localize text in images and generate bounding boxes for words. The text recognition network recognizes the words inside the bounding boxes produced by the text detection network. Our model is motivated by recent progress in FPN [8], DSSD [9], instance-sensitive FCN models [10] and sequence-to-sequence learning [11, 12], and we also take the special characteristics of text and invoice layout information into consideration. In this section, we present a detailed description of the whole system.

3.2 Text Detection Network

The network of our method is a fully convolutional network (FCN) that performs feature extraction, corner detection, position-sensitive segmentation and fully convolutional segmentation. Inspired by the good performance of FPN [8] and DSSD [9], we adopt the backbone of the FPN/DSSD architecture to extract features. In detail, we convert fc6 and fc7 of VGG16 into convolutional layers and name them conv6 and conv7, respectively. Several extra convolutional layers (conv8, conv9, conv10, conv11) are then stacked above conv7 to enlarge the receptive fields of the extracted features. After that, a few deconvolution modules proposed in DSSD [9] are used in a top-down pathway (Fig. 2). In particular, to detect text of different sizes well, we cascade deconvolution modules with 256 channels from conv11 to conv3 (the features from conv10, conv9, conv8, conv7, conv4 and conv3 are reused), building 6 deconvolution modules in total. Including the features of conv11, we name the output features F3, F4, F7, F8, F9, F10 and F11 for convenience. Finally, the features extracted by conv11 and the deconvolution modules, which have richer representations, are used to detect corner points and predict position-sensitive maps. A large number of candidate bounding boxes can be generated by sampling and grouping corner points. Following [4], we score the candidate boxes by Rotated Position-Sensitive Average RoI Pooling and detect arbitrarily oriented text using position-sensitive segmentation maps.

Unlike the method in [4], which regresses text boxes or segments directly, we add a supplementary step that uses invoice layout information in the image (such as form lines, the red chop and the two-dimensional code), detected by an FCN architecture [13], to constrain the detected bounding boxes and to improve the accuracy and efficiency of text detection. Combined with the above method, we use non-maximum suppression (NMS) and a set of rules to filter out low-scoring candidate boxes and obtain the RoIs. The detected bounding boxes are merged via NMS according to their textness scores and fed into the Text Recognition Network (TRN) for text recognition.
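The score filtering and NMS step can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the paper: boxes are axis-aligned `(x1, y1, x2, y2)` tuples rather than rotated quadrilaterals, the thresholds `score_thresh` and `iou_thresh` are arbitrary illustrative values, and the layout-based rules are omitted.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.3):
    """Drop low-scoring boxes, then greedily keep the best non-overlapping ones.

    Both thresholds are assumed values chosen only for this illustration.
    """
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


# Hypothetical candidates: two near-duplicate text lines, one separate line,
# and one low-confidence box.
boxes = [(10, 10, 110, 30), (12, 11, 112, 31), (10, 50, 90, 70), (200, 10, 260, 30)]
scores = [0.90, 0.80, 0.75, 0.30]
kept = nms(boxes, scores)
print(kept)
```

In this toy input, the second box is suppressed as a near-duplicate of the first, and the last box is removed by the score threshold before NMS runs.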

3.3 Text Recognition Network

To process RoIs of different scales and aspect ratios in a unified way, most existing works re-sample regions into fixed-size feature maps via pooling [14]. However, for text, this approach may lead to significant distortion due to the large variation of word lengths. For example, it may be unreasonable to encode short words like “Dr” and long words like “congratulations” into feature maps of the same size. In this work, we propose to re-sample regions according to their respective aspect ratios, and then use RNNs to encode the resulting feature maps of different lengths into fixed length vectors. The whole region feature encoding process is illustrated in Fig. 3.

For an RoI of size $h \times w$, we perform spatial max-pooling with a resulting size of

$$ H \times \min(W_{\max}, 2Hw/h), $$

(1)

where the expected height $H$ is fixed and the width is adjusted to keep the aspect ratio as $2w/h$ (twice the original aspect ratio) unless it exceeds the maximum length $W_{\max}$. Note that here we employ a pooling window with an aspect ratio of 1:2, which benefits the recognition of narrow characters such as ‘i’ and ‘l’, as stated in [5].
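As a small worked example of Eq. (1), the sketch below computes the pooled size for a short and a long RoI. The values `H=4` and `W_max=35` are assumptions chosen for illustration, as is rounding the width to an integer of at least 1; the paper does not state these details.

```python
def pooled_size(h, w, H=4, W_max=35):
    """Output (height, width) of the varying-size RoI pooling in Eq. (1).

    H, W_max and the integer rounding are illustrative assumptions.
    """
    width = min(W_max, max(1, round(2 * H * w / h)))
    return H, width


short_roi = pooled_size(h=20, w=30)    # 2 * 4 * 30 / 20 = 12 columns
long_roi = pooled_size(h=20, w=300)    # 2 * 4 * 300 / 20 = 120, capped at W_max
print(short_roi, long_roi)
```

The short RoI keeps twice its aspect ratio, while the long one is clipped to the maximum width, so sequence length grows with word length only up to the cap.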

Next, the resampled feature maps are treated as a sequence and fed into RNNs for encoding. Here we use Long Short-Term Memory (LSTM) [11] instead of a vanilla RNN to overcome the problem of vanishing or exploding gradients. The feature maps after the above varying-size RoI pooling are denoted as $\mathbf{Q} \in \mathbb{R}^{C \times H \times W}$, where $W = \min(W_{\max}, 2Hw/h)$ is the number of columns and $C$ is the channel size. We flatten the features in each column and obtain a sequence $\mathbf{q}_1, \ldots, \mathbf{q}_W \in \mathbb{R}^{C \times H}$, which is fed into the LSTMs one element at a time. At each step, the LSTM units receive one column of features $\mathbf{q}_t$ and update their hidden state $\mathbf{h}_t$ by a non-linear function: $\mathbf{h}_t = f(\mathbf{q}_t, \mathbf{h}_{t-1})$. In this recurrent fashion, the final hidden state $\mathbf{h}_W$ (with size $R = 1024$) captures the holistic information of $\mathbf{Q}$ and is used as an RoI representation of fixed dimension.
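The column flattening and recurrent encoding can be sketched in plain Python. This is only an illustration of the data flow: the feature map is a nested list, and `toy_step` is a hypothetical stand-in for the LSTM update $f$, not an actual LSTM cell.

```python
import math


def columns_to_sequence(Q):
    """Flatten a C x H x W nested list into W column vectors of length C*H."""
    C, H, W = len(Q), len(Q[0]), len(Q[0][0])
    return [[Q[c][r][t] for c in range(C) for r in range(H)] for t in range(W)]


def encode(seq, step, h0):
    """Apply h_t = f(q_t, h_{t-1}) over the sequence; return the final state."""
    h = h0
    for q in seq:
        h = step(q, h)
    return h


def toy_step(q, h):
    # Hypothetical recurrence used only for illustration (not an LSTM).
    return [math.tanh(0.1 * sum(q) + 0.5 * x) for x in h]


# Toy feature map with C=2 channels, H=2 rows and W=3 columns.
Q = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
seq = columns_to_sequence(Q)
final_state = encode(seq, toy_step, [0.0] * 4)
print(seq, len(final_state))
```

Whatever the number of columns, the final state has the same length as the initial state, which is the property that lets RoIs of different widths share one fixed-size representation.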

Text recognition aims to predict the text in the detected bounding boxes based on the extracted region features. As shown in Fig. 4, we adopt LSTMs with attention mechanism [12, 15] to decode the sequential features into words.

First, the hidden states at all steps $\mathbf{h}_1, \ldots, \mathbf{h}_W$ from the region feature encoding (RFE) stage are fed into an additional LSTM encoder layer with 1024 units. We record the hidden state at each time step and form a sequence $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_W] \in \mathbb{R}^{R \times W}$. It contains local information at each time step and serves as the context for the attention model.

For the decoder LSTMs, the ground-truth word label is used as input during training. It can be regarded as a sequence of tokens $\mathbf{s} = \{s_0, s_1, \ldots, s_{T+1}\}$, where $s_0$ and $s_{T+1}$ represent the special tokens START and END, respectively. We feed the decoder LSTMs with $T + 2$ vectors $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{T+1}$, where $\mathbf{x}_0 = [\mathbf{v}_W; \text{Atten}(\mathbf{V}, \mathbf{0})]$ is the concatenation of the encoder’s last hidden state $\mathbf{v}_W$ and the attention output with the guidance set to zero, and $\mathbf{x}_i = [\psi(s_{i-1}); \text{Atten}(\mathbf{V}, \mathbf{h}'_{i-1})]$, for $i = 1, \ldots, T+1$, is made up of the embedding $\psi(\cdot)$ of the $(i-1)$-th token $s_{i-1}$ and the attention output guided by the hidden state of the decoder LSTMs at the previous time step, $\mathbf{h}'_{i-1}$. The embedding function $\psi(\cdot)$ is defined as a linear layer followed by a tanh non-linearity.

The attention function $\mathbf{c}_i = \text{Atten}(\mathbf{V}, \mathbf{h}'_i)$ is defined as follows:

$$ \begin{cases} \mathbf{g}_j = \tanh\left(W_v \mathbf{v}_j + W_h \mathbf{h}'_i\right), & j = 1, \ldots, W, \\ \boldsymbol{\alpha} = \text{softmax}\left(\mathbf{w}_g^{T} \cdot [\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_W]\right), \\ \mathbf{c}_i = \sum\nolimits_{j=1}^{W} \alpha_j \mathbf{v}_j \end{cases} $$

(2)

where $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_W]$ is the variable-length sequence of features to be attended, $\mathbf{h}'_i$ is the guidance vector, $W_v$ and $W_h$ are linear embedding weights to be learned, $\boldsymbol{\alpha}$ is the vector of attention weights of size $W$, and $\mathbf{c}_i$ is a weighted sum of the input features.
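The attention function in Eq. (2) can be traced with a small pure-Python sketch. The weight matrices, the vector `w_g` and the toy features below are arbitrary illustrative values, not learned parameters, and the helper names `matvec`, `softmax` and `atten` are introduced here only for this example.

```python
import math


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def atten(V, h, W_v, W_h, w_g):
    """Eq. (2): V is a list of W feature vectors v_j, h is the guidance h'_i."""
    Wh = matvec(W_h, h)
    scores = []
    for v in V:
        g = [math.tanh(a + b) for a, b in zip(matvec(W_v, v), Wh)]
        scores.append(sum(wg * gi for wg, gi in zip(w_g, g)))
    alpha = softmax(scores)
    dim = len(V[0])
    c = [sum(alpha[j] * V[j][d] for j in range(len(V))) for d in range(dim)]
    return c, alpha


# Toy inputs (assumed values): three 2-d features and zero guidance,
# as used for the first decoder input x_0.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
c, alpha = atten(V, [0.0, 0.0], identity, identity, [1.0, 1.0])
print(alpha, c)
```

The attention weights sum to one, and the third feature receives the largest weight here because its score under these toy parameters is the highest.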

At each time step $t = 0, 1, \ldots, T+1$, the decoder LSTMs compute their hidden state $\mathbf{h}'_t$ and output vector $\mathbf{y}_t$ as follows:

$$ \begin{cases} \mathbf{h}'_t = f(\mathbf{x}_t, \mathbf{h}'_{t-1}) \\ \mathbf{y}_t = \varphi(\mathbf{h}'_t) = \text{softmax}(W_o \mathbf{h}'_t) \end{cases} $$

(3)

where the LSTM [11] is used for the recurrence formula $f(\cdot)$, and $W_o$ linearly transforms the hidden states to the output space, which includes 26 case-insensitive characters, 10 digits, common standard Chinese characters, a token representing all punctuation marks such as “!” and “?”, and a special END token.

At test time, the token with the highest probability in the previous output $\mathbf{y}_t$ is selected as the input token at step $t+1$, instead of the ground-truth tokens $s_1, \ldots, s_T$. The process starts with the START token and is repeated until the special END token is produced.
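The test-time procedure amounts to greedy decoding, which can be sketched as below. The `step` callable is a hypothetical interface standing in for one decoder LSTM step plus softmax, `toy_step` is a fake decoder that emits a fixed string, and the `max_len` cap is an added safety assumption that the paper does not mention.

```python
def greedy_decode(step, start_token, end_token, max_len=50):
    """Feed back the most probable token until END is produced.

    step(prev_token, state) must return (probs, new_state), where probs
    maps each token to its probability. max_len is an assumed safety cap.
    """
    tokens, prev, state = [], start_token, None
    for _ in range(max_len):
        probs, state = step(prev, state)
        prev = max(probs, key=probs.get)  # token with the highest probability
        if prev == end_token:
            break
        tokens.append(prev)
    return tokens


def toy_step(prev_token, state):
    # Hypothetical decoder used only for illustration: it ignores prev_token
    # and emits a fixed token sequence, using the state as a position counter.
    target = ["1", "7", "%", "<END>"]
    t = 0 if state is None else state
    probs = {tok: 0.01 for tok in target}
    probs[target[t]] = 0.97
    return probs, t + 1


decoded = greedy_decode(toy_step, "<START>", "<END>")
print(decoded)
```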

3.4 Loss Functions and Training

As described above, during training our system takes as input an image, the word bounding boxes and their labels. For the final outputs of the whole system, we apply a multi-task loss for both detection and recognition:

$$ L = L_D + L_R $$

(4)

Our text detection network is trained on corner detection and position-sensitive segmentation simultaneously. The loss function is defined as:

$$ L_D = \frac{1}{N_c} L_{conf} + \frac{\lambda_1}{N_c} L_{loc} + \frac{\lambda_2}{N_s} L_{seg} $$

(5)

where $L_{conf}$ and $L_{loc}$ are the loss functions of the score branch, which predicts the confidence score, and the offset branch, which performs localization, in the corner point detection module. $L_{seg}$ is the loss function of position-sensitive segmentation. $N_c$ is the number of positive default boxes and $N_s$ is the number of pixels in the segmentation maps; they are used to normalize the losses of corner point detection and segmentation. $\lambda_1$ and $\lambda_2$ are the balancing factors of the three tasks. By default, we set $\lambda_1$ to 1 and $\lambda_2$ to 10.
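To make the weighting in Eq. (5) concrete, the sketch below evaluates the detection loss for made-up component values. Only the defaults for $\lambda_1$ and $\lambda_2$ come from the text; the raw loss sums and the counts are hypothetical numbers chosen so that the arithmetic is easy to follow.

```python
def detection_loss(l_conf, l_loc, l_seg, n_c, n_s, lambda1=1.0, lambda2=10.0):
    """Eq. (5): normalized, weighted sum of the three detection losses."""
    return l_conf / n_c + lambda1 * l_loc / n_c + lambda2 * l_seg / n_s


# Hypothetical values: summed losses over 4 positive default boxes
# and 1000 segmentation pixels.
l_d = detection_loss(l_conf=8.0, l_loc=4.0, l_seg=50.0, n_c=4, n_s=1000)
print(l_d)  # 8/4 + 1*4/4 + 10*50/1000
```

With these numbers the three terms contribute 2.0, 1.0 and 0.5, which shows how $\lambda_2 = 10$ compensates for the much larger normalizer of the segmentation term.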

We follow the text recognition strategy proposed by Lyu et al. [4]. The loss for training text recognition is:

$$ L_R = \frac{1}{N_c} \sum\nolimits_{i=1}^{N_c} L_{rec}(Y^{(i)}, s^{(i)}) $$

(6)

where $s^{(i)}$ denotes the ground-truth tokens for sample $i$ and $Y^{(i)} = \{\mathbf{y}_0^{(i)}, \mathbf{y}_1^{(i)}, \ldots, \mathbf{y}_{T+1}^{(i)}\}$ is the corresponding output sequence of the decoder LSTMs. $L_{rec}(Y, s) = -\sum\nolimits_{t=1}^{T+1} \log \mathbf{y}_t(s_t)$ denotes the cross-entropy loss on $\mathbf{y}_1, \ldots, \mathbf{y}_{T+1}$, where $\mathbf{y}_t(s_t)$ represents the predicted probability of the output being $s_t$ at time step $t$, and the loss on $\mathbf{y}_0$ is ignored.
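The recognition loss can be checked on a tiny example. In this sketch each decoder output $\mathbf{y}_t$ is represented as a dictionary from tokens to probabilities, which is a representation chosen here for readability; the probabilities themselves are invented for illustration.

```python
import math


def l_rec(Y, s):
    """Cross-entropy over steps 1..T+1; the output at step 0 is ignored."""
    return -sum(math.log(Y[t][s[t]]) for t in range(1, len(s)))


def recognition_loss(batch):
    """Eq. (6): average of l_rec over the (Y, s) pairs in the batch."""
    return sum(l_rec(Y, s) for Y, s in batch) / len(batch)


# One hypothetical sample whose label is the single character "A".
s = ["<START>", "A", "<END>"]
Y = [
    {"A": 0.9, "<END>": 0.1},    # y_0: not used by the loss
    {"A": 0.5, "<END>": 0.5},    # y_1 should predict "A"
    {"A": 0.25, "<END>": 0.75},  # y_2 should predict "<END>"
]
loss = recognition_loss([(Y, s)])
print(loss)  # -(log 0.5 + log 0.75)
```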

4 Experiments

In this section, we perform experiments to verify the effectiveness of the proposed method. We use the accumulated VAT invoice image datasets in the enterprise financial management scenario to evaluate the proposed method.

Our method is implemented by using TensorFlow r1.4.1. All the experiments are carried out on a workstation with an Intel Xeon 8-core CPU (2.10 GHz), 2 GeForce GTX 1080 Graphics Cards, and 64 GB RAM. Running on 1 GPUs in parallel, training a batch takes about 1 s. The whole training process takes less than a day.

Scanned invoices and photographed invoices, which correspond to different application scenarios, achieve different F-measures. Photographed invoices are easily affected by factors such as size, noise, blur, illumination, contrast and occlusion. One contribution of this work is the supplementary method that uses the invoice layout information in the image to improve the accuracy and efficiency of text detection. To validate its effectiveness, we compare the performance of the models “Ours FCN-biLSTM+NoLayout” and “Ours FCN-biLSTM+Layout”. The experiments show that the model with the layout constraint rules performs significantly better than the model without them. As shown in Tables 1 and 2, adopting the layout constraint rules (“Ours FCN-biLSTM+Layout”) instead of omitting them (“Ours FCN-biLSTM+NoLayout”) increases the F-measure by around 4%.
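For reference, the F-measure is the harmonic mean of precision and recall. The short sketch below computes it; the input values are purely illustrative and are not taken from Tables 1 and 2.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative percentages only, not results from this paper.
f = f_measure(90.0, 80.0)
print(round(f, 2))
```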

Table 1. Results on the scanned invoice image datasets. Precision (P) and Recall (R) at maximum F-measure (F) are reported in percentage.

Table 2. Results on the photo invoice image datasets. Precision (P) and Recall (R) at maximum F-measure (F) are reported in percentage.

5 Conclusion

In this paper, we have presented an automatic value-added tax (VAT) invoice recognition and processing system. In this system, text in VAT invoices can be detected and recognized efficiently and accurately in a single forward pass. Experimental results show that the proposed method achieves impressive performance in actual enterprise projects, and that the model with layout constraint rules performs significantly better than the model without them. One direction for future work is handling images of other bills and documents.

References

1. Xie, Z.G.: Researches on unsupervised image processing of VAT invoices. Master thesis, Shanghai Jiao Tong University, Shanghai, China (2015)
2. Tian, Z., Huang, W., He, T., He, P., Qiao, Yu.: Detecting text in natural image with connectionist text proposal network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 56–72. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_4
3. Shi, B., Bai, X., Belongie, S.: Detecting oriented text in natural images by linking segments. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
4. Lyu, P., Yao, C., Wu, W., et al.: Multi-oriented scene text detection via corner localization and region segmentation. Journal (2018)
5. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. CoRR, abs/1507.05717 (2015)
6. Lee, C.Y., Osindero, S.: Recursive recurrent nets with attention modeling for OCR in the wild. In: Computer Vision and Pattern Recognition, pp. 2231–2239. IEEE (2016)
7. Li, H., Wang, P., Shen, C.: Towards end-to-end text spotting with convolutional recurrent neural networks. Journal (2017)
8. Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
9. Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., Berg, A.C.: DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659 (2017)
10. Dai, J., He, K., Li, Y., Ren, S., Sun, J.: Instance-sensitive fully convolutional networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 534–549. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_32
11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
12. Shi, B., Wang, X., Lv, P., Yao, C., Bai, X.: Robust scene text recognition with automatic rectification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
13. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
14. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE Conference on Computer Vision (2015)
15. Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015)

Acknowledgement

This work is partially supported by the technical projects No. c1533411500138 and No. 2017YFB0802700.

Author information

Authors and Affiliations

National Engineering Research Center for Supporting Software of Enterprise Internet Services, Shenzhen, China
Fei Jiang, Huan Chen & Liang-Jie Zhang
Kingdee Research, Kingdee International Software Group Company Limited, Shenzhen, China
Fei Jiang, Huan Chen & Liang-Jie Zhang

Corresponding author

Correspondence to Fei Jiang.

Editor information

Editors and Affiliations

Shandong University, Jinan, China
Shijun Liu
Wageningen University, Wageningen, The Netherlands
Bedir Tekinerdogan
Nanzan University, Nagoya, Japan
Mikio Aoyama
Kingdee International Software Group CO., Ltd., Shenzhen, China
Liang-Jie Zhang

About this paper

Cite this paper

Jiang, F., Chen, H., Zhang, LJ. (2018). FCN-biLSTM Based VAT Invoice Recognition and Processing. In: Liu, S., Tekinerdogan, B., Aoyama, M., Zhang, LJ. (eds) Edge Computing – EDGE 2018. EDGE 2018. Lecture Notes in Computer Science, vol 10973. Springer, Cham. https://doi.org/10.1007/978-3-319-94340-4_11

DOI: https://doi.org/10.1007/978-3-319-94340-4_11
Published: 16 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94339-8
Online ISBN: 978-3-319-94340-4
eBook Packages: Computer Science, Computer Science (R0)
