Abstract
We present a page classification application in a banking workflow. The proposed architecture represents administrative document images by combining visual and textual descriptions. The visual description is based on a hierarchical representation of the pixel intensity distribution. The textual description uses latent semantic analysis to represent document content as a mixture of topics. We evaluate several off-the-shelf classifiers and different strategies for combining visual and textual cues. A final step applies an \(n\)-gram model of the page stream, which allows a finer-grained classification of pages. The proposed method has been tested in a real large-scale environment, and we report results on a dataset of 70,000 pages.
Notes
ABBYY FineReader Engine 9.
Acknowledgments
This work has been partially supported by the Spanish Ministry of Education and Science under projects TIN2011-24631, TIN2012-37475-C02-02, RYC-2009-05031 and RYC-2012-11776; by the People Programme (Marie Curie Actions) of the Seventh Framework Programme of the European Union (FP7/2007-2013) under REA grant agreement no. 600388; by the Agency of Competitiveness for Companies of the Government of Catalonia, ACCIÓ; and by the CREST project from the Japan Society for the Promotion of Science (JSPS).
Cite this article
Rusiñol, M., Frinken, V., Karatzas, D. et al. Multimodal page classification in administrative document image streams. IJDAR 17, 331–341 (2014). https://doi.org/10.1007/s10032-014-0225-8
DOI: https://doi.org/10.1007/s10032-014-0225-8