Abstract
In this paper, we present a technique for extracting document information content (i.e. text fields) via graph mining. Users first select a set of key text fields in the document image that they consider important. These fields are used to initialise a graph in which nodes are labelled with the field names along with other features such as size, type and number of words, and edges are attributed with the relative positioning between nodes. This attributed relational graph is then used to mine similar graphs from document images, and each newly extracted graph is used to update the initial graph iteratively, producing a graph model. The resulting graph models can then be applied in the absence of users. We validated the proposed technique and evaluated its scientific impact on a real-world industrial problem, obtaining 86.64 % precision and 90.80 % recall over all zones, viz. header, body and footer. In particular, the proposed technique is well suited to table processing (i.e. extracting repeated patterns from a table), where it outperforms the state-of-the-art method by just over 3 %.
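The graph initialisation described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Field` layout, the four-way relation vocabulary (`left`, `right`, `above`, `below`) and the sample invoice fields are all assumptions made for illustration, since the abstract names the node features and edge attributes but does not define them precisely.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Field:
    """A user-selected text field (hypothetical layout, not from the paper)."""
    label: str   # field name given by the user
    text: str    # recognised content of the field
    kind: str    # coarse content type, e.g. "numeric" or "alpha"
    x: float     # left edge of the bounding box
    y: float     # top edge (y grows downwards, as in image coordinates)
    w: float     # box width
    h: float     # box height


def relative_position(a: Field, b: Field) -> str:
    """Coarse position of b relative to a, based on bounding-box centres.

    The four-way vocabulary is an assumption; the paper may use a richer one.
    """
    dx = (b.x + b.w / 2) - (a.x + a.w / 2)
    dy = (b.y + b.h / 2) - (a.y + a.h / 2)
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "below" if dy > 0 else "above"


def build_graph(fields):
    """Build an attributed relational graph as two plain dictionaries."""
    # Node attributes named in the abstract: size, type, number of words.
    nodes = {
        f.label: {"size": (f.w, f.h), "type": f.kind, "n_words": len(f.text.split())}
        for f in fields
    }
    # One edge per pair of fields, attributed with their relative position.
    edges = {
        (a.label, b.label): relative_position(a, b)
        for a, b in combinations(fields, 2)
    }
    return nodes, edges


# Made-up fields standing in for a user's selection on one document image.
fields = [
    Field("invoice_no", "INV 2015 0042", "alphanumeric", 400, 50, 120, 20),
    Field("date", "12/03/2015", "numeric", 400, 90, 80, 20),
    Field("supplier", "ACME SARL", "alpha", 100, 50, 90, 20),
]
nodes, edges = build_graph(fields)
```

Plain dictionaries are used here only to keep the sketch dependency-free; mining similar graphs in new documents and updating the model iteratively, as the abstract describes, would operate on a structure of this kind.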
Acknowledgments
The author would like to thank Prof. Abdel Belaïd for his suggestions and his administrative assistance on the project assigned by ITESOFT, France. The author also thanks the National Institutes of Health (NIH) Fellows Editorial Board for their editorial assistance.
Ethics declarations
Conflict of interest
None declared.
About this article
Cite this article
Santosh, K.C. g-DICE: graph mining-based document information content exploitation. IJDAR 18, 337–355 (2015). https://doi.org/10.1007/s10032-015-0253-z
DOI: https://doi.org/10.1007/s10032-015-0253-z