Abstract
Word searching in documents with non-structural layouts, such as graphical documents, is difficult because text words appear in arbitrary orientations and graphical symbols are present. This paper presents an approach for word searching in such documents based on an efficient indexing and retrieval scheme. The proposed indexing scheme stores the spatial information of a document's text characters in a character spatial feature table (CSFT). The spatial feature of a text component is derived from information about its neighboring components. Multi-scaled and multi-oriented components are labeled as characters using support vector machines. For searching, the query string is split into its possible character pairs, from which positional information of the characters is obtained. Each character pair is used to locate the corresponding text in the document with the help of the CSFT. The retrieved text components are then joined into sequences by matching their spatial information. A string matching algorithm is applied to match the query word against the character pair sequences in the document. Experimental results are presented on two datasets of graphical documents: a maps dataset and a seal/logo image dataset. The results show that the method searches query words efficiently in unconstrained document layouts with arbitrary orientation.
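The retrieval steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how character pairs are formed or how the CSFT is laid out, so the sketch assumes adjacent-character pairs, uses a plain dictionary as a toy stand-in for the CSFT, and uses Levenshtein distance as a generic string matching step. All function names (`character_pairs`, `build_pair_index`, `find_sequences`, `edit_distance`) and the sample data are hypothetical.

```python
from collections import defaultdict


def character_pairs(query):
    """Split a query word into adjacent character pairs (an assumed pairing rule)."""
    return [query[i:i + 2] for i in range(len(query) - 1)]


def build_pair_index(components, neighbors):
    """Toy stand-in for a CSFT: map a two-character key to component id pairs.

    components: dict of component id -> recognized character label
    neighbors:  dict of component id -> list of neighboring component ids
    """
    index = defaultdict(list)
    for cid, label in components.items():
        for nid in neighbors.get(cid, []):
            index[label + components[nid]].append((cid, nid))
    return index


def find_sequences(query, index):
    """Chain pair hits that share a component, giving sequences of component ids."""
    pairs = character_pairs(query)
    if not pairs:
        return []
    sequences = [list(hit) for hit in index.get(pairs[0], [])]
    for pair in pairs[1:]:
        extended = []
        for seq in sequences:
            for first, second in index.get(pair, []):
                # The next pair must start where the current sequence ends.
                if first == seq[-1] and second not in seq:
                    extended.append(seq + [second])
        sequences = extended
    return sequences


def edit_distance(a, b):
    """Levenshtein distance between two strings (generic approximate matching)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical labeled components and their spatial neighbors.
    components = {0: "R", 1: "O", 2: "M", 3: "E", 4: "O"}
    neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
    index = build_pair_index(components, neighbors)
    for seq in find_sequences("ROME", index):
        word = "".join(components[c] for c in seq)
        print(seq, word, edit_distance("ROME", word))
```

Because the index is keyed on neighbor relations rather than on reading order, the lookup in this sketch does not depend on the orientation of the text. The exact chaining used here drops a candidate as soon as one pair is missing; a tolerant variant would keep partial chains and rank them with `edit_distance`.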
Cite this article
Roy, P.P., Pal, U. & Lladós, J. Word searching in unconstrained layout using character pair coding. IJDAR 17, 343–358 (2014). https://doi.org/10.1007/s10032-014-0227-6
DOI: https://doi.org/10.1007/s10032-014-0227-6