Spotting Where to Read on Pages - Retrieval of Relevant Parts from Page Images

  • Koichi Kise
  • Masaaki Tsujino
  • Keinosuke Matsumoto
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2423)


This paper presents a new method of document image retrieval that can spot the parts of page images relevant to a user’s query. This improves the usability of retrieval, since a user can find where to read on the retrieved pages. It can also improve the effectiveness of retrieval, because the method is little influenced by irrelevant parts of pages. The method is based on the assumption that parts of page images that densely contain the keywords of a query are relevant to it. The proposed method has two characteristics: (1) two-dimensional density distributions of keywords are calculated to rank parts of page images; (2) the method relies only on the distribution of characters, so it is not affected by errors of layout analysis. Experimental results on retrieving Japanese newspaper articles show that the proposed method is superior to a method that cannot deal with parts of pages, and is sometimes equivalent to a method of electronic document retrieval that works on error-free text.
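The density assumption stated in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the grid representation, the triangular window, the per-hit weight, and the names `density_map` and `peak` are all assumptions made for illustration. Each keyword occurrence is treated as a weighted point on a coarse page grid, the points are smoothed into a two-dimensional density map, and the densest cell is taken as the place to read.

```python
def density_map(hits, width, height, radius):
    """Accumulate weighted keyword hits into a 2D density grid.

    hits:   list of (x, y, weight) tuples, one per keyword occurrence,
            with coordinates in grid cells (hypothetical representation;
            the weight could be, for example, an IDF value).
    radius: half-width of a triangular smoothing window (assumed shape).
    """
    grid = [[0.0] * width for _ in range(height)]
    for x, y, w in hits:
        for gy in range(max(0, y - radius), min(height, y + radius + 1)):
            # Vertical window weight: 1 at the hit, falling off linearly.
            wy = 1.0 - abs(gy - y) / (radius + 1)
            for gx in range(max(0, x - radius), min(width, x + radius + 1)):
                # Horizontal window weight, same linear falloff.
                wx = 1.0 - abs(gx - x) / (radius + 1)
                grid[gy][gx] += w * wx * wy
    return grid


def peak(grid):
    """Return (score, x, y) for the cell with the highest density."""
    return max((v, x, y) for y, row in enumerate(grid) for x, v in enumerate(row))


# Three keyword hits clustered on one text line, plus one isolated hit.
hits = [(2, 2, 1.0), (3, 2, 1.0), (4, 2, 1.0), (8, 7, 1.0)]
grid = density_map(hits, width=10, height=10, radius=2)
score, x, y = peak(grid)
print(score, x, y)
```

In this toy example the peak falls inside the cluster rather than on the isolated hit, which is the behavior the density assumption relies on. Because the map is built from character positions alone, no layout analysis step is involved.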


Keywords: Document Image; Vector Space Model; Inverse Document Frequency; Document Retrieval; Recall Level
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Koichi Kise (1)
  • Masaaki Tsujino (1)
  • Keinosuke Matsumoto (1)
  1. Department of Computer and Systems Sciences, Graduate School of Engineering, Osaka Prefecture University, Sakai, Japan
