A Fast Japanese Word Extraction with Classification to Similarly-Shaped Character Categories and Morphological Analysis

  • Masaharu Ozaki
  • Katsuhiko Itonori
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1655)

Abstract

A fast technique for extracting words from Japanese document images is described. It classifies each character image not into an individual character but into a category of similarly shaped characters. Morphological analysis is then performed on the sequence of categories to obtain word candidates. Detailed classification is performed on character images that cannot be identified as single characters. A multi-template methodology and hierarchical classification are combined to make the classifier accurate and fast with low-dimensional vectors. In experiments on the learning samples, the classification accuracy was 99.3% and the speed was eight times faster than traditional Japanese OCRs. In experiments on test samples made from forty newspaper articles, classification was still eight times faster. The morphological analysis greatly reduced the number of character candidates: 85% of characters were identified as single characters on the newspaper article images.
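
The idea of resolving characters from a sequence of shape categories can be illustrated with a minimal sketch. It is not the authors' implementation: the category groupings, the toy word list, and the function names build_index and candidates_per_position are all assumptions made for illustration, and a plain whole-word dictionary lookup stands in for real morphological analysis.

```python
from collections import defaultdict

# Hypothetical categories of similarly shaped characters (illustrative only;
# the groupings used in the paper are not given in the abstract).
CATEGORIES = [
    {"人", "入"},  # category 0
    {"口", "ロ"},  # category 1
    {"工", "エ"},  # category 2
    {"力", "カ"},  # category 3
    {"学"},        # category 4
]
CHAR_TO_CAT = {ch: i for i, group in enumerate(CATEGORIES) for ch in group}

# Toy word list standing in for a morphological dictionary (assumption).
WORDS = ["人口", "入口", "工学", "学力", "入力"]


def build_index(words):
    """Index each word by the tuple of categories of its characters."""
    index = defaultdict(list)
    for word in words:
        key = tuple(CHAR_TO_CAT[ch] for ch in word)
        index[key].append(word)
    return index


def candidates_per_position(cat_seq, index):
    """Return the matching words and, per position, the characters they allow."""
    words = index.get(tuple(cat_seq), [])
    per_pos = [sorted({w[i] for w in words}) for i in range(len(cat_seq))]
    return words, per_pos


INDEX = build_index(WORDS)

# The coarse classifier reports categories 0 and 1 for two character images.
words, per_pos = candidates_per_position([0, 1], INDEX)
for i, chars in enumerate(per_pos):
    status = "single character" if len(chars) == 1 else "needs detailed classification"
    print(i, chars, status)
```

In this toy case the second position is narrowed to a single character by the dictionary alone, while the first position keeps two candidates and would be passed on to detailed classification.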

Keywords

Single Character, Learning Sample, Character Candidate, Shaped Character, Initial Cluster Center
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Masaharu Ozaki (1)
  • Katsuhiko Itonori (2)
  1. Development Center for IT Business, Japan
  2. Office Document Products Group, Fuji Xerox Co., Ltd, Ashigara-kami-gun, Kanagawa, Japan
