A Nom historical document recognition system for digital archiving

  • Original Paper
  • Published in: International Journal on Document Analysis and Recognition (IJDAR)

Abstract

A Nom historical document recognition system is being developed for digital archiving. Its pipeline consists of image binarization, character segmentation, and character recognition. The system incorporates two versions of off-line character recognition: one for automatic recognition of scanned and segmented character patterns (7,660 categories) and one for user handwritten input (32,695 categories). The two are kept separate because, without reliable statistics on the Nom language, including infrequently appearing categories in automatic recognition increases the misrecognition rate. At the same time, a user must be able to check the results, select the correct category from the extended set, and input characters by hand. Both versions use the same recognition method but are trained on different sets of training patterns. Recursive XY cut and Voronoi diagrams are used for segmentation; a k-d tree and generalized learning vector quantization are used for coarse classification; and the modified quadratic discriminant function is used for fine classification. The system provides an interface through which a user can check the results, change binarization methods, rectify segmentation, and input correct character categories by hand. An evaluation on a limited number of Nom historical documents, for which ground truths were prepared, showed that the two stages of recognition combined with user checking and correction improved the recognition results significantly.

Acknowledgments

We thank the National Library of Vietnam and the Vietnamese Nom Preservation Foundation for providing the Nom historical document pages. This research is being supported by Grant-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (JSPS) (contract numbers (B) 24300095 and (S) 25220401).

Author information

Corresponding author

Correspondence to Masaki Nakagawa.

About this article

Van Phan, T., Cong Nguyen, K. & Nakagawa, M. A Nom historical document recognition system for digital archiving. IJDAR 19, 49–64 (2016). https://doi.org/10.1007/s10032-015-0257-8

  • DOI: https://doi.org/10.1007/s10032-015-0257-8
