Multimedia Tools and Applications, Volume 76, Issue 3, pp 4123–4139

Multilingual corpus construction based on printed and handwritten character separation

  • Yuping Lin
  • Yonghong Song
  • Yingyu Li
  • Fang Wang
  • Kai He


This paper proposes an effective method for extracting printed and handwritten characters from multilingual document images in order to build a corpus. A connected component analysis first removes graphics from the document images. Multiple types of features and the AdaBoost algorithm are then used to classify printed and handwritten characters in a more versatile and robust way. First, the image content is divided into text patches, which are used to distinguish between languages. Second, classifiers are trained on the segmented patches using the multiple feature types and AdaBoost. Finally, the trained classifiers separate the printed and handwritten parts of a new image set. The proposed method improves the precision with which written material is extracted from text images in different languages. Experimental results show that the proposed method achieves higher precision and recall than state-of-the-art methods.
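The patch classification stage can be illustrated with a minimal sketch. This is not the authors' implementation: the two per-patch features (stroke-width variance and baseline deviation), the toy values, and all function names are hypothetical, and one-feature threshold stumps are assumed as the weak learners because the abstract does not name them.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    # Weak learner: threshold a single feature; returns +1 or -1.
    return polarity if row[feature] >= threshold else -polarity

def train_adaboost(X, y, rounds=10):
    """Train AdaBoost with decision stumps.

    X: list of feature vectors, y: labels in {+1, -1}.
    Returns a list of (alpha, feature, threshold, polarity) tuples.
    """
    n = len(X)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None  # (weighted error, feature, threshold, polarity)
        for f in range(len(X[0])):
            for thr in sorted({row[f] for row in X}):
                for pol in (1, -1):
                    err = sum(w for w, row, label in zip(weights, X, y)
                              if stump_predict(row, f, thr, pol) != label)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol)
        err, f, thr, pol = best
        # Clip the error so the logarithm below stays finite.
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, pol))
        # Increase the weight of misclassified samples, then renormalise.
        weights = [w * math.exp(-alpha * label * stump_predict(row, f, thr, pol))
                   for w, row, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
        if err <= 1e-10:  # a perfect stump was found; stop early
            break
    return model

def predict(model, row):
    # Weighted vote of all stumps: +1 = handwritten, -1 = printed.
    score = sum(alpha * stump_predict(row, f, thr, pol)
                for alpha, f, thr, pol in model)
    return 1 if score >= 0 else -1

# Hypothetical features per patch: [stroke-width variance, baseline deviation]
X = [[0.10, 0.05], [0.15, 0.08], [0.12, 0.04],   # printed patches
     [0.60, 0.40], [0.55, 0.35], [0.70, 0.50]]   # handwritten patches
y = [-1, -1, -1, 1, 1, 1]

model = train_adaboost(X, y)
print(predict(model, [0.65, 0.45]))  # new patch with high variance
print(predict(model, [0.11, 0.06]))  # new patch with low variance
```

On real document images the feature vectors would come from the segmented text patches after graphics removal, and a separate classifier could be trained per language, as the abstract describes.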


Keywords: Multilingual corpus, Machine printed character, Handwritten character, Character extraction, AdaBoost



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Yuping Lin (1)
  • Yonghong Song (2)
  • Yingyu Li (1)
  • Fang Wang (1)
  • Kai He (2)

  1. School of Foreign Studies, Xi’an Jiaotong University, Xi’an, China
  2. School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
