An Algorithm for Headline and Column Separation in Bangla Documents

  • Farjana Yeasmin Omee
  • Md. Shiam Shabbir Himel
  • Md. Abu Naser Bikas
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 182)

Abstract

With the progress of digitization, archiving Bangla newspapers and other Bangla documents has become necessary. The first step in reading a Bangla newspaper is to detect the headlines and columns on a multi-column page. However, no algorithm developed so far for Bangla OCR can fully read a Bangla newspaper. In this paper we present an algorithmic approach to headline and multi-column detection for Bangla newspapers and magazines. It separates headlines from news text and detects the individual columns of a multi-column layout. The algorithm relies on the empty space between a headline and the columns below it, and between adjacent columns.
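The abstract gives only the underlying idea, separation based on empty space, so the following is a minimal sketch of that general idea and not the authors' algorithm. It assumes a binarized, deskewed page given as a list of rows with 1 for ink and 0 for background. The function names and the two thresholds (`min_space_height`, `min_space_width`) are hypothetical. The page is first cut into horizontal bands wherever enough consecutive rows are empty, and each band is then cut into columns wherever enough consecutive pixel columns are empty.

```python
from typing import List, Tuple

Image = List[List[int]]  # binarized page: 1 = ink pixel, 0 = background


def content_runs(profile: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Split a projection profile into (start, end) runs of content.

    Runs are separated by at least `min_gap` consecutive empty (zero)
    positions; shorter gaps stay inside a run. `end` is exclusive.
    """
    runs = []
    start = None     # start index of the current content run
    last_ink = None  # index of the most recent non-empty position
    for i, value in enumerate(profile):
        if value == 0:
            continue
        if start is None:
            start = i
        elif i - last_ink - 1 >= min_gap:
            # The empty stretch just passed is wide enough to be a separator.
            runs.append((start, last_ink + 1))
            start = i
        last_ink = i
    if start is not None:
        runs.append((start, last_ink + 1))
    return runs


def row_profile(image: Image) -> List[int]:
    """Ink count per pixel row (horizontal projection)."""
    return [sum(row) for row in image]


def column_profile(image: Image) -> List[int]:
    """Ink count per pixel column (vertical projection)."""
    return [sum(col) for col in zip(*image)]


def separate(image: Image, min_space_height: int, min_space_width: int):
    """Return (top, bottom, columns) for each horizontal band of the page.

    `columns` lists the (left, right) extents found inside that band.
    """
    blocks = []
    for top, bottom in content_runs(row_profile(image), min_space_height):
        band = image[top:bottom]
        columns = content_runs(column_profile(band), min_space_width)
        blocks.append((top, bottom, columns))
    return blocks


# Toy page: one full-width line, a two-row gap, then two text columns.
page = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
]
print(separate(page, min_space_height=2, min_space_width=2))
# prints [(0, 1, [(0, 7)]), (3, 5, [(0, 3), (5, 7)])]
```

In this toy page the first band spans the full width as a single block, which is how a headline above a multi-column body would typically appear, while the second band splits into two columns. On real scans the thresholds would need tuning to the font size and resolution, and noise removal would be required before the projections are meaningful.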

Keywords

Space Width; Column Separation; Text Block; Page Layout; Space Height
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Farjana Yeasmin Omee (1)
  • Md. Shiam Shabbir Himel (1)
  • Md. Abu Naser Bikas (1)
  1. Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh
