A Self-organizing Feature Map for Arabic Word Extraction

  • Conference paper
Text, Speech, and Dialogue (TSD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Abstract

Arabic word spotting is a key step in Arabic NLP and text recognition. Many recent studies have addressed segmentation problems in the Arabic language, but many issues remain unresolved. In this paper, we propose a new approach for segmenting an image of Arabic text into its constituent words. Our approach consists of two main steps. In the first step, a set of features is extracted from connected components using the run-length smoothing algorithm (RLSA). In the second step, spatially close connected components that are likely to belong to the same word are grouped together using a learning technique called the self-organizing feature map (Kohonen map). We evaluated our approach on 300 images of handwritten text with different sizes and fonts, taken from AHDB. Our results suggest that our approach can efficiently segment lines. Moreover, because our approach is based on a straightforward machine learning model, it should be possible to adapt it to other languages as well.
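
The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify which features are extracted or how the map is configured. The function names (rlsa_horizontal, train_som_1d, best_unit), the smoothing threshold, the choice of horizontal gap width as the only feature, and the two-unit, one-dimensional map are all assumptions made for illustration.

```python
import numpy as np


def rlsa_horizontal(binary, threshold):
    """Fill horizontal background runs of length <= threshold that lie
    between two foreground (1) pixels; returns a smoothed copy."""
    out = np.array(binary, dtype=np.uint8)
    for row in out:  # each row is a view, so writes land in `out`
        ink = np.flatnonzero(row)
        for left, right in zip(ink[:-1], ink[1:]):
            if 0 < right - left - 1 <= threshold:
                row[left + 1:right] = 1
    return out


def train_som_1d(samples, n_units=2, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a one-dimensional Kohonen map on scalar features.

    Assumption: the feature is a single number per sample (here, a gap
    width in pixels); the paper's actual feature set is not given.
    """
    x = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    # Deterministic start: units spread evenly over the data range.
    weights = np.linspace(x.min(), x.max(), n_units)
    grid = np.arange(n_units)
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                   # learning rate shrinks linearly
        sigma = max(sigma0 * decay, 1e-3)  # neighbourhood radius shrinks too
        for value in rng.permutation(x):
            bmu = np.argmin(np.abs(weights - value))  # best-matching unit
            h = np.exp(-((grid - bmu) ** 2) / (2.0 * sigma ** 2))
            weights += lr * h * (value - weights)
    return weights


def best_unit(weights, values):
    """Index of the closest map unit for each value."""
    values = np.asarray(values, dtype=float)
    return np.argmin(np.abs(values[:, None] - weights[None, :]), axis=1)


# Hypothetical gap widths (pixels) between neighbouring connected components.
gaps = [1, 2, 1, 2, 1, 12, 13, 12]
weights = train_som_1d(gaps)
labels = best_unit(weights, gaps)  # one map unit index per gap
```

In this sketch, gaps assigned to the unit with the smaller weight would be treated as spacing inside a word and the rest as word boundaries. A realistic system would likely use a richer feature vector per pair of components and a larger two-dimensional map.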

Author information

Corresponding authors

Correspondence to Hassina Bouressace or János Csirik.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Bouressace, H., Csirik, J. (2019). A Self-organizing Feature Map for Arabic Word Extraction. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_11

  • DOI: https://doi.org/10.1007/978-3-030-27947-9_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27946-2

  • Online ISBN: 978-3-030-27947-9

  • eBook Packages: Computer Science (R0)
