Abstract
Arabic word spotting is a key step in Arabic NLP and text recognition. Many recent studies have addressed segmentation problems in the Arabic language, but many issues remain unresolved. In this paper, we propose a new approach for segmenting an image of Arabic text into its constituent words. Our approach consists of two main steps. In the first step, a set of features is extracted from connected components using the run-length smoothing algorithm (RLSA). In the second step, spatially close connected components that are likely to belong to the same word are grouped together. This grouping is learned with a self-organizing feature map (Kohonen map). We evaluated our approach on 300 images of handwritten text, with different sizes and fonts, taken from the AHDB database. Our results suggest that our approach can efficiently segment lines. Moreover, as our approach is based on a straightforward machine learning model, it should be possible to adapt it to other languages as well.
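The two steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the feature set, the map size, or the training schedule, so everything below is an assumption made for illustration. The sketch works on a single binary pixel row, uses the horizontal gap between ink runs as the only feature, and trains a two-node, one-dimensional Kohonen map to separate small gaps from large ones. The function names (`rlsa_horizontal`, `ink_spans`, `train_som_1d`, `best_node`) and all parameter values are hypothetical.

```python
import math


def rlsa_horizontal(row, threshold):
    """Horizontal RLSA on one binary row (1 = ink, 0 = background).

    Background runs of length <= threshold that lie between two ink
    pixels are filled with ink; leading and trailing background is kept.
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_ink is not None and 0 < i - last_ink - 1 <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def ink_spans(row):
    """Return (start, end) pairs, end exclusive, for each run of ink."""
    spans, start = [], None
    for i, pixel in enumerate(row):
        if pixel == 1 and start is None:
            start = i
        elif pixel == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(row)))
    return spans


def train_som_1d(samples, n_nodes=2, epochs=50, lr0=0.5, sigma0=0.5):
    """Train a tiny 1-D Kohonen map on scalar features (n_nodes >= 2).

    Assumed schedule: nodes start evenly spaced between the smallest and
    largest sample; learning rate and neighbourhood width decay linearly.
    """
    lo, hi = min(samples), max(samples)
    weights = [lo + (hi - lo) * k / (n_nodes - 1) for k in range(n_nodes)]
    for t in range(epochs):
        decay = 1.0 - t / epochs
        lr, sigma = lr0 * decay, sigma0 * decay
        for x in samples:
            winner = min(range(n_nodes), key=lambda k: abs(weights[k] - x))
            for k in range(n_nodes):
                # Gaussian neighbourhood around the winning node
                h = math.exp(-((k - winner) ** 2) / (2.0 * sigma ** 2))
                weights[k] += lr * h * (x - weights[k])
    return weights


def best_node(weights, x):
    """Index of the map node whose weight is closest to x."""
    return min(range(len(weights)), key=lambda k: abs(weights[k] - x))


if __name__ == "__main__":
    row = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
    smoothed = rlsa_horizontal(row, threshold=2)
    print(ink_spans(smoothed))            # merged ink runs after smoothing

    gaps = [1, 2, 1, 2, 9, 10, 1, 11]     # made-up gap widths in pixels
    som = train_som_1d(gaps)
    print([best_node(som, g) for g in gaps])
```

In this toy setting, gaps mapped to the node with the smaller weight would be read as spacing inside a word and the others as spacing between words. A real pipeline would run on two-dimensional connected components and would likely use richer features than a single gap width.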
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Bouressace, H., Csirik, J. (2019). A Self-organizing Feature Map for Arabic Word Extraction. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_11
DOI: https://doi.org/10.1007/978-3-030-27947-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer Science (R0)