Building a multi-modal Arabic corpus (MMAC)

AbdelRaouf, Ashraf; Higgins, Colin A.; Pridmore, Tony; Khalil, Mahmoud

doi:10.1007/s10032-010-0128-2

Ashraf AbdelRaouf¹˒², Colin A. Higgins¹, Tony Pridmore¹ & Mahmoud Khalil³


Abstract

Traditionally, a corpus is a large structured set of text, electronically stored and processed. Corpora have become very important in the study of languages. They have opened new areas of linguistic research, which were unknown until recently. Corpora are also key to the development of optical character recognition (OCR) applications. Access to a corpus of both language and images is essential during OCR development, particularly while training and testing a recognition application. Excellent corpora have been developed for Latin-based languages, but few relate to the Arabic language. This limits the penetration of both corpus linguistics and OCR in Arabic-speaking countries. This paper describes the construction of a multi-modal Arabic corpus (MMAC) that is suitable for use in both OCR development and linguistics, and provides a comprehensive study and analysis of it. MMAC currently contains six million Arabic words and, unlike previous corpora, also includes connected segments or pieces of Arabic words (PAWs) as well as naked pieces of Arabic words (NPAWs) and naked words (NWords), that is, PAWs and Words without diacritical marks. Multi-modal data is generated from both text, gathered from a wide variety of sources, and images of existing documents. Text-based data is complemented by a set of artificially generated images showing each of the Words, NWords, PAWs and NPAWs involved. Applications are provided to apply a natural-looking degradation to the generated images. A ground truth annotation is offered for each such image, while natural images showing small paragraphs and full pages are augmented with representations of the text they depict. A statistical analysis and verification of the dataset have been carried out and are presented. MMAC was also tested using commercial OCR software and is publicly and freely available.
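
To make the PAW and NPAW terminology concrete, the following is a minimal sketch of how a word might be split into connected segments and stripped of diacritical marks. It is not the authors' tooling. The function names `split_paws` and `strip_marks` are hypothetical, the letter list follows standard Arabic joining behaviour rather than anything stated in the abstract, and "naked" is taken here, following the abstract's wording, to mean removal of diacritical marks only.

```python
import unicodedata

# Letters that do not join to the following letter, so a connected segment
# ends after them. Assumption: standard Arabic joining behaviour, not MMAC's own list.
NON_CONNECTORS = frozenset(
    "\u0627\u0623\u0625\u0622\u0671"  # alef and its hamza, madda and wasla variants
    "\u062F\u0630\u0631\u0632"        # dal, thal, reh, zain
    "\u0648\u0624"                    # waw, waw with hamza
    "\u0629\u0649"                    # teh marbuta, alef maksura (word-final forms)
    "\u0621"                          # standalone hamza
)
# Standalone hamza joins on neither side, so a segment also ends before it.
ISOLATED = frozenset("\u0621")


def strip_marks(text):
    """Remove combining marks (fatha, kasra, shadda, ...) from the text."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def split_paws(word):
    """Split a single word into connected segments (PAWs).

    Diacritical marks stay attached to the letter they follow.
    """
    paws = []
    current = ""
    last_base = ""  # most recent non-combining character seen
    for ch in word:
        is_mark = unicodedata.combining(ch) != 0
        if not is_mark and current and (last_base in NON_CONNECTORS or ch in ISOLATED):
            paws.append(current)
            current = ""
        current += ch
        if not is_mark:
            last_base = ch
    if current:
        paws.append(current)
    return paws


if __name__ == "__main__":
    word = "\u0643\u0650\u062A\u0627\u0628"  # "kitab" (book) with a kasra on the kaf
    paws = split_paws(word)
    npaws = [strip_marks(p) for p in paws]
    print(paws, npaws)
```

In this sketch the word "kitab" yields two segments, because alef does not join to the letter that follows it. Applying `strip_marks` to each segment gives the corresponding mark-free forms.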



Author information

Authors and Affiliations

School of Computer Science, The University of Nottingham, Nottingham, UK
Ashraf AbdelRaouf, Colin A. Higgins & Tony Pridmore
Faculty of Computer Science, Misr International University, Cairo, Egypt
Ashraf AbdelRaouf
Computer and Systems Engineering Department, Faculty of Engineering, Ain Shams University, Cairo, Egypt
Mahmoud Khalil


Corresponding author

Correspondence to Ashraf AbdelRaouf.


About this article

Cite this article

AbdelRaouf, A., Higgins, C.A., Pridmore, T. et al. Building a multi-modal Arabic corpus (MMAC). IJDAR 13, 285–302 (2010). https://doi.org/10.1007/s10032-010-0128-2


Received: 03 June 2009
Revised: 03 September 2010
Accepted: 09 September 2010
Published: 29 September 2010
Issue Date: December 2010
DOI: https://doi.org/10.1007/s10032-010-0128-2
