Building a multi-modal Arabic corpus (MMAC)

  • Ashraf AbdelRaouf
  • Colin A. Higgins
  • Tony Pridmore
  • Mahmoud Khalil
Original Paper


Traditionally, a corpus is a large, structured set of texts that is stored and processed electronically. Corpora have become very important in the study of languages and have opened areas of linguistic research that were unknown until recently. Corpora are also key to the development of optical character recognition (OCR) applications. Access to a corpus of both language and images is essential during OCR development, particularly while training and testing a recognition application. Excellent corpora have been developed for Latin-based languages, but few relate to the Arabic language. This limits the penetration of both corpus linguistics and OCR in Arabic-speaking countries. This paper describes the construction of a multi-modal Arabic corpus (MMAC) that is suitable for use in both OCR development and linguistics, and provides a comprehensive study and analysis of it. MMAC currently contains six million Arabic words and, unlike previous corpora, also includes connected segments or pieces of Arabic words (PAWs), as well as naked pieces of Arabic words (NPAWs) and naked words (NWords), that is, PAWs and Words without diacritical marks. Multi-modal data is generated from both text, gathered from a wide variety of sources, and images of existing documents. Text-based data is complemented by a set of artificially generated images showing each of the Words, NWords, PAWs and NPAWs involved. Applications are provided to apply natural-looking degradation to the generated images. A ground truth annotation is offered for each such image, while natural images showing small paragraphs and full pages are augmented with representations of the text they depict. A statistical analysis and verification of the dataset has been carried out and is presented. MMAC was also tested using commercial OCR software and is publicly and freely available.


Corpora, Arabic, Linguistics, Pattern recognition, OCR





Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  • Ashraf AbdelRaouf (1, 2; corresponding author)
  • Colin A. Higgins (1)
  • Tony Pridmore (1)
  • Mahmoud Khalil (3)
  1. School of Computer Science, The University of Nottingham, Nottingham, UK
  2. Faculty of Computer Science, Misr International University, Cairo, Egypt
  3. Computer and Systems Engineering Department, Faculty of Engineering, Ain Shams University, Cairo, Egypt
