Building a multi-modal Arabic corpus (MMAC)

  • Original Paper
  • International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Traditionally, a corpus is a large, structured set of texts that is stored and processed electronically. Corpora have become very important in the study of languages and have opened areas of linguistic research that were unknown until recently. Corpora are also key to the development of optical character recognition (OCR) applications: access to a corpus of both language and images is essential during OCR development, particularly when training and testing a recognition application. Excellent corpora have been developed for Latin-based languages, but few exist for Arabic. This limits the penetration of both corpus linguistics and OCR in Arabic-speaking countries. This paper describes the construction of a multi-modal Arabic corpus (MMAC) that is suitable for use in both OCR development and linguistics, and provides a comprehensive study and analysis of it. MMAC currently contains six million Arabic words and, unlike previous corpora, also includes connected segments, or pieces of Arabic words (PAWs), as well as naked pieces of Arabic words (NPAWs) and naked words (NWords), that is, PAWs and Words without diacritical marks. Multi-modal data is generated from both text, gathered from a wide variety of sources, and images of existing documents. Text-based data is complemented by a set of artificially generated images showing each of the Words, NWords, PAWs and NPAWs involved. Applications are provided to apply natural-looking degradation to the generated images. A ground truth annotation is offered for each such image, while natural images showing small paragraphs and full pages are augmented with representations of the text they depict. A statistical analysis and verification of the dataset have been carried out and are presented. MMAC was also tested using commercial OCR software and is publicly and freely available.
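
The four token types named in the abstract (Words, NWords, PAWs and NPAWs) can be illustrated with a minimal Python sketch. It is not the MMAC tooling. It assumes that "diacritical marks" means Unicode combining marks such as the short-vowel signs, and that a PAW ends after any letter that does not join to the letter following it; the corpus itself may define both notions differently. The function names and the letter sets below are hypothetical and chosen only for illustration.

```python
import unicodedata

# Assumption: letters that join only to the preceding letter, so the
# connected segment (PAW) ends after them. Standard Arabic script behaviour,
# not a definition taken from the corpus.
RIGHT_JOINING = set("اأإآدذرزوؤة")
# Assumption: the stand-alone hamza joins to neither neighbour.
ISOLATED = set("ء")


def strip_diacritics(text):
    """Return text without combining marks (one reading of a 'naked' form)."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def split_paws(word):
    """Split one Arabic word into its connected segments (PAWs)."""
    paws = []
    current = ""
    break_pending = False  # True when the previous letter cannot join forward
    for ch in word:
        if unicodedata.combining(ch):
            current += ch  # a diacritic stays with the letter it follows
            continue
        if current and (break_pending or ch in ISOLATED):
            paws.append(current)
            current = ""
        current += ch
        break_pending = ch in RIGHT_JOINING or ch in ISOLATED
    if current:
        paws.append(current)
    return paws


def naked_paws(word):
    """Split a word into PAWs, then strip diacritics from each (NPAWs)."""
    return [strip_diacritics(paw) for paw in split_paws(word)]


if __name__ == "__main__":
    word = "كِتَاب"  # "book", written with two vowel marks
    print(strip_diacritics(word))  # the naked word (NWord)
    print(split_paws(word))        # PAWs, diacritics kept
    print(naked_paws(word))        # NPAWs
```

Hard-coding the letter sets keeps the sketch dependency-free, since Python's standard unicodedata module does not expose the Unicode joining type of a character.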


Author information

Corresponding author

Correspondence to Ashraf AbdelRaouf.

About this article

Cite this article

AbdelRaouf, A., Higgins, C.A., Pridmore, T. et al. Building a multi-modal Arabic corpus (MMAC). IJDAR 13, 285–302 (2010). https://doi.org/10.1007/s10032-010-0128-2

  • DOI: https://doi.org/10.1007/s10032-010-0128-2