Exploring and exploiting a historical corpus for Arabic

  • Project Notes
Language Resources and Evaluation

Abstract

This paper presents a historical Arabic corpus named HAC. At this early stage of the project, we report on the design, the architecture, and some of the experiments we have conducted on HAC. The corpus, and accordingly the search results, will be represented in a primary XML exchange format. This format will serve as an intermediate exchange tool within the project and will allow users to process the results offline with external tools. HAC is made up of Classical Arabic texts that cover 1600 years of language use, the Quranic text, Modern Standard Arabic texts, and a variety of monolingual Arabic dictionaries. The development of this historical corpus helps linguists and Arabic language learners to explore, understand, and discover knowledge hidden in millions of instances of language use. We used techniques from the field of natural language processing to process the data and a graph-based representation for the corpus. We also provide researchers with an export facility that makes further linguistic analysis possible.
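
To make the idea of exporting search results in an XML exchange format concrete, the following is a minimal sketch in Python. It is not HAC's actual schema or code: the element names (`results`, `hit`, `left`, `match`, `right`), the `period` attribute, and the helper functions `kwic` and `hits_to_xml` are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def kwic(tokens, keyword, window=2):
    """Return (left, match, right) context triples for each exact match of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


def hits_to_xml(keyword, period, hits):
    """Serialize search hits to an XML string (hypothetical element names)."""
    root = ET.Element("results", {"query": keyword, "period": period})
    for left, match, right in hits:
        hit = ET.SubElement(root, "hit")
        ET.SubElement(hit, "left").text = left
        ET.SubElement(hit, "match").text = match
        ET.SubElement(hit, "right").text = right
    # encoding="unicode" returns a str and keeps the Arabic characters readable
    return ET.tostring(root, encoding="unicode")


# Toy sentence used only as sample input; it is not taken from the corpus.
tokens = "قال الرجل إن العلم نور وإن العلم حياة".split()
hits = kwic(tokens, "العلم")
xml_text = hits_to_xml("العلم", "hypothetical-period", hits)
print(xml_text)
```

A file of this kind can be parsed by any external XML-aware tool, which is the sense in which an exchange format supports offline processing of results.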

Notes

  1. http://nlp.stanford.edu/software/tagger.shtml.

Acknowledgments

The authors would like to thank the graduate students of the Linguistics Department at the University of Jordan for their help in compiling the Arabic historical corpus. We would also like to sincerely thank the anonymous reviewers of the first submission for their thoughtful comments, which helped improve our work.

Author information

Corresponding author

Correspondence to Bassam Hammo.

About this article

Cite this article

Hammo, B., Yagi, S., Ismail, O. et al. Exploring and exploiting a historical corpus for Arabic. Lang Resources & Evaluation 50, 839–861 (2016). https://doi.org/10.1007/s10579-015-9304-9

  • DOI: https://doi.org/10.1007/s10579-015-9304-9
