Comparative evaluation of tools for Arabic corpora search and analysis

Abstract

As the number of Arabic corpora continues to grow, so does the need for concordancing software that supports as many features of the Arabic language as possible and offers users a wide range of functions for corpus search and analysis. This paper evaluates six existing corpus search and analysis tools against eight criteria that appear to be the most essential for searching and analysing Arabic corpora, such as displaying Arabic text in its right-to-left direction, normalising diacritics and Hamza, and providing an Arabic user interface. The evaluation showed that three tools, Khawas, Sketch Engine, and aConCorde, met most of the criteria and achieved the highest benchmark scores. The paper concludes that the developers' deliberate consideration of the linguistic features of Arabic when designing these three tools was the most significant factor behind their superiority.
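To make one of the criteria named above more concrete, the following is a minimal Python sketch of what normalising diacritics and Hamza can mean for a concordance search. It is an illustration only and does not reproduce the behaviour of any of the evaluated tools. The function names (`normalise`, `concordance`), the choice to strip only the marks U+064B to U+0652, and the choice to fold only the Hamza-carrying Alef forms into bare Alef are all assumptions made for this example; real tools may normalise more or fewer characters.

```python
import re

# Assumption: treat the Arabic short-vowel marks, tanween, shadda and sukun
# (U+064B..U+0652) as the diacritics to strip before matching.
DIACRITICS = re.compile("[\u064B-\u0652]")

# Assumption: fold Alef with Hamza above/below and Alef with Madda
# into bare Alef. Other Hamza carriers are left untouched here.
HAMZA_MAP = str.maketrans({
    "\u0623": "\u0627",  # Alef with Hamza above -> Alef
    "\u0625": "\u0627",  # Alef with Hamza below -> Alef
    "\u0622": "\u0627",  # Alef with Madda -> Alef
})


def normalise(text: str) -> str:
    """Strip diacritics, then fold Hamza-carrying Alef forms (hypothetical helper)."""
    return DIACRITICS.sub("", text).translate(HAMZA_MAP)


def concordance(tokens, query, window=2):
    """Return (left context, keyword, right context) for every token whose
    normalised form equals the normalised query (hypothetical helper)."""
    target = normalise(query)
    hits = []
    for i, token in enumerate(tokens):
        if normalise(token) == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    # The corpus token carries a Hamza and a fatha; the query has neither.
    tokens = "ذهب أَحمد إلى المدرسة".split()
    for left, keyword, right in concordance(tokens, "احمد"):
        print(left, "|", keyword, "|", right)
```

With normalisation applied to both the query and the corpus tokens, a user who types a word without diacritics or Hamza still retrieves the fully written form, while the original spelling is preserved in the displayed hit.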

Notes

  1. The ALC may be accessed at http://www.arabiclearnercorpus.com.

  2. The manual can be accessed at http://www.lexically.net/wordsmith/step_by_step_Arabic6/index.html.

Acknowledgments

The authors would like to thank the developers, Abdulmohsen Althubaity, Andrew Roberts, Laurence Anthony, Mike Scott, Adam Kilgarriff and James Wilson for their valuable comments and suggestions to improve the quality of the paper.

Author information

Corresponding author

Correspondence to Abdullah Alfaifi.

About this article

Cite this article

Alfaifi, A., Atwell, E. Comparative evaluation of tools for Arabic corpora search and analysis. Int J Speech Technol 19, 347–357 (2016). https://doi.org/10.1007/s10772-015-9285-5

Keywords

  • Arabic
  • Tool
  • Corpus
  • Evaluation
  • Analysis