Creating language resources for under-resourced languages: methodologies, and experiments with Arabic

  • Original Paper
  • Published in Language Resources and Evaluation

Abstract

Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval, and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.

Notes

  1. Other languages cited in the literature as suffering from a lack of resources include Sorani Kurdish (Walther and Sagot 2010) and Swahili (Nganga 2012). As can be seen from the META-NET whitepaper series (http://www.meta-net.eu/whitepapers/key-results-and-cross-language-comparison), some European languages also suffer from weak or no support.

  2. http://www.ircs.upenn.edu/arabic/

  3. http://ufal.mff.cuni.cz/padt/PADT_1.0/

  4. The shortage of resources and tools for Arabic may be due in part to its complex morphology, the absence of diacritics (vowels) in written text, and the fact that Arabic does not use capitalisation, which causes problems for named entity recognition (Benajiba et al. 2009).

  5. http://www.nist.gov/tac/

  6. http://duc.nist.gov/

  7. http://www.nist.gov/tac/2011/Summarization/

  8. http://www.mturk.com/

  9. See also Albakour et al. (2010).

  10. http://www.wikipedia.org/

  11. http://corpus.quran.com/

  12. http://www.internetworldstats.com/

  13. http://www.sakhr.com/

  14. http://www.rdi-eg.com/technologies/arabic_nlp.htm

  15. Other languages also suffer from a lack of resources for plagiarism detection, including Basque for example (Barrón-Cedeño et al. 2010).

  16. An example of a large website with liberal copyright terms is Wikipedia (http://www.wikipedia.org/). It is important to note that corpora drawn from such sites must still comply with the terms of the copyright.

  17. An example of such a source is the Wikinews website (http://www.wikinews.org/). To illustrate a potential problem, we were granted the use of news articles from a large UK-based website, but our rights did not permit further distribution of those articles. These articles had to be removed from the public versions of our datasets.

  18. There are some Bayesian techniques that might be adapted to help address this issue, but these lie outside the scope of our work (for example, Carpenter 2008, and related work).

  19. http://www.alrai.com/

  20. http://www.alwatan.com.sa/

  21. We originally planned to include news articles from the BBC website (http://www.bbc.co.uk/news/), and we obtained permission to use these articles for our work on summarisation. Unfortunately, the terms of use were such that it would have been impractical to distribute the corpus to others. For this reason, these articles and their associated summaries were excluded from the final corpus.

  22. Appendix 1 shows the guidelines given to the workers for completing the task, together with an example HIT. Payments made to the workers depended on the document size, ranging from £0.04 to £0.33 per task, with an estimated overall cost of £200.

  23. To illustrate the thresholds, assume we provided three participants with a document of six sentences. Following the EASC summarisation guidelines in "Appendix 1", each participant must select up to three sentences as a summary. For example, consider the following selections, where A, B and C are three participants and the numbers identify the sentences they selected: A(1,3,5), B(1,2,3), C(1,4,5). The three gold-standard summaries are then as follows: the Level 3 summary consists of sentence 1 only (selected by all three participants); the Level 2 summary contains sentences 1, 3 and 5 (each selected by at least two participants); and the All summary contains sentences 1, 2, 3, 4 and 5 (each selected by at least one participant).
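
      As a minimal sketch of this thresholding (our own illustration, not the original corpus-construction code, assuming the reading above that a Level-n summary contains the sentences selected by at least n participants), the gold-standard summaries can be derived by counting votes per sentence:

          from collections import Counter

          def gold_summaries(selections):
              """Derive the Level 3, Level 2 and All gold-standard summaries
              from per-participant lists of selected sentence numbers."""
              votes = Counter(s for selection in selections for s in selection)
              return {
                  "Level 3": sorted(s for s, n in votes.items() if n >= 3),
                  "Level 2": sorted(s for s, n in votes.items() if n >= 2),
                  "All": sorted(votes),
              }

          # The example from this note: A(1,3,5), B(1,2,3), C(1,4,5)
          print(gold_summaries([[1, 3, 5], [1, 2, 3], [1, 4, 5]]))
          # {'Level 3': [1], 'Level 2': [1, 3, 5], 'All': [1, 2, 3, 4, 5]}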

  24. EASC can be obtained from http://sourceforge.net/projects/easc-corpus/. To simplify the use of EASC when evaluating summarisers, the file names and extensions are formatted to be compatible with evaluation systems such as ROUGE (Lin 2004) and AutoSummENG (Giannakopoulos et al. 2008). It is also available in two character encodings, UTF-8 and ISO-8859-6 (Arabic).
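
      To sketch the kind of evaluation this formatting supports (a simplified, self-contained illustration of ROUGE-1 recall using naive whitespace tokenisation; it is not the official ROUGE toolkit):

          from collections import Counter

          def rouge1_recall(candidate: str, reference: str) -> float:
              """ROUGE-1 recall: the fraction of reference unigrams (counted
              with multiplicity) that also occur in the candidate summary."""
              cand = Counter(candidate.split())
              ref = Counter(reference.split())
              overlap = sum(min(count, cand[token]) for token, count in ref.items())
              return overlap / max(sum(ref.values()), 1)

          print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # 0.5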

  25. The DUC-2002 dataset was as provided by the National Institute of Standards and Technology (NIST—http://www.nist.gov/index.html) through the Document Understanding Conference (DUC).

  26. See http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html. Note that there are some discrepancies in the published statistics for the corpus: NIST withdrew some of the documents, leaving just 59 reference sets rather than the original 60.

  27. http://trec.nist.gov/pubs/trec9/t9_proceedings.html

  28. http://code.google.com/p/google-api-translate-java/

  29. http://www.nist.gov/tac/2011/Summarization/index.html

  30. http://www.wikinews.org/

  31. Although the gold-standard summaries were extractive in nature, the TAC-2011 MultiLing Summarisation Pilot allowed participating systems to use other approaches.

  32. The corpus can be downloaded directly after completing the relevant forms, as required by the MultiLing organisers (http://multiling.iit.demokritos.gr/file/all/).

References

  • Abouenour, L., Bouzoubaa, K., & Rosso, P. (2013). On the evaluation and improvement of Arabic wordnet coverage and usability. Language Resources and Evaluation, 47(3), 891–917.

  • Abuleil, S., Alsamara, K., & Evens, M. (2002). Acquisition system for Arabic noun morphology. In Proceedings of the ACL-02 workshop on computational approaches to semitic languages, association for computational linguistics. Stroudsburg, PA, SEMITIC’02, pp. 1–8.

  • Aker, A., & Gaizauskas, R. J. (2010). Model summaries for location-related images. In The 7th international language resources and evaluation conference (LREC 2010) (pp. 3119–3124). Valletta, Malta.

  • Aker, A., El-Haj, M., Kruschwitz, U., & Albakour, D. (2012). Assessing crowdsourcing quality through objective tasks. In 8th Language resources and evaluation conference, LREC 2012, Istanbul, Turkey.

  • Al-Ameed, H., Al-Ketbi, S., Al-Kaabi, A., Al-Shebli, K., Al-Shamsi, N., Al-Nuaimi, N., & Al-Muhairi, S. (2006). Arabic light stemmer: A new enhanced approach. In The 2nd international conference on innovations in information technology, IIT’05, Dubai, United Arab Emirates.

  • Al-Shammari, E., & Lin, J. (2008). Towards an error-free Arabic stemming. In F. Lazarinis, E. Efthimiadis, J. Vilares, & J. Tait (Eds.), Proceedings of the 2nd ACM workshop on improving non-English web searching, iNEWS 2008, Napa Valley, California, USA, October 30, 2008, ACM, pp. 9–16.

  • Al-Sulaiti, L., Atwell, E., & Steven, E. (2006). The design of a corpus of contemporary Arabic. International Journal of Corpus Linguistics, 11(2), 135–171.

  • Albakour, M., Kruschwitz, U., & Lucas, S. (2010). Sentence-level attachment prediction. In: H. Cunningham, A. Hanbury, S. Rüger (Eds.), Advances in multidisciplinary retrieval, lecture notes in computer science (Vol. 6107, pp. 6–19). Berlin: Springer

  • Alghamdi, M., Chafic, M., & Mohamed, M. (2009). Arabic language resources and tools for speech and natural language: KACST and Balamand. In The 2nd international conference on Arabic language resources and tools, Cairo, Egypt.

  • Alonso, O., & Mizzaro, S. (2009). Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In SIGIR ’09: Workshop on the future of IR evaluation.

  • Althobaiti, M., Kruschwitz, U., & Poesio, M. (2014). AraNLP: A Java-based library for the processing of Arabic text. In Proceedings of the 9th language resources and evaluation conference (LREC), Reykjavik.

  • Attia, M. (2007). Arabic tokenization system. In Proceedings of the 2007 workshop on computational approaches to semitic languages: Common issues and resources, association for computational linguistics, Stroudsburg, PA, USA, Semitic ’07, pp. 65–72.

  • Banzhaf, W., Francone, F., Keller, R., & Nordin, P. (1998). Genetic programming: An introduction—On the automatic evolution of computer programs and its applications. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

  • Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation (LREV), 43(3), 209–226. doi:10.1007/s10579-009-9081-4.

  • Barrera, A., & Verma, R. (2011). Automated extractive single-document summarization: Beating the baselines with a new approach. In Proceedings of the 2011 ACM symposium on applied computing, ACM, TaiChung, Taiwan, SAC’11, pp. 268–269.

  • Barrón-Cedeño, A., Rosso, P., Agirre, E., & Labaka, G. (2010). Plagiarism detection across distant language pairs. In Proceedings of the 23rd international conference on computational linguistics, COLING ’10 (pp. 37–45). Beijing, China: Association for Computational Linguistics.

  • Beesley, K. (1998). Arabic morphology using only finite-state operations. In Proceedings of the workshop on computational approaches to semitic languages, association for computational linguistics, Stroudsburg, PA, USA, Semitic ’98, pp. 50–57.

  • Benajiba, Y., Diab, M., & Rosso, P. (2009). Arabic named entity recognition: A feature-driven study. IEEE Transactions on Audio, Speech, and Language Processing, 17(5), 926–934.

  • Benajiba, Y., Zitouni, I., Diab, M., & Rosso, P. (2010). Arabic named entity recognition: using features extracted from noisy data. In Proceedings of the ACL 2010 conference short papers, association for computational linguistics, pp. 281–285.

  • Benmamoun, E. (2007). The syntax of Arabic tense. Cahiers de Linguistique de L’INALCO, 5, 9–25.

  • Bensalem, I., Rosso, P., & Chikhi, S. (2013). A new corpus for the evaluation of Arabic intrinsic plagiarism detection. In P. Forner, H. Müller, R. Paredes, P. Rosso, B. Stein (Eds.), Information access evaluation. Multilinguality, multimodality, and visualization, lecture notes in computer science (Vol. 8138, pp. 53–58). Berlin: Springer.

  • Bossard, A., & Rodrigues, C. (2010). Combining a multi-document update summarization system “CBSEAS” with a genetic algorithm. In International workshop on combinations of intelligent methods and applications, Hyper Articles en Ligne, Arras, France, CIMA 2010.

  • Boyer, A., & Brun, A. (2007). Natural language processing for usage based indexing of web resources. In G. Amati, C. Carpineto, G. Romano (Eds.), Advances in information retrieval, lecture notes in computer science (Vol. 4425, pp. 517–524). Berlin: Springer. doi:10.1007/978-3-540-71496-5_46.

  • Buckwalter, T., & Parkinson, D. (2011). A frequency dictionary of Arabic: Core vocabulary for learners. Routledge Frequency Dictionaries, Routledge, http://books.google.co.uk/books?id=Kj_NRwAACAAJ

  • Buhay, E., Evardone, M., Nocon, H., Dimalen, D., & Roxas, R. (2010). Autolex: An automatic lexicon builder for minority languages using an open corpus. In: PACLIC’10, pp. 603–611.

  • Callison-Burch, C. (2009). Fast, cheap, and creative: Evaluating translation quality using Amazon’s Mechanical Turk. In Proceedings of the 2009 conference on empirical methods in natural language processing: Volume 1, association for computational linguistics, Stroudsburg, PA, USA, EMNLP ’09, pp. 286–295.

  • Calzolari, N., Soria, C., Del Gratta, R., Goggi, S., Quochi, V., Russo, I., Choukri, K., Mariani, J., & Piperidis, S. (2010). The LREC 2010 resource map. In The 7th international language resources and evaluation conference (LREC 2010), Valletta, Malta, pp. 949–956.

  • Carpenter, B. (2008). Multilevel Bayesian models of categorical data annotation. Available at http://lingpipe-blog.com/lingpipe-white-papers

  • de Chalendar, G., & Nouvel, D. (2009). Modular resource development and diagnostic evaluation framework for fast NLP system improvement. In Proceedings of the workshop on software engineering, testing, and quality assurance for natural language processing, association for computational linguistics, Stroudsburg, PA, USA, SETQA-NLP ’09, pp. 65–73.

  • Chamberlain, J., Fort, K., Kruschwitz, U., Lafourcade, M., & Poesio, M. (2013). Using games to create language resources: Successes and limitations of the approach. In The people’s web meets NLP, theory and applications of natural language processing: Springer, pp. 3–44.

  • Chiarcos, C., Eckart, K., & Ritz, J. (2010). Creating and exploiting a resource of parallel parses. In Proceedings of the fourth linguistic annotation workshop, association for computational linguistics, Stroudsburg, PA, USA, LAW IV ’10, pp. 166–171.

  • Darwish, K., Hassan, H., & Emam, O. (2005). Examining the effect of improved context sensitive morphology on Arabic information retrieval. In Proceedings of the ACL workshop on computational approaches to semitic languages, association for computational linguistics, Stroudsburg, PA, USA, Semitic ’05, pp. 25–30.

  • Diab, M., Hacioglu, K., & Jurafsky, D. (2007). Automatic processing of modern standard Arabic text. In A. Soudi, A. van den Bosch, & G. Neumann (Eds.), Arabic computational morphology: Knowledge-based and empirical methods, text, speech and language technology (pp. 159–179). Netherlands: Springer.

  • Diehl, F., Gales, M., Tomalin, M., & Woodland, P. (2012). Morphological decomposition in Arabic ASR systems. Computer Speech and Language, 26(4), 229–243.

  • Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th international conference on computational linguistics, association for computational linguistics, Stroudsburg, PA, USA, COLING ’04.

  • Dukes, K., Atwell, E., & Habash, N. (2013). Supervised collaboration for syntactic annotation of Quranic Arabic. Language Resources and Evaluation Journal (LREV), 47(1), 33–62. Special issue on collaboratively constructed language resources.

  • El-Haj, M., Kruschwitz, U., & Fox, C. (2010). Using Mechanical Turk to create a corpus of Arabic summaries. In Language resources (LRs) and human language technologies (HLT) for semitic languages workshop held in conjunction with the 7th international language resources and evaluation conference (LREC 2010) (pp. 36–39). Valletta, Malta.

  • El-Haj, M., Kruschwitz, U., & Fox, C. (2011a). Exploring clustering for multi-document Arabic summarisation. In M. Salem, K. Shaalan, F. Oroumchian, A. Shakery, H. Khelalfa (Eds.), The 7th Asian information retrieval societies (AIRS 2011), lecture notes in computer science (Vol. 7097, pp. 550–561). Berlin: Springer.

  • El-Haj, M., Kruschwitz, U., & Fox, C. (2011b). Multi-document Arabic text summarisation. In The 3rd computer science and electronic engineering conference (CEEC’11), IEEE Xplore, Colchester, UK.

  • El-Haj, M., Kruschwitz, U., & Fox, C. (2011c). University of Essex at the TAC 2011 multilingual summarisation pilot. In Text analysis conference (TAC) 2011, MultiLing summarisation pilot, TAC, Maryland, USA.

  • Fattah, M., & Ren, F. (2008). Automatic text summarization. Proceedings of World Academy of Science, 27, 192–195.

  • Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database (Language, Speech, and Communication), illustrated edition. Cambridge: The MIT Press.

  • Foster, I., Kesselman, C., Nick, J., & Tuecke, S. (2002). Grid services for distributed system integration. Computer, 35(6), 37–46.

  • Fukumoto, F., Sakai, A., & Suzuki, Y. (2010). Eliminating redundancy by spectral relaxation for multi-document summarization. In Proceedings of the 2010 workshop on graph-based methods for natural language processing, association for computational linguistics, Stroudsburg, PA, USA, TextGraphs-5, pp. 98–102.

  • Getao, K., & Miriti, E. (2006). Automatic construction of a Kiswahili corpus from the World Wide Web. In Measuring computing research excellence and vitality, pp. 209–219.

  • Giannakopoulos, G., & Karkaletsis, V. (2011). AutoSummENG and MeMoG in evaluating guided summaries. In The proceedings of the text analysis conference, TAC, MD, USA.

  • Giannakopoulos, G., Karkaletsis, V., Vouros, G., & Stamatopoulos, P. (2008). Summarization system evaluation revisited: N-gram graphs. ACM Transactions on Speech and Language Processing (TSLP), 5(3), 1–39.

  • Giannakopoulos, G., El-Haj, M., Favre, B., Litvak, M., Steinberger, J., & Varma, V. (2011). TAC 2011 MultiLing pilot overview. In Text analysis conference (TAC) 2011, MultiLing summarisation pilot, TAC, Maryland, USA.

  • Graff, D. (2003). Arabic Gigaword. Linguistic Data Consortium, Philadelphia. LDC catalogue number: LDC2003T12, ISBN: 1-58563-271-6.

  • Graff, D., Chen, K., Kong, J., & Maeda, K. (2006). Arabic Gigaword second edition. Linguistic Data Consortium, Philadelphia. LDC catalogue number: LDC2006T02, ISBN: 1-58563-371-2.

  • Green, N., Larasati, S., & Žabokrtský, Z. (2012). Indonesian dependency treebank: Annotation and parsing. In Proceedings of the 26th Pacific Asia conference on language, information, and computation, faculty of computer science, Universitas Indonesia, Bali, Indonesia, pp. 137–145, http://www.aclweb.org/anthology/Y12-1014.

  • Guevara, E. (2010). NoWaC: A large web-based corpus for Norwegian. In Proceedings of the NAACL HLT 2010 sixth web as corpus workshop, association for computational linguistics, Stroudsburg, PA, USA, WAC-6 ’10, pp. 1–7.

  • Habash, N., & Roth, R. (2011). Using deep morphology to improve automatic error detection in Arabic handwriting recognition. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies–Volume 1, association for computational linguistics, Stroudsburg, PA, USA, HLT ’11, pp. 875–884.

  • Haddad, B., & Yaseen, M. (2005). A compositional approach towards semantic representation and construction of Arabic. In Proceedings of the 5th international conference on logical aspects of computational linguistics, Springer-Verlag, Berlin, Heidelberg, LACL’05, pp. 147–161.

  • Hajic, J., Smrz, O., Zemanek, P., Pajas, P., Snaidauf, J., Beska, E., Kracmar, J., & Hassanova, K. (2004). Prague Arabic dependency treebank 1.0. Linguistic Data Consortium, Philadelphia. LDC catalogue number: LDC2004T23, ISBN: 1-58563-319-4.

  • Halpern, J. (2006). The contribution of lexical resources to natural language processing of CJK languages. In Q. Huo, B. Ma, E. S. Chng, H. Li (Eds.), Chinese spoken language processing, lecture notes in computer science (Vol. 4274, pp. 768–780). Berlin: Springer. doi:10.1007/11939993_77.

  • Hendrickx, I., Daelemans, W., Marsi, E., & Krahmer, E. (2009). Reducing redundancy in multi-document summarization using lexical semantic similarity. In Proceedings of the 2009 workshop on language generation and summarisation, association for computational linguistics, Stroudsburg, PA, USA, UCNLG+Sum ’09, pp. 63–66.

  • Hmeidi, I., Al-Shalabi, R., Al-Taani, A., Najadat, H., & Al-Hazaimeh, S. (2010). A novel approach to the extraction of roots from Arabic words using bigrams. Journal of the American Society for Information Science and Technology, 61(3), 583–591.

  • Howe, J. (2008). Crowdsourcing: Why the power of the crowd is driving the future of business. Crown Publishing Group.

  • Huang, S., Graff, D., & Doddington, G. (2002). Multiple-translation Chinese corpus. Linguistic Data Consortium, Philadelphia. LDC catalogue number: LDC2002T01, ISBN: 1-58563-217-1.

  • Jing, H., & McKeown, K. (1998). Combining multiple, large-scale resources in a reusable lexicon for natural language generation. In Proceedings of the 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics—volume 1, association for computational linguistics, Stroudsburg, PA, USA, ACL ’98, pp. 607–613. doi: 10.3115/980845.980946.

  • Kaisser, M., & Lowe, J. (2008). Creating a research collection of question answer sentence pairs with Amazon’s Mechanical Turk. In Proceedings of the 6th international conference on language resources and evaluation, LREC, Marrakech, Morocco.

  • Katragadda, R., Pingali, P., & Varma, V. (2009). Sentence position revisited: A robust light-weight update summarization ’baseline’ algorithm. In Proceedings of the third international workshop on cross lingual information access CLIAWS3’09, association for computational linguistics, Morristown, NJ, USA, pp. 46–52.

  • Kazai, G., Kamps, J., Koolen, M., & Milic-Frayling, N. (2011). Crowdsourcing for book search evaluation: Impact of Hit design on comparative system ranking. In Proceedings of the 34th international ACM SIGIR conference on research and development in information retrieval, ACM, New York, NY, USA, SIGIR ’11, pp. 205–214.

  • Kilgarriff, A., Charalabopoulou, F., Gavrilidou, M., Johannessen, J., Khalil, S., Johansson Kokkinakis, S., Lew, R., Sharoff, S., Vadlapudi, R., & Volodina, E. (2013). Corpus-based vocabulary lists for language learners for nine languages. Language Resources and Evaluation (LREV) pp. 1–43, doi: 10.1007/s10579-013-9251-2.

  • Kittur, A., Smus, B., Khamkar, S., & Kraut, R. (2011). CrowdForge: Crowdsourcing complex work. In Proceedings of the 24th annual ACM symposium on user interface software and technology, ACM, New York, NY, USA, UIST ’11, pp. 43–52.

  • Kozareva, Z., & Hovy, E. (2013). Tailoring the automated construction of large-scale taxonomies using the web. Language Resources and Evaluation (LREV) 47(3):859–890. doi:10.1007/s10579-013-9229-0.

  • Larkey, L., Ballesteros, L., & Connell, M. (2002). Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval, ACM, New York, NY, USA, SIGIR ’02, pp. 275–282.

  • Li, P., Zhu, Q., Qian, P., & Fox, G. (2007). Constructing a large scale text corpus based on the grid and trustworthiness. In Proceedings of the 10th international conference on text, speech and dialogue, Springer, Berlin, Heidelberg, TSD’07, pp. 56–65.

  • Lin, C. (2004). ROUGE: A package for automatic evaluation of summaries. In Proceedings of the workshop on text summarization branches out (WAS 2004), pp. 25–26.

  • Lloret, E., Plaza, L., & Aker, A. (2013). Analyzing the capabilities of crowdsourcing services for text summarization. Language Resources and Evaluation (LREV), 47(2), 337–369.

  • Luhn, H. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2), 159–165.

  • Maamouri, M., Bies, A., Jin, H., & Buckwalter, T. (2003). Arabic treebank: Part 1 v 2.0. Linguistic Data Consortium, Philadelphia. LDC catalogue number: LDC2003T06, ISBN: 1-58563-261-9.

  • Maamouri, M., Bies, A., Buckwalter, T., & Jin, H. (2004). Arabic treebank: Part 2 v 2.0. Linguistic Data Consortium, Philadelphia. LDC catalogue number: LDC2004T02, ISBN: 1-58563-282-1.

  • Maamouri, M., Bies, A., Buckwalter, T., Jin, H., & Mekki, W. (2005). Arabic treebank: Part 3 (full corpus) v 2.0 (MPG + syntactic analysis). Linguistic Data Consortium, Philadelphia. LDC catalogue number: LDC2005T20, ISBN: 1-58563-341-0.

  • Maegaard, B., Atiyya, M., Choukri, K., Krauwer, S., Mokbel, C., & Yaseen, M. (2008). MEDAR: Collaboration between European and Mediterranean Arabic partners to support the development of language technology for Arabic. In Proceedings of the 6th international conference on language resources and evaluation, LREC, Marrakech, Morocco.

  • Marcus, M., Marcinkiewicz, M., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330. http://dl.acm.org/citation.cfm?id=972470.972475.

  • Marsi, E., & Krahmer, E. (2013). Construction of an aligned monolingual treebank for studying semantic similarity. Language Resources and Evaluation (LREV) pp. 1–28, doi:10.1007/s10579-013-9252-1.

  • Mourad, A., & Darwish, K. (2013). Subjectivity and sentiment analysis of Modern Standard Arabic and Arabic microblogs. In Proceedings of the 4th workshop on computational approaches to subjectivity, sentiment and social media analysis, association for computational linguistics, Atlanta, Georgia, pp. 55–64, http://www.aclweb.org/anthology/W13-1608.

  • Nemeskey, D., & Simon, E. (2012). Automatically generated NE tagged corpora for English and Hungarian. In Proceedings of the 4th named entity workshop, association for computational linguistics, Stroudsburg, PA, USA, NEWS ’12, pp. 38–46.

  • Nenkova, A. (2005). Automatic text summarization of newswire: Lessons learned from the document understanding conference. In Proceedings of the 20th national conference on artificial intelligence—Volume 3, AAAI Press, AAAI’05, pp. 1436–1441.

  • Nenkova, A., & McKeown, K. (2012). A survey of text summarization techniques. In: C. C. Aggarwal, C. Zhai (Eds.), Mining text data, Springer, pp. 43–76. doi:10.1007/978-1-4614-3223-4_3.

  • Nganga, W. (2012). Building Swahili resource grammars for the grammatical framework. In Shall we play the Festschrift game? Berlin: Springer.

  • Nguyen, P., Vu, X., Nguyen, T., Nguyen, V., & Le, H. (2009). Building a large syntactically-annotated corpus of Vietnamese. In Proceedings of the third linguistic annotation workshop, association for computational linguistics, Stroudsburg, PA, USA, ACL-IJCNLP ’09, pp. 182–185.

  • Outahajala, M., Benajiba, Y., Rosso, P., & Zenkouar, L. (2011). POS tagging in Amazighe using support vector machines and conditional random fields. Natural Language Processing and Information Systems (pp. 238–241). Berlin: Springer.

  • Poesio, M., Chamberlain, J., Kruschwitz, U., Robaldo, L., & Ducceschi, L. (2013). Phrase detectives: Utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems, 3(1), 3:1–3:44.

  • Potthast, M., Hagen, M., Gollub, T., Tippmann, M., Kiesel, J., & Rosso, P., et al. (2013). Overview of the 5th international competition on plagiarism detection. In CLEF 2013 workshop on uncovering plagiarism, authorship, and social software Misuse (PAN-13), Valencia, Spain, pp. 1–30.

  • Prochazka, S. (2006). “Arabic”. In Encyclopedia of language and linguistics (2nd ed., Vol. 1). Elsevier.

  • Ptaszynski, M., Rzepka, R., Araki, K., & Momouchi, Y. (2012). Automatically annotating a five-billion-word corpus of Japanese blogs for affect and sentiment analysis. In Proceedings of the 3rd workshop in computational approaches to subjectivity and sentiment analysis, association for computational linguistics, Stroudsburg, PA, USA, WASSA ’12, pp. 89–98.

  • Radev, D., Jing, H., & Budzikowska, M. (2000). Centroid-based summarization of multiple documents: Sentence extraction, utility-based evaluation, and user studies. In Proceedings of the 2000 NAACL-ANLP workshop on automatic summarization—Volume 4, association for computational linguistics, Stroudsburg, PA, USA, NAACL-ANLP-AutoSum ’00, pp. 21–30.

  • Radev, D., Jing, H., Sty, M., & Tam, D. (2004). Centroid-based summarization of multiple documents. Information Processing and Management, 40, 919–938. doi:10.1016/j.ipm.2003.10.006.

  • Roberts, A., Al-Sulaiti, L., & Atwell, E. (2006). aConCorde: Towards an open-source, extendable concordancer for Arabic. Corpora, 1(1), 39–60. doi:10.3366/cor.2006.1.1.39.

  • Sarkar, K. (2009). Centroid-based summarization of multiple documents. TECHNIA: International Journal of Computing Science and Communication Technologies 2.

  • Sawalha, M., & Atwell, E. (2010). Constructing and using broad-coverage lexical resource for enhancing morphological analysis of Arabic. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, & D. Tapias (Eds.), The 7th language resources and evaluation conference (LREC 2010) (pp. 282–287). Valletta, Malta.

  • Sawalha, M., & Atwell, E. (2010). Fine-grain morphological analyzer and part-of-speech tagger for Arabic text. In The 7th language resources and evaluation conference (LREC 2010) (pp. 1258–1265). Valletta, Malta.

  • Schalley, A. (2012). Ontology and the lexicon: A natural language processing perspective (Studies in Natural Language Processing). Language Resources and Evaluation (LREV), 46(1), 95–100. doi:10.1007/s10579-011-9138-z.

  • Sekine, S., & Nobata, C. (2003). A survey for multi-document summarization. In Proceedings of the HLT-NAACL 03 on text summarization workshop—Volume 5, association for computational linguistics, Stroudsburg, PA, USA, HLT-NAACL-DUC ’03, pp. 65–72.

  • Smrž, O. (2007). ElixirFM: Implementation of functional Arabic morphology. In Proceedings of the 2007 workshop on computational approaches to semitic languages: Common issues and resources, association for computational linguistics, Stroudsburg, PA, USA, Semitic ’07, pp. 1–8.

  • Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. (2008). Cheap and fast–but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 conference on empirical methods in natural language processing, association for computational linguistics, pp. 254–263.

  • Walther, G., & Sagot, B. (2010). Developing a large-scale lexicon for a less-resourced language: General methodology and preliminary experiments on Sorani Kurdish. In Proceedings of the 7th SaLTMiL workshop on creation and use of basic lexical resources for less-resourced languages (LREC 2010 workshop), Valletta, Malta.

  • Wang, D., & Li, T. (2012). Weighted consensus multi-document summarization. Information Processing and Management, 48(3), 513–523.

  • Wang, S., Li, W., Wang, F., & Deng, H. (2010). A survey on automatic summarization. In 2010 International forum on information technology and applications (IFITA) (Vol. 1, pp. 193–196).

  • Wilks, Y., Fass, D., Guo, C., McDonald, J., Plate, T., & Slator, B. (1988). Machine tractable dictionaries as tools and resources for natural language processing. In Proceedings of the 12th conference on computational linguistics–Volume 2, association for computational linguistics, Stroudsburg, PA, USA, COLING ’88, pp. 750–755, doi:10.3115/991719.991789.

  • Yang, Y., Bansal, N., Dakka, W., Ipeirotis, P., Koudas, N., & Papadias, D. (2009). Query by document. In Proceedings of the second ACM international conference on web search and data mining, ACM, New York, NY, USA, WSDM ’09, pp. 34–43.

  • Yaseen, M., & Theophilopoulos, N. (2001). NAPLUS: Natural Arabic processing for language understanding systems.

  • Yeh, J., Ke, H., & Yang, W. (2008). iSpreadRank: Ranking sentences for extraction-based summarization using feature weight propagation in the sentence similarity network. Expert Systems with Applications, 35(3), 1451–1462.

  • Meyer zu Eissen, S., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H. Lenz (Eds.), Advances in data analysis (Proceedings of the 30th annual conference of the Gesellschaft für Klassifikation e.V., Freie Universität Berlin, March 8–10, 2006), Springer, Berlin, Heidelberg, Studies in classification, data analysis, and knowledge organization, pp. 359–366.

Author information

Corresponding author

Correspondence to Chris Fox.

Additional information

The current paper describes and extends the resource creation activities and evaluations that underpinned experiments and findings that have previously appeared as an LREC workshop paper (El-Haj et al. 2010), a student conference paper (El-Haj et al. 2011b), and a description of a multilingual summarisation pilot (El-Haj et al. 2011c; Giannakopoulos et al. 2011).

Appendices

Appendix 1: EASC corpus guidelines appendix

1.1 Creating EASC corpus guidelines

The Mechanical Turk workers were given the following guidelines for completing the task of creating the single-document summaries corpus (Sect. 3).

  1. Read the Arabic sentences (the document).

  2. In the first text box below, type the numbers of the sentences that you think focus on the main idea of the document.

  3. The number of selected sentences should not exceed 50% of the article's sentences; for example, if there are 5 sentences you can only choose up to 2 sentences.

  4. Just add the numbers of the sentences; please do not write text in the first text box. For example: (1,2,4).

  5. In the second text box, write down one to three keywords that represent the main idea of the document, or keywords that could be used as a title for the article; do not exceed 3 keywords.

  6. If you have any comments, please add them to the third text box below.

  7. Failing to follow the guidelines correctly could lead to task rejection.

  8. NOTE: The article chosen for this task was selected randomly from the Internet. The purpose of this task is purely educational and does not reflect, support or contradict any opinion or point of view.

Fig. 5: EASC MTurk HIT example

Figure 5 shows an example of one of the HITs provided to the workers on the Mechanical Turk website. We chose a free-text answer field (rather than a checkbox or radio button) in order to reduce noise and to track spammers. A design based on, for example, radio buttons cannot distinguish spammers from workers who merely produce noise (a wrong selection that is not the result of random clicking, as it is with spammers): when a radio button is selected, it is not clear whether the worker chose it because they thought it was the correct answer or simply at random. Forcing workers to write down the answer makes the distinction possible: a spammer's answers tend to consist of random characters and/or numbers, whereas a noisy answer is typically close to the right one. For example, if a worker wrote "two" instead of "2", we would still consider this a valid answer.
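
The following sketch illustrates this distinction (a hypothetical post-processing step of our own, not the actual quality-control code used for EASC; the number-word mapping is an illustrative assumption): free-text answers are normalised, number words are accepted as recoverable noise, and unrecognisable tokens are flagged as likely spam.

    import re

    # Number words a well-meaning worker might type instead of digits
    # (hypothetical, non-exhaustive mapping for illustration only).
    NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

    def classify_answer(raw):
        """Classify a free-text answer: return ('valid', sentence_numbers)
        if every token is a digit or a known number word, otherwise
        ('spam', None) because the answer looks like random characters."""
        numbers = []
        for token in re.split(r"[,\s]+", raw.strip().lower()):
            if not token:
                continue
            if token.isdigit():
                numbers.append(int(token))
            elif token in NUMBER_WORDS:   # noise, but recoverable
                numbers.append(NUMBER_WORDS[token])
            else:                         # unrecognisable token: likely spam
                return "spam", None
        return "valid", numbers

    print(classify_answer("1, two, 4"))   # ('valid', [1, 2, 4])
    print(classify_answer("xk7 qq"))      # ('spam', None)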

Appendix 2: TAC-2011 dataset guidelines appendix

1.1 Creating TAC-2011 dataset guidelines

Participants were given the following task guidelines for creating the manual corpus for the TAC-2011 MultiLing Pilot:

  1. Translation: Given the source language text A, the translator is requested to translate each sentence in A into the target language. Each target sentence should keep the meaning of the corresponding source sentence. The resulting text should be a UTF-8 encoded plain text file named A.[lang], where [lang] is replaced by the target language. For each text the following checklist should be followed:

    • The translator notes down the starting time for the reading step.

    • The translator reads the source text at least once to gain an understanding of it.

    • The translator notes down the starting time for the translation step.

    • The translator performs the translation.

    • The translator notes down the finishing time for the translation step.

  2. Translation validation: After the completion of each translation, another translator (the "validator") should verify the correctness of the output. If errors are found, the validator is to perform any corrections and finalise the translation. For each text the following checklist should be followed:

    • The validator notes down the starting time for the verification step.

    • The validator reads the translation, verifies the text, and performs any corrections needed.

    • The validator notes down the finishing time for the verification step.

  3. Summarisation: The summariser will read the whole set of texts at least once, then compose a summary with a minimum size of 240 and a maximum size of 250 words. The summary should be in the same language as the texts in the set. The aim is to create a summary that covers all the major points of the document set (what counts as major is left to the summariser's discretion). The summary should be written using fluent, easily readable language, with no formatting or other markup included in the text. The output summary should be a self-sufficient, clearly written text, providing no information other than what is included in the source documents. For each document set the following checklist should be followed:

    • The summariser notes down the starting time for the reading step.

    • The summariser reads the whole set of texts in the document set at least once, to gain an overall understanding of the event(s) described.

    • The summariser notes down the starting time for the summarisation step.

    • The summariser writes the summary, reviewing the source texts if required.

    • The summariser notes down the end time for the summarisation step.

  4. Evaluation: Each summary will be graded by 3 evaluators. If the summarisers are used as evaluators, no self-evaluation is allowed. Evaluators read each translated document set at least once, then read the summary they are to evaluate and grade it. Each summary is assigned an integer grade from 1 to 5, reflecting the overall responsiveness of the summary. We consider a text to be worth a 5 if it appears to cover all the important aspects of the corresponding document set using fluent, readable language, and a 1 if it is either unreadable, nonsensical, or contains only trivial information from the document set. We consider the content and the quality of the language to be equally important in the grading.
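
As a minimal sketch of these grading constraints (the data layout and helper names are illustrative assumptions of ours, not part of the MultiLing tooling): evaluators are assigned so that no summary is graded by its own author, and overall responsiveness is the mean of the three integer grades.

    import random

    def assign_evaluators(author, evaluators, k=3, seed=0):
        """Choose k evaluators for a summary, excluding its author
        (the guidelines forbid self-evaluation)."""
        pool = [e for e in evaluators if e != author]
        return random.Random(seed).sample(pool, k)

    def overall_responsiveness(grades):
        """Mean of the integer grades (1-5) assigned by the evaluators."""
        assert all(isinstance(g, int) and 1 <= g <= 5 for g in grades)
        return sum(grades) / len(grades)

    evaluators = ["E1", "E2", "E3", "E4"]
    print(assign_evaluators("E2", evaluators))   # three evaluators, never E2
    print(overall_responsiveness([4, 5, 3]))     # 4.0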

Appendix 3: Evaluating Arabic summaries guidelines

1.1 Evaluating EASC corpus summaries guidelines

  1. Read a document at least once.

  2. In the Evaluation form, note the time (in minutes) it took you to read the document. Figure 6 shows a sample of the form provided to the evaluators.

  3. Read the document's summary you are to evaluate.

  4. In the Evaluation form, grade the summary with a value between 1 and 5.

    • Each summary is to be assigned an integer grade from 1 to 5, reflecting the overall responsiveness of the summary.

    • A summary is worth a 5 if it appears to cover all the important aspects of the corresponding document using fluent, readable language.

    • A summary is worth a 1 if it is either unreadable, nonsensical, or contains only trivial information from the document.

    • We consider the content and the quality of the language to be equally important in the grading.

Fig. 6: Human experts' evaluation form sample

Cite this article

El-Haj, M., Kruschwitz, U. & Fox, C. Creating language resources for under-resourced languages: methodologies, and experiments with Arabic. Lang Resources & Evaluation 49, 549–580 (2015). https://doi.org/10.1007/s10579-014-9274-3
