Language Resources and Evaluation, Volume 49, Issue 3, pp 549–580

Creating language resources for under-resourced languages: methodologies, and experiments with Arabic

  • Mahmoud El-Haj
  • Udo Kruschwitz
  • Chris Fox
Original Paper

Abstract

Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.

Keywords

Resources, Summarisation, Arabic, Under-resourced languages

Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  1. School of Computing and Communications, Lancaster University, Lancaster, UK
  2. CSEE, University of Essex, Colchester, UK