Abstract
This paper describes in detail the design, composition and documentation of a new Arabic News Corpus (ArNeCo). We used RSS feeds from Google News as a large source of article titles, and crawled the web to extract the article text. About 11,000 documents containing more than 6 million words were tagged as belonging to one of six domains: Business, Entertainment, Health, Science-Technology, Sports, and World. Metadata was added to the corpus as a whole and to each domain independently. The corpus was analysed to confirm that it is adequate in both quality and size, and it has been published on the Internet for research purposes. This article aims to help potential users of ArNeCo understand the nature of the corpus and to support information retrieval research, for example in formulating queries, justifying decisions, or interpreting results. Besides the corpus, the article presents a method for developing corpora that keep track of recent natural language texts posted on the Internet by using RSS feeds.
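The RSS-based collection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' crawler: the feed content, the `parse_feed` function, and the record fields are hypothetical, and the sketch assumes a standard RSS 2.0 layout in which each `item` carries a `title` and a `link`. It only shows how feed items might be turned into domain-tagged records that a crawler could later visit to extract the full text.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed standing in for one domain feed (assumption:
# each <item> has a <title> and a <link>, as in standard RSS 2.0).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed (Sports)</title>
    <item>
      <title>Example headline one</title>
      <link>https://example.com/articles/1</link>
    </item>
    <item>
      <title>Example headline two</title>
      <link>https://example.com/articles/2</link>
    </item>
    <item>
      <title>Headline without a link</title>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text, domain):
    """Return one record per feed item, tagged with the given domain.

    Items without a link are skipped, because there would be no page
    to crawl for the article text.
    """
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if not link:
            continue
        records.append({"title": title, "link": link, "domain": domain})
    return records


records = parse_feed(SAMPLE_FEED, "Sports")
for record in records:
    print(record["domain"], record["title"], record["link"])
```

In a real pipeline, each record's link would be fetched and the article body extracted. That step is omitted here because it depends on the layout of each news site.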
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alzahrani, S.M. (2013). Building, Profiling, Analysing and Publishing an Arabic News Corpus Based on Google News RSS Feeds. In: Banchs, R.E., Silvestri, F., Liu, TY., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_42
DOI: https://doi.org/10.1007/978-3-642-45068-6_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45067-9
Online ISBN: 978-3-642-45068-6
eBook Packages: Computer Science, Computer Science (R0)