Building, Profiling, Analysing and Publishing an Arabic News Corpus Based on Google News RSS Feeds

  • Conference paper
Information Retrieval Technology (AIRS 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8281)

Abstract

This paper describes in detail the design, composition and documentation of a new Arabic News Corpus (ArNeCo). We used RSS feeds from Google News as a large source of article titles and crawled the web to extract the corresponding article text. About 11,000 documents, containing more than 6 million words, were each tagged as belonging to one of six domains: Business, Entertainment, Health, Science-Technology, Sports, and World. Metadata has been added to the corpus as a whole and to each domain independently. ArNeCo has been analysed to confirm that it is of considerable quality and size, and has been published on the Internet for research purposes. This article aims to help potential users of ArNeCo understand the nature of the corpus and use it in information retrieval research, for example when formulating queries, justifying decisions or interpreting results. Besides the corpus, this article presents a method for developing corpora that keep track of recent natural language texts posted on the Internet by using RSS feeds.
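
As an illustration of the first collection step described above (reading article titles and links from a per-domain RSS feed), the following is a minimal sketch and not the author's implementation. The function name `parse_rss_items`, the `FeedItem` record, the sample feed and its `example.com` links are all hypothetical; only the six domain names come from the abstract. The feed is parsed from an in-memory string, so fetching the feed and crawling the linked pages for article text are left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# The six domains named in the abstract.
DOMAINS = ("Business", "Entertainment", "Health",
           "Science-Technology", "Sports", "World")


@dataclass
class FeedItem:
    """One RSS entry, tagged with the domain of the feed it came from (hypothetical record)."""
    domain: str
    title: str
    link: str


def parse_rss_items(rss_xml: str, domain: str) -> List[FeedItem]:
    """Extract (title, link) pairs from RSS 2.0 text and tag them with `domain`."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        # Skip entries with no title or no link: there is nothing to crawl later.
        if title and link:
            items.append(FeedItem(domain, title, link))
    return items


# Hypothetical sample feed; the third entry has no link and is skipped.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Sample sports feed</title>
  <item><title>Match report</title><link>https://example.com/a1</link></item>
  <item><title>Transfer news</title><link>https://example.com/a2</link></item>
  <item><title>Entry without a link</title></item>
</channel></rss>"""

if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_RSS, "Sports"):
        print(entry.domain, entry.title, entry.link)
```

In a full pipeline, each `link` would then be fetched and its article text extracted, and the domain label would be carried over to the resulting document.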

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Alzahrani, S.M. (2013). Building, Profiling, Analysing and Publishing an Arabic News Corpus Based on Google News RSS Feeds. In: Banchs, R.E., Silvestri, F., Liu, TY., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_42

  • DOI: https://doi.org/10.1007/978-3-642-45068-6_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-45067-9

  • Online ISBN: 978-3-642-45068-6

  • eBook Packages: Computer Science, Computer Science (R0)
