Building, Profiling, Analysing and Publishing an Arabic News Corpus Based on Google News RSS Feeds

Alzahrani, Salha M.

doi:10.1007/978-3-642-45068-6_42

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8281)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

This paper gives a detailed and explicit account of the design, composition and documentation of a new Arabic News Corpus (ArNeCo). We used RSS feeds from Google News as a large source of article titles, and crawled the web to extract the corresponding text. About 11,000 documents with more than 6 million words were tagged as belonging to one of six domains: Business, Entertainment, Health, Science-Technology, Sports, and World. Metadata has been added to the corpus as a whole and to each domain independently. The corpus has been analysed to ensure that it is of sufficient quality and size, and has been published on the Internet for research purposes. This article aims to help potential users of ArNeCo understand the nature of the corpus and carry out information retrieval research, for example in formulating queries, justifying decisions taken, or interpreting results. Besides the corpus, this article presents a method for developing corpora that keep track of recent natural language texts posted on the Internet by using RSS feeds.
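The abstract does not spell out the implementation, so the following is only a minimal sketch of the general idea of collecting article titles and links from per-domain RSS feeds before crawling the article text. The function names (`parse_feed`, `collect_entries`), the sample feed and the `example.com` URLs are hypothetical and are not taken from the paper; a real pipeline would download each feed and then fetch and clean the linked pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 sample standing in for a downloaded feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sports</title>
    <item>
      <title>خبر رياضي أول</title>
      <link>https://example.com/sports/1</link>
    </item>
    <item>
      <title>خبر رياضي ثان</title>
      <link>https://example.com/sports/2</link>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text, domain):
    """Return one record per <item> in the feed, tagged with its domain."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "domain": domain,
        })
    return records


def collect_entries(feeds):
    """Merge several feeds (domain -> RSS text), keeping each link once.

    An article that appears in more than one feed keeps the domain of the
    feed in which it was first seen (an assumption of this sketch).
    """
    seen = set()
    entries = []
    for domain, xml_text in feeds.items():
        for record in parse_feed(xml_text, domain):
            if record["link"] and record["link"] not in seen:
                seen.add(record["link"])
                entries.append(record)
    return entries


if __name__ == "__main__":
    for entry in collect_entries({"Sports": SAMPLE_RSS}):
        print(entry["domain"], entry["link"], entry["title"])
```

Running the feed collection on a schedule and deduplicating by link is one simple way such a corpus could keep up with newly posted articles; the crawling and text extraction steps are left out here.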

Author information

Authors and Affiliations

College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
Salha M. Alzahrani

Editor information

Editors and Affiliations

Institute for Infocomm Research, Human Language Technology, 1 Fusionopolis Way #21-01, Connexis South, 138632, Singapore
Rafael E. Banchs, Min Zhang & Sheng Gao
Yahoo Labs, Avinguda Diagonal 177, 08018, Barcelona, Spain
Fabrizio Silvestri
Microsoft Research Asia, No. 5, Danling Street, Haidian District, 100080, Beijing, China
Tie-Yan Liu
Institute for Infocomm Research, Human Language Technology, 1 Fusionopolis Way #21-01, Connexis South, 138632, Singapore
Jun Lang

About this paper

Cite this paper

Alzahrani, S.M. (2013). Building, Profiling, Analysing and Publishing an Arabic News Corpus Based on Google News RSS Feeds. In: Banchs, R.E., Silvestri, F., Liu, TY., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_42

DOI: https://doi.org/10.1007/978-3-642-45068-6_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45067-9
Online ISBN: 978-3-642-45068-6
eBook Packages: Computer Science, Computer Science (R0)
