ARARSS: A System for Constructing and Updating Arabic Textual Resources

  • Abdulmohsen Al-Thubaity
  • Muneera Alhoshan
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 845)

Abstract

The growing volume of electronically readable Arabic content on the web is a rich source from which to build new corpora or update existing ones. Such corpora benefit Arabic corpus linguistics, computational linguistics, and natural language processing. In this paper, we present ARARSS, a tool that automatically constructs and updates textual corpora from Rich Site Summary (RSS) feeds. ARARSS collects texts, categorized according to user needs, together with their metadata (for example, location, time, and topic) as provided by the RSS sources. We used ARARSS to construct a Modern Standard Arabic corpus comprising 117,819 texts and more than 28 million words. ARARSS is open source and freely available for download (http://corpus.kacst.edu.sa/more_info.jsp), along with the constructed corpus.
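
To illustrate the general idea of collecting categorized texts and their metadata from an RSS feed, here is a minimal Python sketch. It is not the ARARSS implementation, which this abstract does not describe in detail. The sample feed, the function names parse_rss_items and group_by_category, and the choice of metadata fields are all assumptions made for illustration; the sketch assumes a plain RSS 2.0 feed and uses only the standard library.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical RSS 2.0 sample; a real feed would be downloaded from a news site.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>خبر اقتصادي</title>
      <link>http://example.com/news/1</link>
      <category>economy</category>
      <pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate>
      <description>نص موجز للخبر الأول</description>
    </item>
    <item>
      <title>خبر رياضي</title>
      <link>http://example.com/news/2</link>
      <category>sports</category>
      <pubDate>Tue, 02 Jan 2018 12:30:00 GMT</pubDate>
      <description>نص موجز للخبر الثاني</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item>, holding its text and RSS metadata."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "category": item.findtext("category", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
            "summary": item.findtext("description", default="").strip(),
        })
    return items


def group_by_category(items):
    """Group parsed items by their RSS category (assumed to mark the topic)."""
    grouped = defaultdict(list)
    for entry in items:
        grouped[entry["category"] or "uncategorized"].append(entry)
    return dict(grouped)


if __name__ == "__main__":
    corpus = group_by_category(parse_rss_items(SAMPLE_FEED))
    for category, entries in corpus.items():
        print(category, len(entries))
```

A real collector would also need to fetch each item's link and extract the full article text, since RSS descriptions are often only short summaries; that step is omitted here.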

Keywords

Language resources; Natural language processing; Computational linguistics; Corpus linguistics; Arabic corpora

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. The National Center for Artificial Intelligence and Big Data, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia
